New Smarts from Off-the-Shelf Parts: Modifying a Photon Simulation to Run on a Linux Cluster
Martin Robel
Office of Science, Faculty and Student Team (FaST) Program
Contra Costa College
Lawrence Berkeley National Laboratory
Berkeley, California
August 11th, 2004
Prepared in partial fulfillment of the requirements of the Office of Science, Department of Energy's Faculty and Student Team under the direction of Tom Murphy of the Contra Costa College High Performance Computing Program and Charles Verboom in the Computer Technical Support Department at Lawrence Berkeley National Labs.
Participant: _______________________________
Signature
Research Adviser: ______________________________
Signature
This work was supported by the
Director, Office of Science, Office of
Basic Energy Sciences, of the U.S. Department of Energy under Contract
No. DE-AC03-76SF00098.
Table of Contents
Abstract iii.
Introduction 1.
Materials and Methods 2.
Results 5.
Discussion and Conclusions 6.
Acknowledgments 7.
References 8.
Tables 8.
Figures 9.
ABSTRACT
Development and Deployment of High Performance Computing Using Linux Clusters. MARTIN ROBEL (Contra Costa College, San Pablo CA 94806) TOM MURPHY (Contra Costa College, San Pablo, CA 94806) CHARLIE VERBOOM (Lawrence Berkeley National Laboratory, Berkeley, CA 94720).
Cluster computing with commodity components and the Linux operating system offers a relatively low cost and readily accessible alternative to supercomputers and has the potential to significantly broaden the availability of high performance computing. Using Linux further reduces expense and provides a flexible and modular platform. To use a cluster to run an existing modeling program, the code must be modified for parallelization. Message Passing Interface (MPI) is a popular parallelizing protocol which was used for this project. The modeling code selected for parallelization, “Tiny Monte Carlo” [1], simulates heat distribution in a spherically symmetrical, isotropic material from a central point source using a Monte Carlo simulation. Changes were made to the code to incorporate MPI and make it run efficiently on a cluster. The best and easiest applications for parallelization are so called “embarrassingly parallel” problems. This Monte Carlo simulation is such an application, since the “photons” into which the total heat from the source is divided do not interact with each other. Consequently, the calculation can be simply divided up and passed out to multiple computers. Performance testing of the parallelized version of “Tiny Monte Carlo” illustrated the importance of both the total number of calculations and the cluster communication latency in selecting an appropriate sized cluster for a given problem. Specifying a photon count of 10,000 (the number specified in the original program) was fastest when two nodes shared the job, but increasing the number of nodes slowed the program execution. To capitalize on the expected speed improvement, a graphical user interface which immediately updates changes to simulation parameters was designed. Visualization was intended to be extended into a fourth dimension by rapidly recalculating and redisplaying the heat distribution upon changes to parameters. However, the effects of latency were too large to realize the anticipated speed improvement; the fastest execution times for the original number of photons run on the cluster were nearly twice as long as for the unmodified program. Hence total number of operations, amount of data transfer, cluster size, target application, and network hardware should all be considered when deciding which programs are well suited to parallelization.
INTRODUCTION
The utility of certain data intensive models is limited by the time and memory demands placed on available computing resources. Because of long execution times, the resulting cycle of input, process and display is analogous to a batch process in manufacturing. One goal of this research was to investigate the process and effectiveness of modifying an existing computer model and its interface to produce something more analogous to a continuous flow process. Vector algorithms are the most straight-forward way to translate a mathematical model into an executable computer code. But in many. cases, rudimentary knowledge of parallel programming methods and protocols can allow the researcher to modify an existing algorithm to execute in parallel on a cluster. Depending on the degree to which a given problem is parallelizable, dramatic reductions in execution time and memory demands may be achievable by running on a cluster of commodity components. By shortening execution time, rapidly refreshing graphical presentation of data upon adjustments to parameters through a graphical user interface should make it possible for the researcher to glean additional insights into the nature of the system under study from an existing model.
In this study, a compact Monte Carlo Simulation code called “Tiny Monte Carlo” [1] written in C was taken from its original, vector form and modified in stages. The Monte Carlo simulation was specifically targeted because it is well suited to parallelization and because it has a variety of engineering applications. A good example of an application for the Monte Carlo simulation is the PEREGRINE program developed at Lawrence Livermore National Lab, an advanced dose simulation model for radiation therapy applications. The graphical interface for the modified “Tiny Monte Carlo” was inspired in part by the SHODOR foundation's “function flier,” [2] a demonstration java applet which allows the user to move a slider to change the value of the input parameter and see the graph of a function immediately updated.
MATERIALS AND METHODS
To develop the code, a Micron PC with a 2.53GHz Pentium 4 processor and 256MB RAM running Linux was used. The code was compiled with the Gnu C Compiler (GCC) and the Message Passing Interface C Compiler (MPICC). The graphical user interface was built with Glade, a GTK integrated development environment. To test cluster execution, the program was run on a cluster of seven 2.53GHz Micron PC's and one 2 processor, 1.3GHz, AMD Opteron machine with 1GB of RAM using both the Bootable Cluster CD and the Warewulf Cluster Toolkit.
A crash course in Linux, primarily consisting of lab exercises was a prerequisite to this project [3]. The Redhat, cAos, Knoppix, and CentOS Linux distributions were used.
The next stage involved selecting a suitable code for the project. The goal was to find something related to Nuclear Engineering, of manageable size for the available time, and written in C or C++. The Monte Carlo simulation technique has been used in Nuclear Engineering to, for example, “model neutron behavior in ... reactor core[s]” [4] and in radiation therapy to calculate dose [5]. Scott Prahl's “Tiny Monte Carlo,” a radiative transport simulation written in C, was identified as having all three of the desired qualities. “Tiny Monte Carlo” models a “1 Watt Point Source Heating in Infinite Isotropic Scattering Medium” [1]. Spherical symmetry makes it possible to reduce this 3 dimensional problem to 2 dimensions, hence the 2 dimensional array output. Radial distance from the center of the sphere corresponds to a shell. There are 101 shells, each 50 microns thick. The total power is divided up into 10,000 virtual photons. While this does not physically represent actual photon energies, quantum mechanical photon energies are not needed for this macroscopic, stochastic model, and would introduce unnecessary and impractical complications, especially the total number of required loop execution cycles. Similarly, rather than tracking each photon until it is absorbed or escapes (hence modeling a real photon more literally), a weighting factor is applied to each photon to smear its energy over its entire path. The effect of using the weighting factor is to analyze each photon as an arbitrarily divisible packet of energy, all of which is divvied into either absorption or scattering events at each interaction with the surrounding medium (i.e. for each step, scattered + absorbed =100%). The amount of energy deposited as absorptions per step is a function of the albedo (ratio of reflected to total incident radiation), itself a function of the coefficients of absorption and scattering. Starting at the center of the sphere where the source is located, the photon is launched along one axis (randomization of launch direction is unnecessary due to spherical symmetry). With each step, the photon packet loses energy to absorption until it is eventually stopped or escapes the defined outermost shell of the model. If the packet does not escape before its weight (i.e. its size) falls below 0.001 times the original value, a roulette function either extinguishes the photon or allows it to continue another step [6]. The code was compiled using GCC (Gnu C Compiler) and executed, producing a table of heat in Joules/cm3 vs. radial distance from the source.
To prepare the code for parallelization, the structure was changed from one of multiple, nested loops in a single main function to a modular structure with a separate main and calc_heat function. For a non-expert in C programming it was found to be most effective to approach this process by first building a very rudimentary program and then stepping up the complexity until a template was produced for passing arrays between the main function and the subroutine.
Similar work was required to develop a sufficient level of competence with MPI to enable instrumenting the program to pass the same array data from the slaves to the master node in the cluster (i.e. to finish parallelization). In its parallelized form, “Tiny Monte Carlo” is divided amongst the master and slaves as follows. The slaves are assigned a fraction of the total photons to calculate using the MPI size variable. The “size” variable is an integer value equal to the total number of nodes (in this case, PC's) in the cluster. The “rank” variable is different for each node and serves as a unique identifier. The purpose of the rank variable in a cluster of computers is analogous to the purpose of a human's name on a team project; the node must know its own identity to know what portion of the work load is assigned to it. The fraction of the total photons assigned to each node is simply the total photons divided by the size variable less 1 (i.e. the number of slaves). Each slave calculates an array of heat vs. radial distance one photon at a time and then sends that data to the master using M.PI protocol. The master node sums up these arrays and writes the result to a file. After each node finishes, it must be issued the MPI_Finalize command. It was found that simply issuing this command at the end of the program was not sufficient, and produced errors which were eventually corrected by precise placement of two MPI_Finalize commands in the code, one following the slaves' MPI_Send operation, and another near the end of the main routine. It should be noted that in order to avoid complexity in this study, the master node was not given a portion of the Monte Carlo simulation to run. This, of course, has a negative impact on efficiency; the smaller the cluster, the greater the impact. Instead, the master was only given the task of adding up the heat arrays as they were reported back from the slaves.
After parallelization, the final goal of the project was to instrument the program with a graphical user interface (GUI) that would provide two key features: graphical display of the heat vs. radius array and sliders to allow the user adjust input parameters and trigger execution with each adjustment. The Glade Interface Designer GTK+ 2.0 GUI builder was selected to develop this front end. Additional C programming was required to integrate the GUI and the code. The strategy for integration of the GUI and the code was as follows. Movement of the slider triggers a system command to execute the Monte Carlo Simulation. The call to execute includes a number of parameters –the coefficient of absorption, the coefficient of scattering, the number of photons, and the number of microns per shell. Each parameter is specified by the location of a slider on the GUI. The output of the Monte Carlo program is written to a file which is then read by the graphing function within the GUI and the graph is updated, thus competing the cycle.
RESULTS
Varying the number of photons in the simulation had a key effect on execution time for a given sized cluster. Specifying 10,000 photons, the same number as in the original, non-parallelized program, produced an average execution time of 0.763 s with one slave, 0.745 s with 2 slaves, and 0.860 s with 3 slaves (see table 1). In comparison, the original vector program took an average of 0.47 s with 10,000 photons. Increasing the number of photons to 25,000 caused a shift in the optimal cluster configuration, from 2 slaves to 3 slaves. A similar shift occurs at 75,000 photons, the optimal configuration changing from 3 slaves to 4 slaves.
The GUI was left in an incomplete state at the end of the project. Modifications to the Monte Carlo program were made to take the four parameters mentioned above as arguments and write the output to a file, all without interfering with the MPI protocol. A template for producing a graph from a file containing the heat array was also produced. The front end was designed (see figure 7).
DISCUSSION AND CONCLUSIONS
Interpolation of the graphs for the cluster and the single processor code suggests a key transition occurs at around 25,000 photons (see figure 6). At this number, the single processor and the 3 node (2 slave) cluster are approximately evenly matched for speed. At any value beyond this crossover range, the cluster is faster. This result has important implications for the GUI aspect of this project. The intended purpose of the GUI was to extend the visualization of data into the time dimension by rapidly updating the graph. This requires that the parallel program shorten the minimum execution time for the program. However, because the improvements in execution speed for the parallel program occur only at higher photon counts (above 25,000), parallelization does nothing to improve visualization as envisioned with the GUI. Stated another way, the minimum execution time for the cluster is roughly twice that of a single processor running the original program at 10,000 photons. Reducing the number of photons will only widen that disparity. Alternately, increasing the number of photons will slow the total cycle down, defeating the purpose of the GUI. One clear way to shorten the cycle time without reducing resolution for this model would be to run the program on a faster vector processor machine, not on a cluster.
While it may not be possible to achieve a shorter minimum cycle time with a parallel version of “Tiny Monte Carlo”, some changes which should improve performance are as follows. The master node could be given an equal portion of the work load. Changes to the network hardware may improve the cluster execution speed.
The purpose of this project was to explore a concept, not to create a useful tool. There appears to be potential to a rapidly refreshing display of data from a model that previously only yielded batch results. There are also clear advantages to using a cluster of a particular size for a particular application; there is an optimal number of nodes for each number of photons simulated in the parallelized Monte Carlo program used for this experiment. But, as the minimum cycle times show, cluster computing has important limitations. The GUI concept in this project turns out to be a good illustration of one of those limitations.
ACKNOWLEDGEMENTS
I wish to thank the U.S. Department of Energy Office of Science and Lawrence Berkeley National Lab for creating, funding and organizing the Faculty and Student Team of which I was a part. Special thanks goes to my mentor, Charlie Verboom and FaST faculty member, Tom Murphy, without whose guidance and technical expertise I could not have completed this project. I am also grateful to Wale Soyinka for his patient assistance in all matters Linux, Scott Prahl for allowing me to use his code and answering my questions about the simulation, and my colleagues Adam Barlev, Bassam Aldhafari, and Ben Rosenberg.
REFERENCES
[1] S.A. Prahl, “Tiny Monte Carlo,” Oregon Medical Laser Center; http://omlc.ogi.edu/software/mc/tiny_mc.c.
[2] “Function Flyer,” SHODOR Education Foundation; http://www.shodor.org/interactivate/activities/flyall/.
[3] W. Soyinka, “Linux Lab Manual 1, 2” Wale Soyinka, 2004.
[4] N. Bronstein, “Neutron Particles Under Scrutiny,” Berkeley Engineering FOREFRONTS, Regents of the University of California 1994, p1; http://www.nuc.berkeley.edu/neutronics/papers/gtran2.html.
[5] PEREGRINE program, Lawrence Livermore National Lab; http://www.llnl.gov/peregrine.
[6] S. A. Prahl, “Light Transport in Tissue,” PhD Thesis, University of Texas at Austin, 1988, pp. 2-27.
Tables
Average Execution time
Nodes 104 photons 105 photons 106 photons 107 photons
2 0.763 4.949 46.945 836.152
3 0.745 2.814 23.901 390.100
4 0.860 2.332 17.534 198.473
5 0.994 2.105 13.755 138.665
6 1.166 2.086 11.072 100.446
7 1.340 2.083 9.873 84.089
Table 1. Execution time for parallelized version of “Tiny Monte Carlo” with varying numbers of nodes and varying numbers of photons.
Figures
char t1[80] = "Tiny Monte Carlo by Scott Prahl (http://omlc.ogi.edu)";
char t2[80] = "1 W Point Source Heating in Infinite Isotropic Scattering Medium";
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define SHELL_MAX 101
double mu_a = 2; /* Absorption Coefficient in 1/cm !!non-zero!! */
double mu_s = 20; /* Reduced Scattering Coefficient in 1/cm */
double microns_per_shell = 50; /* Thickness of spherical shells in microns */
long i, shell, photons = 10000;
double x, y, z, u, v, w, weight;
double albedo, shells_per_mfp, xi1, xi2, t, heat[SHELL_MAX];
int main ()
{
albedo = mu_s / (mu_s + mu_a);
shells_per_mfp = 1e4/microns_per_shell/(mu_a+mu_s);
for (i = 1; i <= photons; i++)
{
x = 0.0; y = 0.0; z = 0.0; /*launch*/
u = 0.0; v = 0.0; w = 1.0;
weight = 1.0;
for (;;) {
t = -log((rand()+1.0)/(RAND_MAX+1.0)); /*move*/
x += t * u;
y += t * v;
z += t * w;
shell=sqrt(x*x+y*y+z*z)*shells_per_mfp; /*absorb*/
if (shell > SHELL_MAX-1) shell = SHELL_MAX-1;
heat[shell] += (1.0-albedo)*weight;
weight *= albedo;
for(;;) { /*new direction*/
xi1=2.0*rand()/RAND_MAX - 1.0;
xi2=2.0*rand()/RAND_MAX - 1.0;
if ((t=xi1*xi1+xi2*xi2)<=1) break;
}
u = 2.0 * t - 1.0;
v = xi1 * sqrt((1-u*u)/t);
w = xi2 * sqrt((1-u*u)/t);
if (weight < 0.001){ /*roulette*/
if (rand() > 0.1 * RAND_MAX) break;
weight /= 0.1;
}
}
}
printf("%s\n%s\n\nScattering = %8.3f/cm\nAbsorption = %8.3f/cm\n",t1,t2,mu_s,mu_a);
printf("Photons = %8ld\n\n Radius Heat\n[microns] [W/cm^3]\n",photons);
t = 4*3.14159*pow(microns_per_shell,3)*photons/1e12;
for (i=0;i<SHELL_MAX-1;i++)
printf("%6.0f %12.5f\n",i*microns_per_shell, heat[i]/t/(i*i+i+1.0/3.0));
printf(" extra %12.5f\n",heat[SHELL_MAX-1]/photons); return 0;
}
Figure 1. Original “Tiny Monte Carlo” program.
/* Tiny Monte Carlo by Scott Prahl (http://omlc.ogi.edu), modified by Martin Robel (mrobel@berkeley.edu) and Tom Murphy (http://contracosta.edu/hpc) to take arguments for mu_a, mu_s, photons, microns_per_shell, execute on a cluster using MPI and write output to a file. Added random number seed using date in microseconds. This corrected downward drift in total heat with increasing processes.*/
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h> /*Required for seeding random number generator with the time in usec.*/
#include <mpi.h> /*this won't compile on a box without MPI*/
#define SHELL_MAX 101 /*How many spherical shells we are dividing the sphere into*/
double mu_a = 0; /*Absorbtion coefficient in inverse centimeters. Original value is 2*/
double mu_s = 0; /*Reduced scattering coefficient in inverse centimeters. Original value is 20*/
double microns_per_shell = 0; /*Thickness of spherical shells in microns. Original value is 50*/
long photons = 0; /* Original value is 10,000*/
long i,j, shell; /*i and j used for loops, shell used for distance from the center*/
FILE *fp; /*open file for writing to*/
double x, y, z, u, v, w, weight; /*These are used for calc. of location, direction, and packet size*/
double albedo, shells_per_mfp, xi1, xi2, t, heat_total[SHELL_MAX], heat[SHELL_MAX];
struct timeval tv; /*Timeval is used to get time in microsec*/
struct timezone tz; /*Timezone is not used. It is part of the gettimeofday function.*/
/*The gettimeofday function is used solely for seeding the random num. generator.*/
void calc_heat(long photons,double heat[]){
albedo = mu_s / (mu_s + mu_a);
shells_per_mfp = 1e4 / microns_per_shell / (mu_s + mu_a);
for(i=1; i<= photons; i++){
x = 0.0; y = 0.0; z = 0.0;
u = 0.0; v = 0.0; w = 1.0;
weight = 1.0;
for(;;){ /*Loop forever until break*/
t = -log((rand() + 1.0) / (RAND_MAX + 1.0)); /*Move*/
x += t * u;
y += t * v;
z += t * w;
shell = sqrt(x*x + y*y + z*z) * shells_per_mfp; /*Absorb*/
if(shell > SHELL_MAX-1) shell = SHELL_MAX-1;
heat[shell] += (1.0 - albedo) * weight;
weight *= albedo;
for(;;){ /*New direction*/
xi1 = 2.0 * rand() / RAND_MAX - 1.0;
xi2 = 2.0 * rand() / RAND_MAX - 1.0;
if ((t = xi1*xi1 + xi2*xi2) <= 1) break;
}
u = 2.0 * t - 1.0;
v = xi1 * sqrt((1 - u*u) / t);
w = xi2 * sqrt((1 - u*u) / t);
if(weight < 0.001){ /*Roulette*/
if (rand() > 0.1 * RAND_MAX) break;
weight /= 0.1;
}
}
}
}
Figure 2. First half of “Tiny Monte Carlo” program modified to take parameters, execute on a cluster using MPI and write output to a file.
int main(int argc, char**argv){
if ((fp = fopen("tmcdata","w")) == NULL) {
printf("Cry havoc and let loose the dogs of war.\n");
exit(0);
}
MPI_Init(&argc,&argv);
long photons_per_process;
char t1[80] = "Tiny Monte Carlo Modified";
char t2[80] = "1W Point Source Heating in Infinite Isotropic Scattering Medium";
int rank, size; / *Rank is the identity of the node, size is the total number of nodes.*/
sscanf(argv[1], "%lf", &mu_a); /*Takes first argument and assigns value to mu_a*/
sscanf(argv[2], "%lf", &mu_s); /*Takes second argument...*/
sscanf(argv[3], "%ld", &photons);
sscanf(argv[4], "%lf", µns_per_shell);
gettimeofday(&tv, &tz); /*Gets the present date in usec, stores it as tv */
srand(tv.tv_usec); /*Seeds the random number generator, rand(), with tv*/
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD,&rank); /*Assigns rank to each node*/
MPI_Comm_size(MPI_COMM_WORLD,&size); /*Determines cluster size*/
if(size<2){
printf("Oops, -np must be at least 2 for this to run.\n");
exit(0);
}
photons_per_process = ((photons -1)/(size-1) + 1); /*Divides the workload amongst the slaves.*/
if(rank>0){ /*i.e. if you're a slave (not the master node).*/
calc_heat(photons_per_process, heat);
MPI_Send(heat,SHELL_MAX,MPI_DOUBLE,0,0,MPI_COMM_WORLD); /*Slaves send results to master node.*/
MPI_Finalize(); /* Slaves must be told to stop! */
}
else{ /*For master node*/
for(j=0;j<=SHELL_MAX;j++){
heat_total[j] = 0; /*receive and add up the arrays,*/
}
for(i=1;i<size;i++){
MPI_Recv(heat,SHELL_MAX,MPI_DOUBLE,i,0,MPI_COMM_WORLD,&status);
for(j=0;j<=SHELL_MAX;j++){
heat_total[j] += heat[j];
}
}
fprintf(fp, "%s\n%s\n\nScattering = %8.3f/cm\nAbsorption = %8.3f/cm\n",t1,t2,mu_s,mu_a);
fprintf(fp, "Photons = %8d\n\n Radius Heat\n[microns] [W/cm^3]\n", photons);
t = 4*M_PI * pow(microns_per_shell,3) * photons / 1e12;
for(i = 0; i<SHELL_MAX-1; i++)
fprintf(fp, "%6.0f,%12.5f\n", i*microns_per_shell, heat_total[i] / t / (i*i + i + 1.0 / 3.0));
fprintf(fp, " extra, %12.5f\n", heat_total[SHELL_MAX-1]/photons);
MPI_Finalize();
return 0;
}
fclose(fp);
}
Figure 3. Second half of “Tiny Monte Carlo” program modified to take parameters, execute on a cluster using MPI and write output to a file.
![]() |
![]() |
![]() |
Figure 6. Execution time vs. number of photons for various configurations, where “Original” is the unmodified “Tiny Monte Carlo” program.

Figure 7. Graphical user interface front end design.