Parallelization of an N-body Simulator Code using the
Message Passing Interface (MPI)
Bassam A. Aldhafari
Office of Science, FaST Program
Contra Costa College
Lawrence Berkeley National Laboratory
Berkeley, California
August 11th , 2004
Prepared in partial fulfilment of the requirements of the Office of Science, Department of Energy's Faculty and Student Team under the direction of Tom Murphy of the Contra Costa College High Performance Computing Program and Charles Verboom in the Computer Technical Support Department at Lawrence Berkeley National Labs
Participant:_____________________________________
Research Advisor:_______________________________
This work was
supported by the Director, Office of Science, Office of
Basic Energy Sciences, of the U.S. Department of Energy under
Contract
No. DE-AC03-76SF00098.
Contents
Page
Abstract 3
Introduction/Background 4-5
Materials/Methods 6-8
Results 8-9
Discussion/Conclusion 9
Sources 10
Figures 11
Abstract
The project pursued is a computer programming effort, completed with a C++ code that simulates the N-body problem. A starter code of the N-body problem written by Pict Hut and Jun Makino is changed in such a way that it can be compiled and executed on a cluster of Pentium 4 processors. After a series of code transformations the code was parallelized using the Message Passing Interface (MPI) and then run on a cluster containing up to seven processors. The processing time of running the original code on a single processor and that running the parallelized code on the cluster is recorded and compared for performance.
Introduction/Background
Supercomputers are a class of the most powerful computers available at any given time. Development of supercomputers began as early as the 1970's and are in major development to the present day. Over $1 billion funds development of supercomputers yearly in the United States. These powerful machines are now capable of modeling some of the most complex problems sought out by researchers across the globe. Supercomputers are used in a wide range of areas, from modeling the explosion of a nuclear weapon to that of a Supernova in space, making them an essential tool in researching cutting edge science today. Over the years supercomputers have been put together using various techniques. The costly Earth Simulator of Japan, running at 35 Teraflops, is an example of a powerful supercomputer used today to predict earth climate changes.
In 1965 Gordon Moore, co-founder of Intel, predicted that the number of transistors per square inch on an integrated circuit (chip) would double every couple of years [3], also known today as Moore's law. The result is an exponential growth in the performance and speed of processors. The Pentium processor built in 1993 contained around 3 million transistors as opposed to todays' processors which contain well over 40 million transistors [3]. Enhancing processors results in a more powerful computational machines which use these processors.
The project pursued is a parallelization of a C++ code which initially runs on a single processor. Parallelizing the code allows more than one processor to do part of the work, which allows the division of tasks amongst the processors on the cluster. The computation to be done is a prediction of the particles future positions due to external forces, given initial conditions of mass, position, and velocity. As known, there are solutions to the one and two body problems. The study of particle motion states that a body remains in place if no net force is acting on it, and moves in the direction of the net force if it exists. The solution to the two body problem states that the acceleration of an object as produced by a net force is directly proportional to the magnitude of the net force, in the same direction as the net force, and inversely proportional to the mass of the object [8]. The N-body simulation makes a prediction on where an n number of particles would be after a given time period. It does not give a solution to the problem but an estimate on what might the solution look like. Initial conditions to the problem include the velocity, location to some reference, and mass of each of the particles. The particles in the simulation can model any two bodies that have mass and occupy space, and are used mostly to model stars and large bodies in space. Parallelization of the code enables the simulation to run on a cluster of processors thus producing results faster than when running the simulation on a single processor. Transformation of the program to run on a number of processors was done in stages, which will finally use the Message Passing Interface (MPI) to communicate between the slave nodes and the master node on the cluster. Each node on the cluster has a copy of the code and is assigned by the master node specific tasks to complete as part of the simulation. Theoretically, increasing the number of nodes (processors) would decrease the time in which the simulation is done.
Materials/Methods:
To be able to transform the N-body simulator code from running on a single processor to a number of processors, a series of code transformations were done. The starter code of the N-body problem [1], the g++ compiler running on a Redhat 9.2 Linux distribution and various cluster platforms (BCCD, Warewulf ..) were the materials used to complete the project. Using an xdiff program that compares certain changes made to the code in each transformation provides an efficient way of debugging mistakes made when changing the code. The timing software (time command from the Bootable Cluster CD) records how much time the cluster was engaged in completing the simulation. The OpenOffice Calc application was used to display the output of each code and then compare results due to changes made to the code.
The process of transformation of the code involved five stages. The first stage was working with a simple code (printing “Hello” as output) provided in the Bootable Cluster CD [2], where it was parallelized to run on a number of processors as opposed to a single one. This code is compiled and run on a single processor. The code is then parallelized using basic Message Passing Interface (MPI) functions. A simple parallelization process done on a basic code which prints hello on each node of the cluster, provided insight on parallelizing a more complex code such as that of the N-body simulator code, which was much more involved in programming.
Stage two of the process analyzes the N-body code and helped determine what task should be parallelized. A particular routine, which has the single task of doing a major calculation in the code is chosen and separated into an individual function, creating an additional function to the code. This function included changes that dealt with the types of variables that were to be passed in and out of the routine, thus creating the same result of the previous stage of the code.
Before the implementation of the Message Passing Interface functions to the code an intermediate step was taken. Because the code is difficult to debug when MPI functions are implemented to the code, testing variables replaced the two MPI functions MPI_Rank and MPI_Size. MPI_Size stores the number of processors that will be used in the cluster while MPI_Rank assigns each of the processors a rank starting with zero and ending with one less than size. The code was changed such that it sends each rank and equal amount of task to operate and returned the correct results to the main routine that is in charge of collecting the data from each rank and continuing the execution. The changed code was compiled and the output was compared to the previous using the xdiff application, resulting in the same output.
The fourth stage of transforming the code involved a number of steps. One was to have each node read in the first initial conditions independently and not receive them from the master node on the cluster. Secondly, a modification was made in order to have each node receive an updated set of data containing the position and velocity of each particle, which each node then uses to produce the accelerations, jerks, and collision times of each of the particles. Each node then returns the values to the main node to create a new set of calculations and returns a new set of position and velocity to each node to repeat the exact calculation.
The final stage of the code implements the MPI functions to the code, enabling it to run on a cluster. First of all an mpi.h header file was added to enable the use of MPI functions. MPI_Init was involved in the initialization of MPI functions to be used in the program. MPI_Rank as discussed early on, generates and sends a rank to each node on the cluster, serving as a name which the main node can define each of the nodes on the cluster. The MPI_Size function receives from the command line, the number of processors that will be operating on the cluster. Other functions such as MPI_Send and MPI_Recv were used where applicable for the transfer of data between the master node and the slave nodes on the cluster. MPI_Finalize was the last MPI function used, signaling to all nodes on the cluster to stop use of the Message Passing Interface. The results are collected as the parallelized code is run on a cluster containing up to seven processors, and that is compared with results obtained from running the original N-body starter code on a single processor system.
Results:
From the data collected it was clear that the processing time decreased when the code was run on the cluster of two processors as apposed to running on a single processor [see figures]. In general decreasing the number of particles to simulate produced a faster execution time on the single processor than it did on the cluster containing multiprocessors. As the number of particles increased, the cluster processed the data much faster than that of the single processor running the non-parallel code. The parallelized code produces shorter execution times than the that of a single processor when executed with a high number of particles.
Discussion/Conclusions:
As observed from the results it was clear that a large body problem would work well on the parallelized version of the code as compared to the starter code. Part of the parallelized code involves the passing of large amounts of data between the main node and slave nodes. With a small number of bodies, the starter code run on a single of bodies, only compiles and executes without transferring data from one node to another. On the other hand a cluster has latency problems, where data is sent across systems, and calculations are not done until all data required is collected. This is where the change in execution time is seen [figures] to decrease on a cluster when the number of bodies is large, and increase when the number of bodies is smaller (compared to that of the same number of bodies run on a single processor).
The project pursued in parallelizing code to run on a multiprocessor cluster was a process which involved analysis of the algorithm and a background in programming for a successful parallelization. Methods used in completing this project may be used in similar projects related to parallelizing code to run on a multiprocessor cluster.
Sources:
[1] Pure Gravity: Particles at play vol.1 writing an N-body code
Pict Hut and Jun Makino, 2003.
[2] Gray,Paul Gray: BCCD v 2.2
May 30, 2004
[4] http://members.forunecity.com/kokhuitan/nbody.html
[5] www.shodor.org/refdesk/resources/tutorials/runningMPI/index.php
[6] Wale Soyinka : Linux Lab Manuel
©2004
[7] Copyright © 2000, Matteo Dell'Omodarme.
Copying license http://www.linuxgazette.com/copying.html
Published in Issue 61 of Linux Gazette, January
2001
[8] http://www.glenbrook.k12.il.us/gbssci/phys/Class/newtlaws/u2l3a.html
Figures:
![]() |
Figure 2: Running 100 bodies on a number of processors
![]() |
Figure 2: Running 50 bodies on a number of processors
![]() |
![]() |