Running the Linpack Benchmark on a Commodity Cluster
Adam Barlev
Office of Science, FaST Program
Prepared in partial fulfilment of the requirements of the Office of Science, Department of Energy's Faculty and Student Team under the direction of Tom Murphy of the Contra Costa College High Performance Computing Program and Charles Verboom in the Computer Technical Support Department at Lawrence Berkeley National Labs
Participant:_____________________________________
Research Advisor:_______________________________
Research Advisor:_______________________________
This work was supported by the Director, Office
of Science, Office of
Basic Energy Sciences, of the U.S. Department of Energy under Contract
No. DE-AC03-76SF00098.
Table of Contents
|
Title Page |
1 |
|
Table of Contents |
2 |
|
Abstract |
3 |
|
Background |
4 |
|
Materials |
9 |
|
Methods |
9 |
|
Results |
11 |
|
Conclusions |
11 |
|
Acknowledgements |
13 |
|
References |
14 |
|
Appendix |
15 |
Abstract
The primary goal of the project was to achieve as high a rating as possible on the Highly Parallel LINPACK (HPL) Benchmark on a Linux cluster. The secondary goal was to look at some of the factors that affected the performance of Linux clusters. The HPL is the test that determines placement on the top 500 supercomputers list. I hypothesized that the network bandwidth would be the limiting factor as we added more nodes. The FlashMobComputing bootable cluster CD was used as a platform to perform the test. The highest rating achieved was 9.233 gflop/s on 12 nodes. This is a successful application of the same methods used on computers on the top 500 list. The network bandwidth proved to have little effect at the number of processors used. The results of these tests will give users of commodity Linux clusters a rough estimate of the performance of highly parallel applications.
This
work was supported by the Director, Office of Science, of the U.S.
Department of Energy under Contract No. DE-AC03-76SF00098.
Background
The modern LINPACK tests how long it takes a computer or cluster to solve the dense matrix equation Ax = b for the vector x, within some limits of precision. Since it takes a certain number of floating point operations to solve the problem, you can find the floating-point operations per second, or flop/s, by dividing the number of floating point operations performed by the time taken to perform them. That is,
flop/s = (number of floating point operations)/(time)
Typically, flop/s are expressed as gflop/s, or billions of flop/s. The size of the matrix, n, is varied in the test to maximize the number of gflop/s. This is essentially the structure of the LINPACK benchmark.
However, the original LINPACK was just a package of software for solving matrix equations. In the appendix of the 1979 LINPACK users guide the performance of 23 common computers were tested using a 100 by 100 dense matrix. Since then, thousands of computers were evaluated with this same test. Soon, computers got faster, and the 100 by 100 matrix got too fast to time accurately. The test had to be expanded to a 1000 by 1000 matrix. In this test, there was some flexibility in how the vector-vector operations at the heart of LINPACK were implemented, giving birth to the field of LINPACK optimization.
Theres several common strategies used to solve equations of the form
Ax = b
for a matrix of size n, the most common are Gauss-Jordan elimination and Gaussian elimination with back substitution. Lets use these common algorithms to look at some BLAS routines and see how they work. In Gaussian elimination, the first step is check if the first element of the matrix isn’t zero, then to multiply the first row by the reciprocal of the first element, then multiply it by the first element of the next row, make it negative and add it to the next row. It takes 1 floating point operation to check that the first element isn’t zero, then 3 floating point operations to determine a factor, a, that is
a = -(first element of second row/first element of first row)
LINPACK would then call the level-one BLAS operation daxpy, which stands for double-precision 'a' times the vector x plus the vector y. This call takes 2n floating point operations. Daxpy is called more times than any other routine by LINPACK. Since n is usually larger than 100, its the call to daxpy that uses the most floating point operations. The above process is repeated for the next row's second element till the matrix is in an
upper triangular form. At this point, Gaussian elimination is complete. With back substitution, the vector-vector operations are complete and the solution is obtained recursively from the last element up. This takes fewer floating point operations than Gauss -Jordan elimination since in Gauss-Jordan elimination, the solution is obtained by recursively adding the lower vectors to the ones above- using daxpy. Neither of these methods are as efficient as LU factorization.
LU factorization takes advantage of the fact that its much easier to solve a triangular matrix than a square matrix. Thus, since a matrix can be factored into an upper triangular matrix and a lower triangular matrix, the system can be solved by back-substitution, whic is a method that uses the minimum of time consuming calls to BLAS. The LINPACK benchmark spends the majority of its time on the LU factorization. This reduces the number of floating point operations significantly. Besides lowering the number of operations, the way the elements of the matrix are accessed by LU factorization is smoother and requires less repetitions and calls to memory. The factorization can proceed through the matrix on a column by column basis.
So if LU factorization is better, how is it carried out? That is, for a given matrix A, how can L and U be found to satisfy the equation LU = A? The method used is very clever. First, the matrix is ‘pivoted‘. Pivoting is a word for interchanging rows. Since all changes to matrixes are done through other matrixes, the pivoting is represented by the matrix P. In general, not every matrix can be LU factorized, but every matrix can be pivoted into a form that can. Once the matrix is pivoted, elements close to zero are
removed from the diagonal and the matrix can more accurately be factored into L and U. That is, contrary to what was said earlier, its (PA) = LU. However, P isn’t an independent matrix, its simply a re-ordering. All of P's elements are either zero or one. The next step is to set up the system of equations that finds the elements of L and U from the elements of (PA). If A is an n x n matrix, it has n2 elements. L and U each have mostly zeros. The first row of L has 1 element, the next row has 2, and the last n. U has n elements in its first row, n-1 in its next row, and 1 in its last row, so combined, L and U have n(n - 1) = n2 - n elements which are unknown. Each element of (PA) is a linear combination of unknown elements of L and U. To summarize, its a linear system of equations with n2 - n unknowns and n2 knowns. When a system is overdetermined (more knowns than unknowns) that allows you to arbitrarily define n of the unknowns. The LINPACK uses this fact to define the diagonal elements of U as 1. Since this is always the case, the diagonal elements of U aren't even stored in memory, thus saving more time. The remaining elements of L and U are computed. Now lets return to our original problem: If I've found L and U, what does that tell me about the unknown vector x? The original equation to solve is Ax = b. If the elements of b are appended as an additional column of A, the entire system can be described as the n by n + 1 augmented matrix [A, b]. When pivoted, this should match the LU decomposition's augmented matrix, or P[A, b] = [[L,U], y]. While the factorization is running, the pivoting and L are both applied to b (tranforming it into y). Thus Ux = y. Remember that since U is an upper triangular matrix, the final solution is trivial.
The method illustrated above is how the problem is solved on each individual processor. But the most powerful computers in the world aren't single processor machines, they have many nodes. The property of matrix operations that allows them to be parallelized is that matrixes can be block-partitioned. A matrix can either be thought of as a two dimensional array of elements or as a two dimensional array of other matrices, or blocks. The master node can partition a matrix into equally sized blocks, each of which can be sent to one of the processor nodes. Each node can then perform an LU factorization, and the subsequent solution isn't assembled until the last step. For different architectures, different block partitioning options are available. The processors can proceed through the solution using a choice of three algorithms, left looking, right looking or Crouts method. Once the factorization is complete, the nodes broadcast to each other. Several possible topologies exist for how the information is shared, the point being to prevent any node from having to send and receive at the same time. At this stage the LINPACK becomes a test not only of a processors speed and ability to access local memory, but the latency of the network.
By analyzing the various processes that the HPL program goes through, the time the HPL takes to run can be derived and compared to the time for the serial LINPACK. In this method, Jack Dongarra derived an expression for the parallel effeciency of the HPL program, shown in figure (1). In this expression, several observationsare significant: The problem is scalable with respect to communication volume, but as more and more nodes are added, the latency of the network becomes more and more significant, leading to a slow drain of efficiency.
To summarize, 3 factors allow a high LINPACK result. First and most obviously, the number, speed and memory of the nodes. The chip architecture is also very important for throughput, with 64 bit floating point arithematics being prefferable. Second, the bandwidth and latency of the network. Both of the above factors are usually maxed out by default on a supercomputer. The third and normally untapped resource is BLAS optimization. To address this, the HPL website recommends installing the ATLAS, a set of self-optimizing BLAS libraries. The install log for ATLAS shows it unrolling and re-rolling all the loops in BLAS trying to make them run faster. The fundamental strategy that ATLAS takes is to optimize memory use. Since most systems have a small amount of fast cache and a larger amount of slower to access RAM, ATLAS tries to fit all of the BLAS routines needed for the LINPACK into the fast cache allowing them to be accessed quickly. However, a new method has recently surfaced that has increased the efficiency of the main time consumer in the LINPACK. By decreasing the number of times the processor looks in the wrong place in mermory for the data it needs to perform the next calculation, Kazushige Goto's BLAS libraries increased the Lawrence Livermore National Laboratories rating from 5.69 Tflop/s to 7.63 Tflop/s. He was also brought in on the “Big Mac” cluster at Virginia Tech to design BLAS libraries specifically for the G5 processor. However, these libraries are so new that they weren't used on the FlashMob Computing CD, my platform for running LINPACK. Pregenerated ATLAS libraries, on the other hand, were included.
Materials
To build the commodity cluster, 8 micron pc's with 2.53ghz Intel PentiumIV processors with between 256 and 1024 Mbytes of RAM were used. On one of the micron PC's, a second network interface card was installed and this machine became the master node. The cluster was first connected with category 5 cables, a 16 port 100baseT switch, and then later a 8 port gigabit switch. To run the HPL program required copies of the FlashMobComputing bootable cluster CD. To compare, the benchmark was also run on a much more powerful system, a dual-1.1Ghz 64 bit opteron rack mount with 1gigabyte of ram and a 1024kb cache size.
Methods
Running the Benchmark with the FlashMob CD was simple. Since it boots off a CD, it can run on any PC without putting its hard drive in danger of being formatted during the Linux installation. The CD is based on a Knoppix distribution called Morphix. The CD itself is only 240 Mbytes and contains only the bare essentials to run LINPACK. Besides the Linux 2.4.25 kernel, it has the BLAS libraries generated by the Automatically Tuned Linear Algebra Subroutines (ATLAS) for the PIII, PIV and ATHLON architectures. It also has LAM-MPI. Once these were installed, the creators were able to compile the HPL program. The last step was to write commands that would configure and run the test. These commands are summarized in table (1):
First put the CD in one computer to check for compatibility and to see how one computer compared to the cluster as a whole. “Standalone” was selected from the boot menu, then when the command line appeared, I entered fm_enchilada and pressed enter at the prompts. The benchmark runs twice, one small matrix followedby by one large matrix. The results to the second test are submitted.
To run as a cluster, simply reboot the system and choose “Flashmob server on an isolated network”. Make sure all network cables are plugged in and the switch is on. Then put the rest of the CD's into the other computers, and boot them. They don’t need to have monitors or keyboards, since the default boot option is a compute node. On the master node, enter fm_enchilada and check that all the nodes are detected. If some of the nodes weren’t detected, typing fm_clean and fm_enchilada found them. There was a certain amount of variability in the results of the test. Thus it was only prudent to run it several times in a row and select the best rating.
I carried out this procedure on our own 8 node cluster when it was connected with 100baseT. I tried every number of processors from 1 to 9. Then I increased the bandwidth to 1000baseT and ran the tests again. The best single processor results were on the dual opteron rack mount. Attempts were also made to run the test on several newer Micron PC's, but due to a drive incompatability, running the bootable CD was impossible. Running on 10, 11 and 13 processors seemed impossible and would crash every time. However, results were obtained for 12 nodes.
The results of the benchmark were echo’ed to the standard output, but since a lot of information is generated with each run, some of it would inevitably scroll off the screen. To write the output to a file, a floppy was inserted into the master node. CTL and ALT were held together and pressing F2 switched to the root user. The next step was to allow the regular user to write to the floppy. The command 'chmod 777 /mnt/floppy' was entered. Then to run the tests, I had to switch back to the regular user. I held hold ctl and alt, then pressed F1. The command 'script -a /mnt/floppy/log' started the recording. Then I entered fm_enchilada and proceeded as prompted. When done running tests, I typed 'exit' To view the output, 'less /mnt/floppy/log' was entered.
Results
The first run on a single Pentium yielded 1.415 gflop/s. The results were the same on a laptop with a processor of the same speed. The dual opteron system produced 2.1gflop/s. The results of my first series of tests with 100baseT networking are in table (2). The peak of 7.184 gflop/s on 9 nodes is worth mentioning. The next series at 1000baseT is in table (3). The peak result was with 12 nodes achieving 9.233 gflop/s. Both series of data are summarized by the graph in figure (2).
Conclusions
My LINPACK results surprised me. For computers that were approaching the end of their useful life, they achieved a rating of approximately 9 gflop/s. While this isn't even close to the top 500 supercomputers list, its worth comparing it to the performance per node on some of those clusters. The 9.233 gflop/s was on 12 nodes, so thats .769 gflop/s/node. The “Big Mac” cluster at Virginia tech consisted of 1100 dual-processor machines connected with infiniband networking. It placed third on the last list. The peak performance was 17.6 tflop/s, so the flop/s per node was about 16 gflop/s/node. This efficiency was achieved with top of the line nodes, each of which had 2 processors, support from Apple, Infiniband and Cisco networking, and the optimized BLAS libraries from Kazushige Goto. Considering we were within 2 orders of magnitude, I'd call my experiment a success.
However, some of my results were quite peculiar. For example, the results for 7 nodes were significantly lower than the results for 6 nodes. When I looked at the printout from the HPL program, I realized that a 7 node cluster wasnt able to divide the matrix in 2 dimensions. Since it was divided into columns, there was more boundary between partitions, and thus more communication volume. When the 8th proscessor was brought online, the score increased again.The reason was that the matrix could be divided up with less 'surface' connecting the partitions, thus leading to less communication overhead. While this result is from the LINPACK, it applies to all cluster based applications. The more divisible (the more numbers divide evenly) the number of processors, the better the processes can be divided.
The network bandwidth proved less significant than anticipated. It did increase the results by a small percentage, but a ten-fold increase in bandwidth didnt increase the results by the same factor. Several observations explain this. The parallel efficiency expression in figure (1) contains 2 terms: one governed by the bandwidth, and one governed by the latency of the network. If the problem expands by adding new nodes, and the partition size on each node stays approximately the same, the bandwidth term remains constant. Since the FlashMob CD is designed to optimize the LINPACK results, it chose problem sizes that satisfied that requirement. Thus bandwith was already close to being used to maximum efficiency at 100baseT. The term governed by latency begins to dominate the change of the efficiency as the size is increased. However, this effect was invisible at the number of nodes used.
To summarize, there were 3 main obstacles preventing a faster rating. If more compute nodes were dedicated to this project, the rating would have been significantly higher. If the newer nodes were compatible with the flashmob CD, a much higher rating would have been reached both for individual nodes and for the cluster. I've brought this issue up with Greg Benson, one of the co-creators of the FlashMob CD, and I expect it to be solved by the next release. The last major obstacle could be overcome using the new BLAS libraries by Kazushige Goto, which outperform those made by the ATLAS.
HPL is still a good estimate of the performance of a cluster on highly parallel applications, especially those heavy on linear algebra. However, Even Jack Dongarra, the pioneer of the LINPACK benchmark, is working on a new suite of tests that more closely match the application performance of a cluster, and gets the performance results for different tasks independently. The current LINPACK benchmark is a game to get the highest number. To win the game, you need more than just the fastest computer and network, you also need to finely tune the software to run as fast is possible on your specific architecture. Without that type of customization, a high placement on the top 500 list is impossible.
Acknowledgements
I'd like to thank my FaST professors Tom Murphy and Charlie Verboom for their logistical and intellectual support. I also owe thanks to Wale Soyinka and Jake E. (Jay) Krous for their knowledge of Linux. Gary Yung was instrumental to expanding the prject beyond commodity pc's. Jack Dongarra deserves credit for his HPL program. I owe a great debt to the Flashmob Computing Project at USF, headed by Greg Benson. The last thanks go to the Department Of Energy for funding the FaST program.
References
[1] Jack J. Dongarra, Piotr Luszczek, Antoine Petitet, "The LINPACK Benchmark: Past, Present, and Future," Concurrency and Computation: Practice and Experience, vol. 15, Number 9, August 2003, pp. 803-820
[2] “The Linpack Benchmark.” http://www.top500.org/lists/linpack.php
[3] “FlashMob I: A First Step Towards Practical Instant Supercomputing.” Greg Benson, www.flashmobcomputing.org
[4] Wale Soyinka, "Linux Lab Manual." copyright 2004
[5] "High Performance BLAS by Kazushige Goto." Kazushige Goto, http://www.cs.utexas.edu/users/flame/goto
[6] “HOW Virginia Tech built a supercomputer on a shoestring budget.” Susan Trulove, http://www.research.vt.edu/resmag/2004resmag/HowX.html
Appendix
|
fm_enchilada |
Automatically configures the benchmark for the number of detected nodes, the processor and the amount of memory available. |
|
|
|
fm_burrito |
Is the same as the above command, but allows you to specify the processor type if the auto-configure gets it wrong. |
|
|
|
|
|
|
|
|
fm_clean |
Removes the file that contains the IP addresses of the slave nodes. This allows the "fm_enchilada" command to regenerate this file for a new set of nodes. Run "fm_clean" every time you want to change nodes. |
Table (1), the commands used to operate the FlashMob CD.
|
Number of nodes |
Partition method |
1st run |
2nd run |
|
1 |
1 x 1 |
1.405 |
1.406 |
|
2 |
1 x 2 |
2.146 |
2.127 |
|
3 |
1 x 3 |
2.473 |
2.491 |
|
4 |
2 x 2 |
2.896 |
2.934 |
|
5 |
1 x 5 |
3.179 |
3.283 |
|
6 |
2 x 3 |
4.606 |
4.801 |
|
7 |
1 x 7 |
4.294 |
3.990 |
|
8 |
2 x 4 |
6.289 |
6.126 |
|
9 |
3 x 3 |
7.184 |
6.967 |
Table (2), the results of the HPL program with a 100baseT network.
|
Number of nodes |
Partition method |
1st run |
2nd run |
|
1 |
1 x 1 |
1.415 |
1.416 |
|
2 |
1 x 2 |
2.351 |
2.351 |
|
3 |
1 x 3 |
2.794 |
2.791 |
|
4 |
2 x 2 |
3.960 |
3.970 |
|
5 |
1 x 5 |
3.609 |
3.617 |
|
6 |
2 x 3 |
5.378 |
5.376 |
|
7 |
1 x 7 |
4.589 |
4.535 |
|
8 |
2 x 4 |
6.424 |
6.415 |
|
9 |
3 x 3 |
6.985 |
7.017 |
Table (3), the results of the HPL program over a gigabit network.

|
|
|
|
|
|
Figure (1), the expression
for the parallel eficiency derived by reference [1]. The inputs are as follows:
Alpha and beta are the network latency and bandwidth, respectively. P and Q are
the numbers of times the matrix is divided in each dimension of the matrix,
with P being division by columns and Q division by rows. N is the order of the
matrix, NB is the size of each block the matrix is divided into. The
gamma-sub-3 represents the efficiency of the BLAS routines that operate
matrix-matrix interactions.
Figure (2), a graph of the results from tables (2) and (3)