From pieper@phy.anl.gov Fri Mar 4 14:13:49 2005
Date: Sun, 27 Feb 2005 21:48:51 -0600
From: Steven Pieper
Reply-To: spieper@anl.gov
To: discuss@bgl.mcs.anl.gov
Subject: [bgl-discuss] Timing results for a nuclear Monte Carlo application

Timing results for an 8Li nuclear Monte Carlo (moderate-sized nucleus)
on 1/2 rack:

The size of the Monte Carlo calculation is determined by two parameters:

1) The number of samples that one starts with. These samples are
   distributed equally to the processors. Thus the memory and time
   required on each processor increases with the number of samples.
   In addition, the results tend to come back to the master in bursts
   so the number of messages that the master has to process in a burst
   also increases with the number of samples. The number of samples
   fluctuates stochastically and thus the work on each slave varies.
   In relative terms this is 1/sqrt(number-of-samples) so increasing
   the number of samples can reduce lost time due to some processors
   running out of work.

2) The number of steps done for each sample. This mainly just increases
   the total time of the calculation; so once the calculation is long
   enough to be timed, different numbers of steps can be easily compared.

The case given here was also run on Seaborg with 512 processors.
The BGL 1024-processor results are 512 nodes in vn mode.

                               Seaborg  ................... BGL ...................
number of processors               512      512     1024     1024     1024     1024
number of samples               25,000   25,000   25,000   50,000   50,000  100,000
number of steps done              1200      200      200      200      200      200
MPI_SEND or MPI_SSEND             SEND     SEND     SEND     SEND    SSEND    SSEND
speed/processor (MFLOPS)           429      266      265      265      265      265
speedup                            491      496      895      921      980      961
parallel efficiency              95.9%    96.8%    87.5%    90.0%    95.8%    94.0%
speed/wall-clock (GFLOPS)         205.     125.     221.     228.     243.     233.
Total run time (minutes)           156     45.5     26.4     51.5     48.3    100.3
Total run time for 200 steps        26     45.5     26.6     51.5     48.3    100.3

The Seaborg column can be compared directly to the 1st BGL column.
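
The load-fluctuation argument in item 1 can be made concrete for the
BGL columns of the table. The following is only a rough sketch, not
part of the original calculation: it assumes the N in 1/sqrt(N) is the
per-processor share of the samples, and it ignores that one processor
acts as master. The function name relative_fluctuation is made up for
illustration.

```python
import math

# (processors, total samples) for the five BGL columns of the table.
bgl_runs = [
    (512, 25_000),
    (1024, 25_000),
    (1024, 50_000),
    (1024, 50_000),
    (1024, 100_000),
]


def relative_fluctuation(processors, samples):
    """Return (samples per processor, 1/sqrt(samples per processor)).

    Assumption: the relevant N is the per-processor share, and all
    processors (including the master) are counted as holding samples.
    """
    per_proc = samples / processors
    return per_proc, 1.0 / math.sqrt(per_proc)


for procs, samples in bgl_runs:
    per_proc, rel = relative_fluctuation(procs, samples)
    print(f"{procs:5d} procs, {samples:7,d} samples: "
          f"{per_proc:6.1f} per proc, relative fluctuation ~{rel:.0%}")
```

Under these assumptions the per-processor share is roughly 49, 24, 49,
49 and 98 samples, giving relative fluctuations of about 14%, 20%, 14%,
14% and 10%. The column with the smallest share (1024 processors,
25,000 samples) is also the one with the lowest parallel efficiency in
the table (87.5%), which is at least consistent with the argument.
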
The remaining BGL columns show very good scaling if the number of
samples is increased with the number of processors, which is
reasonable -- otherwise the 1/sqrt(N) fluctuations in load mentioned
above get too large.

The two 50,000-sample columns show that MPI_SSEND does not hurt (I
don't know if I believe it really helps). MPI_SSEND was necessary to
run the last column, otherwise the job failed with an eager buffer
problem on the master.

The main (only) issue with getting better performance is increasing
the single-processor speed of only 265 MFLOPS. At present the
calculation is using only one of the two FPUs because every attempt to
get the compiler to use both FPUs only slowed things down. So somehow
I hope it should be possible to approximately double these speeds.

steve

- --------------------------------------------------------------------
To add or remove yourself from this mailing list, use the 'notifyme'
command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.