TY - JOUR
T1 - Accurately selecting block size at runtime in pipelined parallel programs
AU - Lowenthal, David K.
N1 - Funding Information:
1An earlier, abbreviated version of this paper entitled ‘‘Runtime selection of block size in pipelined parallel programs,’’ appeared in the Proceedings of the Second Merged IPPS SPDP, 1999. 2This work was supported by National Science Foundation CAREER Grant CCR-9733063. 3Department of Computer Science, The University of Georgia, Athens, Georgia 30602-7404. E-mail: dkl cs.uga.edu.
PY - 2000/6
Y1 - 2000/6
N2 - Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next node in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase accuracy of the chosen block size, our execution model takes intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.
AB - Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next node in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase accuracy of the chosen block size, our execution model takes intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.
UR - http://www.scopus.com/inward/record.url?scp=0034207513&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0034207513&partnerID=8YFLogxK
U2 - 10.1023/A:1007577115980
DO - 10.1023/A:1007577115980
M3 - Article
AN - SCOPUS:0034207513
SN - 0885-7458
VL - 28
SP - 245
EP - 274
JO - International Journal of Parallel Programming
JF - International Journal of Parallel Programming
IS - 3
ER -