Accurately selecting block size at runtime in pipelined parallel programs

Research output: Contribution to journal › Article › peer-review


Abstract

Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each block, it sends data to the next node in the pipeline. The amount of computation performed before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of the workload, our system takes measurements during the first two iterations of a program and uses the results to build an execution model and choose an appropriate block size, which, unlike a static choice, may be nonuniform. To increase the accuracy of the chosen block size, our execution model takes both intra-node and inter-node performance into account. Notably, our system finds an effective block size automatically, without the experimentation that is necessary when a block size is chosen statically. Performance results on a network of workstations show that programs using our runtime analysis outperform those using static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.
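The pipelining scheme described in the abstract can be made concrete with a short sketch. The C/MPI program below is an illustrative assumption, not the paper's system: it pipelines a wavefront-style DOACROSS computation in which each node receives a block of its predecessor's row, computes the corresponding block of its own row, and forwards the finished block. The names N, BLOCK, row, and prev are hypothetical, and BLOCK is fixed statically here, which is precisely the choice the paper replaces with a runtime, possibly nonuniform, selection.

/*
 * Illustrative sketch (not the paper's system): a pipelined DOACROSS
 * computation using MPI. Each rank owns one row; it computes BLOCK
 * elements at a time, then forwards the finished block to the next
 * rank so that rank's computation can begin. BLOCK is the block size
 * that the paper selects at runtime; it is hard-coded here for brevity.
 */
#include <mpi.h>
#include <stdlib.h>

#define N     1024   /* elements per row (hypothetical)        */
#define BLOCK 64     /* block size, statically chosen here     */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *row  = calloc(N, sizeof *row);  /* this node's row        */
    double *prev = calloc(N, sizeof *prev); /* predecessor's row data */

    for (int j = 0; j < N; j += BLOCK) {
        int len = (j + BLOCK <= N) ? BLOCK : N - j;

        /* Cross-processor dependence: wait for the predecessor
         * to finish this block before computing it locally.     */
        if (rank > 0)
            MPI_Recv(prev + j, len, MPI_DOUBLE, rank - 1, j,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Each element depends on the predecessor's row and on
         * the previous element of this row (a DOACROSS pattern). */
        for (int k = j; k < j + len; k++)
            row[k] = 0.5 * (prev[k] + (k ? row[k - 1] : 0.0));

        /* Forward the finished block so the successor can start. */
        if (rank < size - 1)
            MPI_Send(row + j, len, MPI_DOUBLE, rank + 1, j,
                     MPI_COMM_WORLD);
    }

    free(row);
    free(prev);
    MPI_Finalize();
    return 0;
}

A smaller BLOCK starts the successor sooner but sends more messages; a larger BLOCK amortizes message overhead but delays the pipeline. Balancing these two costs is the trade-off the paper's runtime measurements and execution model are designed to resolve.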

Original language: English (US)
Pages (from-to): 245-274
Number of pages: 30
Journal: International Journal of Parallel Programming
Volume: 28
Issue number: 3
DOIs
State: Published - Jun 2000
Externally published: Yes

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Information Systems
