TY - JOUR
T1 - Parallelizing heavyweight debugging tools with mpiecho
AU - Rountree, Barry
AU - Gamblin, Todd
AU - De Supinski, Bronis R.
AU - Schulz, Martin
AU - Lowenthal, David K.
AU - Cobb, Guy
AU - Tufo, Henry
N1 - Funding Information:
Copyright 2012 Elsevier. Elsevier acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the US Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. This work was partially performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. PARCO 2012.
PY - 2013
Y1 - 2013
AB - Idioms created for debugging execution on single processors and multicore systems have been successfully scaled to thousands of processors, but there is little hope that this class of techniques can continue to be scaled out to tens of millions of cores. In order to allow development of more scalable debugging idioms, we introduce mpiecho, a novel runtime platform that enables cloning of MPI ranks. Given identical execution on each clone, we then show how heavyweight debugging approaches can be parallelized, reducing their overhead to a fraction of the serialized case. We also show how this platform can be useful in isolating the source of hardware-based nondeterministic behavior and provide a case study based on a recent processor bug at LLNL. While total overhead will depend on the individual tool, we show that the platform itself contributes little: 512x tool parallelization incurs at worst 2x overhead across the NAS Parallel benchmarks, and hardware fault isolation contributes at worst an additional 44% overhead. Finally, we show how mpiecho can lead to near-linear reduction in overhead when combined with maid, a heavyweight memory tracking tool provided with Intel's pin platform. We demonstrate overhead reduction from 1466% to 53% and from 740% to 14% for cg (class D, 64 processes) and lu (class D, 64 processes), respectively, using only an additional 64 cores.
KW - Dynamic binary instrumentation
KW - Heavyweight tools
KW - MPI
UR - http://www.scopus.com/inward/record.url?scp=84875935688&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84875935688&partnerID=8YFLogxK
U2 - 10.1016/j.parco.2012.11.002
DO - 10.1016/j.parco.2012.11.002
M3 - Article
AN - SCOPUS:84875935688
SN - 0167-8191
VL - 39
SP - 156
EP - 166
JO - Parallel Computing
JF - Parallel Computing
IS - 3
ER -