TY - GEN
T1 - The Case of Performance Variability on Dragonfly-based Systems
AU - Bhatele, Abhinav
AU - Thiagarajan, Jayaraman J.
AU - Groves, Taylor
AU - Anirudh, Rushil
AU - Smith, Staci A.
AU - Cook, Brandon
AU - Lowenthal, David K.
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - Performance of a parallel code running on a large supercomputer can vary significantly from one run to another even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate the case of performance variability arising due to network effects on supercomputers that use a dragonfly topology, specifically Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.
AB - Performance of a parallel code running on a large supercomputer can vary significantly from one run to another even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate the case of performance variability arising due to network effects on supercomputers that use a dragonfly topology, specifically Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.
KW - data analytics
KW - dragonfly network
KW - forecasting
KW - machine learning
KW - performance models
KW - performance variability
UR - http://www.scopus.com/inward/record.url?scp=85088893783&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088893783&partnerID=8YFLogxK
U2 - 10.1109/IPDPS47924.2020.00096
DO - 10.1109/IPDPS47924.2020.00096
M3 - Conference contribution
AN - SCOPUS:85088893783
T3 - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020
SP - 896
EP - 905
BT - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 34th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2020
Y2 - 18 May 2020 through 22 May 2020
ER -