TY - GEN
T1 - Adaptive power reallocation for value-oriented schedulers in power-constrained HPC
AU - Kumbhare, Nirmal
AU - Marathe, Aniruddha
AU - Akoglu, Ali
AU - Hariri, Salim
AU - Abdulla, Ghaleb
N1 - Funding Information:
This work is partly supported by the National Science Foundation (NSF) research project NSF CNS-1624668. Part of this work was also performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-JRNL-780060).
Publisher Copyright:
© 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
N2 - In the exascale era, HPC systems are expected to operate under different system-wide power constraints. For such power-constrained systems, improving per-job flops-per-watt may not be sufficient to improve the total HPC productivity as a growing number of scientific applications with different compute intensities migrate to HPC systems. To measure HPC productivity for such applications, we associate a monotonically decreasing time-dependent value function, called job-value, with each application. A job-value function represents the value of completing a job for an organization. We begin by exploring the trade-off between two commonly used static power allocation strategies (uniform and greedy) in a power-constrained oversubscribed system. We simulate a large-scale system and demonstrate that, at the tightest power constraint, the greedy allocation can lead to 30% higher productivity compared to the uniform allocation, whereas the uniform allocation can gain up to 6% higher productivity at the relaxed power constraint. We then propose a new dynamic power allocation strategy that utilizes power-performance models derived from offline data. We use these models for reallocating power from running jobs to newly arrived jobs to increase overall system utilization and productivity. In our simulation study, we show that compared to static allocation, the dynamic power allocation policy improves node utilization and job completion rates by 20% and 9%, respectively, at the tightest power constraint. Our dynamic approach consistently earns up to 8% higher productivity compared to the best-performing static strategy under different power constraints.
AB - In the exascale era, HPC systems are expected to operate under different system-wide power constraints. For such power-constrained systems, improving per-job flops-per-watt may not be sufficient to improve the total HPC productivity as a growing number of scientific applications with different compute intensities migrate to HPC systems. To measure HPC productivity for such applications, we associate a monotonically decreasing time-dependent value function, called job-value, with each application. A job-value function represents the value of completing a job for an organization. We begin by exploring the trade-off between two commonly used static power allocation strategies (uniform and greedy) in a power-constrained oversubscribed system. We simulate a large-scale system and demonstrate that, at the tightest power constraint, the greedy allocation can lead to 30% higher productivity compared to the uniform allocation, whereas the uniform allocation can gain up to 6% higher productivity at the relaxed power constraint. We then propose a new dynamic power allocation strategy that utilizes power-performance models derived from offline data. We use these models for reallocating power from running jobs to newly arrived jobs to increase overall system utilization and productivity. In our simulation study, we show that compared to static allocation, the dynamic power allocation policy improves node utilization and job completion rates by 20% and 9%, respectively, at the tightest power constraint. Our dynamic approach consistently earns up to 8% higher productivity compared to the best-performing static strategy under different power constraints.
KW - Cloud computing
KW - HPC productivity
KW - High performance computing
KW - Power-aware scheduling
KW - Power-constrained computing
KW - Value heuristics
UR - http://www.scopus.com/inward/record.url?scp=85083280559&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083280559&partnerID=8YFLogxK
U2 - 10.1109/PDCAT46702.2019.00035
DO - 10.1109/PDCAT46702.2019.00035
M3 - Conference contribution
AN - SCOPUS:85083280559
T3 - Proceedings - 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019
SP - 133
EP - 139
BT - Proceedings - 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019
A2 - Tian, Hui
A2 - Shen, Hong
A2 - Tan, Wee Lum
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019
Y2 - 5 December 2019 through 7 December 2019
ER -