TY - GEN
T1 - Jigsaw
T2 - 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021
AU - Smith, Staci A.
AU - Lowenthal, David K.
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation under Grant No. 1526015. We are also indebted to the following people for their helpful feedback: Abhinav Bhatele, Bronis de Supinski, Kate Isaacs, Nikhil Jain, Michelle Strout, Xin Yuan, and the anonymous reviewers. In addition, Stephen Herbein provided us with the Cab traces used in our experiments.
Publisher Copyright:
© 2020 ACM.
PY - 2021/6/21
Y1 - 2021/6/21
N2 - Jobs on HPC clusters can suffer significant performance degradation due to inter-job network interference. Approaches to mitigating this interference primarily focus on reactive routing schemes. A better approach, in that it completely eliminates inter-job interference, is to implement scheduling policies that proactively enforce network isolation for every job. However, existing schedulers that allocate isolated partitions lead to lowered system utilization, which creates a barrier to adoption. Accordingly, we design and implement Jigsaw, a new job-isolating scheduling approach for three-level fat-trees that overcomes this barrier. Jigsaw typically achieves system utilization of 95-96%, while guaranteeing dedicated network links to jobs. In scenarios where jobs experience even modest performance improvements from interference-freedom, Jigsaw typically leads to lower job turnaround times and higher throughput than traditional job scheduling. To the best of our knowledge, Jigsaw is the first scheduler to eliminate inter-job network interference while maintaining high system utilization, leading to improved job and system performance.
AB - Jobs on HPC clusters can suffer significant performance degradation due to inter-job network interference. Approaches to mitigating this interference primarily focus on reactive routing schemes. A better approach, in that it completely eliminates inter-job interference, is to implement scheduling policies that proactively enforce network isolation for every job. However, existing schedulers that allocate isolated partitions lead to lowered system utilization, which creates a barrier to adoption. Accordingly, we design and implement Jigsaw, a new job-isolating scheduling approach for three-level fat-trees that overcomes this barrier. Jigsaw typically achieves system utilization of 95-96%, while guaranteeing dedicated network links to jobs. In scenarios where jobs experience even modest performance improvements from interference-freedom, Jigsaw typically leads to lower job turnaround times and higher throughput than traditional job scheduling. To the best of our knowledge, Jigsaw is the first scheduler to eliminate inter-job network interference while maintaining high system utilization, leading to improved job and system performance.
KW - fat-tree
KW - inter-job network interference
KW - scheduling
KW - utilization
UR - http://www.scopus.com/inward/record.url?scp=85109520611&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85109520611&partnerID=8YFLogxK
U2 - 10.1145/3431379.3460635
DO - 10.1145/3431379.3460635
M3 - Conference contribution
AN - SCOPUS:85109520611
T3 - HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing
SP - 201
EP - 213
BT - HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
Y2 - 21 June 2021 through 25 June 2021
ER -