TY - GEN
T1 - Inter-loop optimizations in RAJA using loop chains
AU - Neth, Brandon
AU - Scogland, Thomas R.W.
AU - de Supinski, Bronis R.
AU - Strout, Michelle Mills
N1 - Publisher Copyright:
© 2021 Association for Computing Machinery.
PY - 2021/6/3
Y1 - 2021/6/3
AB - Typical parallelization approaches such as OpenMP and CUDA provide constructs for parallelizing individual loops and for blocking them to improve data locality. Because they focus on each loop separately, these approaches miss the data locality made possible by inter-loop data reuse. The loop chain abstraction provides a framework for reasoning about and applying inter-loop optimizations. In this work, we incorporate the loop chain abstraction into RAJA, a performance portability layer for high-performance computing applications. Using the loop-chain-extended RAJA, or RAJALC, developers can have the RAJA library apply loop transformations such as loop fusion and overlapped tiling while maintaining the original structure of their programs. By introducing targeted symbolic evaluation capabilities, we can collect and cache the data access information required to verify loop transformations. We evaluate the performance improvement and refactoring costs of our extension. Overall, our extension achieves 85-98% of the performance improvement of hand-optimized kernels with dramatically fewer code changes.
KW - C++
KW - Data locality
KW - Loop chains
KW - Performance portability
KW - Polyhedral analysis
KW - RAJA
KW - Symbolic execution
UR - http://www.scopus.com/inward/record.url?scp=85107509838&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107509838&partnerID=8YFLogxK
U2 - 10.1145/3447818.3461665
DO - 10.1145/3447818.3461665
M3 - Conference contribution
AN - SCOPUS:85107509838
T3 - Proceedings of the International Conference on Supercomputing
SP - 1
EP - 12
BT - ICS 2021 - Proceedings of the 2021 ACM International Conference on Supercomputing
PB - Association for Computing Machinery
T2 - 35th ACM International Conference on Supercomputing, ICS 2021
Y2 - 14 June 2021 through 17 June 2021
ER -