TY - GEN
T1 - Runtime Composition of Iterations for Fusing Loop-carried Sparse Dependence
AU - Cheshmi, Kazem
AU - Strout, Michelle
AU - Mehri Dehnavi, Maryam
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/11/12
Y1 - 2023/11/12
N2 - Dependence between iterations in sparse computations causes inefficient use of memory and computation resources. This paper proposes sparse fusion, a technique that generates efficient parallel code for the combination of two sparse matrix kernels, where at least one of the kernels has loop-carried dependencies. Existing implementations optimize individual sparse kernels separately. However, this approach leads to synchronization overheads and load imbalance due to the irregular dependence patterns of sparse kernels, as well as inefficient cache usage due to their irregular memory access patterns. Sparse fusion uses a novel inspection strategy and code transformation to generate parallel fused code optimized for data locality and load balance. Sparse fusion outperforms the best of unfused implementations using ParSy and MKL by an average of 4.2×, and is faster than the best of fused implementations using existing scheduling algorithms, such as LBC, DAGP, and wavefront, by an average of 4× for various kernel combinations.
AB - Dependence between iterations in sparse computations causes inefficient use of memory and computation resources. This paper proposes sparse fusion, a technique that generates efficient parallel code for the combination of two sparse matrix kernels, where at least one of the kernels has loop-carried dependencies. Existing implementations optimize individual sparse kernels separately. However, this approach leads to synchronization overheads and load imbalance due to the irregular dependence patterns of sparse kernels, as well as inefficient cache usage due to their irregular memory access patterns. Sparse fusion uses a novel inspection strategy and code transformation to generate parallel fused code optimized for data locality and load balance. Sparse fusion outperforms the best of unfused implementations using ParSy and MKL by an average of 4.2×, and is faster than the best of fused implementations using existing scheduling algorithms, such as LBC, DAGP, and wavefront, by an average of 4× for various kernel combinations.
UR - https://www.scopus.com/pages/publications/85179553254
UR - https://www.scopus.com/pages/publications/85179553254#tab=citedBy
U2 - 10.1145/3581784.3607097
DO - 10.1145/3581784.3607097
M3 - Conference contribution
AN - SCOPUS:85179553254
T3 - Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023
BT - Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023
PB - Association for Computing Machinery, Inc
T2 - 2023 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023
Y2 - 12 November 2023 through 17 November 2023
ER -