TY - GEN
T1 - Runtime Composition of Iterations for Fusing Loop-Carried Sparse Dependence
AU - Cheshmi, Kazem
AU - Strout, Michelle Mills
AU - Dehnavi, Maryam Mehri
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023
Y1 - 2023
N2 - Dependence between iterations in sparse computations causes inefficient use of memory and computation resources. This paper proposes sparse fusion, a technique that generates efficient parallel code for the combination of two sparse matrix kernels, where at least one of the kernels has loop-carried dependencies. Existing implementations optimize individual sparse kernels separately. However, this approach leads to synchronization overheads and load imbalance due to the irregular dependence patterns of sparse kernels, as well as inefficient cache usage due to their irregular memory access patterns. Sparse fusion uses a novel inspection strategy and code transformation to generate parallel fused code optimized for data locality and load balance. Sparse fusion outperforms the best of unfused implementations using ParSy and MKL by an average of 4.2×, and is faster than the best of fused implementations using existing scheduling algorithms, such as LBC, DAGP, and wavefront, by an average of 4× for various kernel combinations.
UR - http://www.scopus.com/inward/record.url?scp=85190417501&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85190417501&partnerID=8YFLogxK
U2 - 10.1145/3581784.3607097
DO - 10.1145/3581784.3607097
M3 - Conference contribution
AN - SCOPUS:85190417501
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - SC 2023 - International Conference for High Performance Computing, Networking, Storage and Analysis
PB - Association for Computing Machinery
T2 - 2023 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023
Y2 - 12 November 2023 through 17 November 2023
ER -