TY - GEN
T1 - Transforming loop chains via macro dataflow graphs
AU - Davis, Eddie C.
AU - Strout, Michelle Mills
AU - Olschanowsky, Catherine
N1 - Funding Information:
The authors would like to thank Dr. Keshav Pingali and anonymous reviewers for their invaluable feedback. We would like to thank Dr. Uday R. Bondhugula for his technical assistance with Halide and PolyMage. We would like to acknowledge high-performance computing support of the R2 compute cluster, DOI: 10.18122/B2S41H, provided by Boise State University’s Research Computing Department. This material is based upon work supported by the National Science Foundation under Grant No. 1422725 and by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Program under Award Number DE-SC-04030.
Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/2/24
Y1 - 2018/2/24
N2 - This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.
AB - This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.
KW - Dataflow
KW - Loop chain
KW - Stencil
KW - Storage optimizations
UR - http://www.scopus.com/inward/record.url?scp=85052761903&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052761903&partnerID=8YFLogxK
U2 - 10.1145/3168832
DO - 10.1145/3168832
M3 - Conference contribution
AN - SCOPUS:85052761903
T3 - CGO 2018 - Proceedings of the 2018 International Symposium on Code Generation and Optimization
SP - 265
EP - 277
BT - CGO 2018 - Proceedings of the 2018 International Symposium on Code Generation and Optimization
PB - Association for Computing Machinery, Inc
T2 - 16th International Symposium on Code Generation and Optimization, CGO 2018
Y2 - 24 February 2018 through 28 February 2018
ER -