TY - GEN
T1 - Transforming loop chains via macro dataflow graphs
AU - Davis, Eddie C.
AU - Strout, Michelle Mills
AU - Olschanowsky, Catherine
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/2/24
Y1 - 2018/2/24
N2 - This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.
AB - This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.
KW - Dataflow
KW - Loop chain
KW - Stencil
KW - Storage optimizations
UR - https://www.scopus.com/pages/publications/85052761903
UR - https://www.scopus.com/pages/publications/85052761903#tab=citedBy
U2 - 10.1145/3168832
DO - 10.1145/3168832
M3 - Conference contribution
AN - SCOPUS:85052761903
T3 - CGO 2018 - Proceedings of the 2018 International Symposium on Code Generation and Optimization
SP - 265
EP - 277
BT - CGO 2018 - Proceedings of the 2018 International Symposium on Code Generation and Optimization
PB - Association for Computing Machinery, Inc
T2 - 16th International Symposium on Code Generation and Optimization, CGO 2018
Y2 - 24 February 2018 through 28 February 2018
ER -