TY - GEN
T1 - Quantitative trait locus analysis using a partitioned linear model on a GPU cluster
AU - Bailey, Peter E.
AU - Patki, Tapasya
AU - Striemer, Gregory M.
AU - Akoglu, Ali
AU - Lowenthal, David K.
AU - Bradbury, Peter
AU - Vaughn, Matt
AU - Wang, Liya
AU - Goff, Stephen A.
PY - 2012
Y1 - 2012
N2 - Quantitative Trait Locus (QTL) analysis is a statistical technique that enables understanding of the relationship between plant genotypes and the resultant continuous phenotypes in non-constant environments. It requires generating and processing large datasets, which makes analysis challenging and slow. One approach, Partitioned Linear Modeling (PLM), which is the subject of this paper, lends itself well to parallelization, both by MPI between nodes and by GPU within nodes. Large input datasets make this parallelization on the GPU non-trivial. This paper compares several candidate integrated MPI/GPU parallel implementations of PLM on a cluster of GPUs for varied data sets. We compare them to a naive implementation and show that while that implementation is quite efficient on small data sets, when the data set is large, data-transfer overhead dominates an all-GPU implementation of PLM. We show that an MPI implementation that selectively uses the GPU for a relative minority of the code performs best and results in a 64% improvement over the MPI/CPU version. As a first implementation of PLM on GPUs, our work serves as a reminder that different GPU implementations are needed depending on the size of the working set, and that data-intensive applications are not necessarily trivially parallelizable with GPUs.
UR - http://www.scopus.com/inward/record.url?scp=84867411359&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84867411359&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2012.93
DO - 10.1109/IPDPSW.2012.93
M3 - Conference contribution
AN - SCOPUS:84867411359
SN - 9780769546766
T3 - Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012
SP - 752
EP - 760
BT - Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012
T2 - 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012
Y2 - 21 May 2012 through 25 May 2012
ER -