TY - JOUR
T1 - Interpreting Supervised Machine Learning Inferences in Population Genomics Using Haplotype Matrix Permutations
AU - Tran, Linh N.
AU - Castellano, David
AU - Gutenkunst, Ryan N.
N1 - Publisher Copyright: © The Author(s) 2025. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
PY - 2025/10/1
Y1 - 2025/10/1
AB - Supervised machine learning methods, such as convolutional neural networks (CNNs), that use haplotype matrices as input data have become powerful tools for population genomics inference. However, these methods often lack interpretability, making it difficult to understand which population genetics features drive their predictions, a critical limitation for method development and biological interpretation. Here, we introduce a systematic permutation approach that progressively disrupts population genetics features within input test haplotype matrices, including linkage disequilibrium, haplotype structure, and allele frequencies. By measuring performance degradation after each permutation, we assess the importance of each feature. We applied our approach to three published CNNs for positive selection and demographic history inference. We found that the positive selection inference CNN ImaGene critically depends on haplotype structure and linkage disequilibrium patterns, while the demographic inference CNN relies primarily on allele frequency information. Surprisingly, another positive selection inference CNN, disc-pg-gan, achieved high accuracy using only simple allele count information, suggesting its training regime may not adequately challenge the model to learn complex population genetic signatures. Our approach provides a straightforward, model-agnostic, and biologically motivated framework for interpreting any haplotype matrix-based method, offering insights that can guide both method development and application in population genomics.
UR - https://www.scopus.com/pages/publications/105019989777
U2 - 10.1093/molbev/msaf250
DO - 10.1093/molbev/msaf250
M3 - Article
C2 - 41052877
AN - SCOPUS:105019989777
SN - 0737-4038
VL - 42
JO - Molecular Biology and Evolution
JF - Molecular Biology and Evolution
IS - 10
M1 - msaf250
ER -