TY - JOUR
T1 - Iterative hard thresholding in genome-wide association studies
T2 - Generalized linear models, prior weights, and double sparsity
AU - Chu, Benjamin B.
AU - Keys, Kevin L.
AU - German, Christopher A.
AU - Zhou, Hua
AU - Zhou, Jin J.
AU - Sobel, Eric M.
AU - Sinsheimer, Janet S.
AU - Lange, Kenneth
N1 - Funding Information:
The NFBC1966 Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with the Broad Institute, UCLA, University of Oulu, and the National Institute for Health and Welfare in Finland. This manuscript was not prepared in collaboration with investigators of the NFBC1966 Study and does not necessarily reflect the opinions or views of the NFBC1966 Study Investigators, Broad Institute, UCLA, University of Oulu, National Institute for Health and Welfare in Finland, and the NHLBI.
Funding Information:
B.B.C. was supported by the NIH T32-HG002536 training grant and the 2018 Google Summer of Code. K.L.K. was supported by a diversity supplement to NHLBI grant R01HL135156, the UCSF Bakar Computational Health Sciences Institute, the Gordon and Betty Moore Foundation grant GBMF3834, and the Alfred P. Sloan Foundation grant 2013-10-27 to UC Berkeley through the Moore-Sloan Data Sciences Environment initiative at the Berkeley Institute for Data Science (BIDS). E.M.S., K.L., and H.Z. were supported by grants from the National Human Genome Research Institute (HG006139) and the National Institute of General Medical Sciences (GM053275). J.S.S. was supported by grants from the National Institute of General Medical Sciences (GM053275), the National Human Genome Research Institute (HG009120), and the National Science Foundation (DMS-1264153). C.A.G. was supported by the Burroughs Wellcome Fund Inter-school Training Program in Chronic Diseases (BWF-CHIP).
Publisher Copyright:
© The Author(s) 2020. Published by Oxford University Press.
PY - 2020/6/10
Y1 - 2020/6/10
N2 - Background: Consecutive testing of single nucleotide polymorphisms (SNPs) is usually employed to identify genetic variants associated with complex traits. Ideally one should model all covariates in unison, but most existing analysis methods for genome-wide association studies (GWAS) perform only univariate regression. Results: We extend and efficiently implement iterative hard thresholding (IHT) for multiple regression, treating all SNPs simultaneously. Our extensions accommodate generalized linear models, prior information on genetic variants, and grouping of variants. In our simulations, IHT recovers up to 30% more true predictors than SNP-by-SNP association testing and exhibits a 2-3 orders of magnitude decrease in false-positive rates compared with lasso regression. We also test IHT on the UK Biobank hypertension phenotypes and the Northern Finland Birth Cohort of 1966 cardiovascular phenotypes. We find that IHT scales to the large datasets of contemporary human genetics and recovers the plausible genetic variants identified by previous studies. Conclusions: Our real data analysis and simulation studies suggest that IHT can (i) recover highly correlated predictors, (ii) avoid over-fitting, (iii) deliver better true-positive and false-positive rates than either marginal testing or lasso regression, (iv) recover unbiased regression coefficients, (v) exploit prior information and group-sparsity, and (vi) be used with biobank-sized datasets. Although these advances are studied for genome-wide association studies inference, our extensions are pertinent to other regression problems with large numbers of predictors.
KW - GWAS
KW - biobank
KW - high dimensional inference
KW - iterative hard thresholding
KW - multiple regression
UR - http://www.scopus.com/inward/record.url?scp=85085961831&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85085961831&partnerID=8YFLogxK
U2 - 10.1093/gigascience/giaa044
DO - 10.1093/gigascience/giaa044
M3 - Article
C2 - 32491161
AN - SCOPUS:85085961831
SN - 2047-217X
VL - 9
JO - GigaScience
JF - GigaScience
IS - 6
M1 - giaa044
ER -