Incorporating group correlations in genome-wide association studies using smoothed group Lasso

Biostatistics. 2013 Apr;14(2):205-19. doi: 10.1093/biostatistics/kxs034. Epub 2012 Sep 17.

Abstract

In genome-wide association studies, penalization is an important approach for identifying genetic markers associated with disease. Motivated by the fact that single nucleotide polymorphisms have a natural grouping structure and, more importantly, that such groups are correlated, we propose a new penalization method for group variable selection that properly accommodates the correlation between adjacent groups. The method combines the group Lasso penalty with a quadratic penalty on the difference between the regression coefficients of adjacent groups. We refer to the new method as the smoothed group Lasso (SGL). It encourages group sparsity and smooths the regression coefficients of adjacent groups. Canonical correlations between groups are used as the weights in the quadratic difference penalty. We first derive a group coordinate descent (GCD) algorithm for computing the solution path under the linear regression model. The SGL method is then extended to logistic regression for binary responses. Using the majorization-minimization algorithm, SGL-penalized logistic regression reduces to an iteratively penalized least-squares problem. We also suggest conducting principal component analysis to reduce the dimensionality within groups. Simulation studies are used to evaluate finite-sample performance. Comparison with the group Lasso shows that SGL is more effective in selecting true positives. Two datasets are analyzed using the SGL method.
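To make the structure of the penalty concrete, the following is a minimal Python sketch of one plausible form of it: a group Lasso term plus a weighted quadratic term on the difference between adjacent groups. The abstract does not give the formula, so the details here are assumptions for illustration only: the function name `sgl_penalty`, the square-root group-size scaling, the use of size-scaled group norms in the difference term, and the factor of 1/2. The `weights` argument stands in for the canonical correlations between adjacent groups, which are supplied directly rather than computed.

```python
import numpy as np

def sgl_penalty(beta_groups, lam1, lam2, weights):
    """Illustrative smoothed-group-Lasso-style penalty (assumed form).

    beta_groups: list of 1-D arrays, one coefficient vector per group,
                 ordered so that neighbours in the list are adjacent groups.
    lam1, lam2:  tuning parameters for the sparsity and smoothing terms.
    weights:     one non-negative weight per adjacent pair (length J - 1),
                 e.g. a canonical correlation between the two groups.
    """
    sizes = np.array([len(b) for b in beta_groups], dtype=float)
    norms = np.array([np.linalg.norm(b) for b in beta_groups])

    # Group Lasso term: penalizes each group's Euclidean norm,
    # which encourages entire groups to be exactly zero.
    group_term = lam1 * np.sum(np.sqrt(sizes) * norms)

    # Smoothing term: quadratic penalty on the difference between the
    # size-scaled norms of adjacent groups (assumption: norms are used
    # because adjacent groups may have different sizes).
    scaled = norms / np.sqrt(sizes)
    diffs = scaled[:-1] - scaled[1:]
    smooth_term = 0.5 * lam2 * np.sum(np.asarray(weights, dtype=float) * diffs ** 2)

    return group_term + smooth_term


# Hypothetical example: three adjacent groups of sizes 2, 2 and 1.
beta = [np.array([3.0, 4.0]), np.array([0.0, 0.0]), np.array([1.0])]
total = sgl_penalty(beta, lam1=1.0, lam2=2.0, weights=[0.5, 1.0])
lasso_only = sgl_penalty(beta, lam1=1.0, lam2=0.0, weights=[0.5, 1.0])
```

Setting `lam2=0` recovers an ordinary group Lasso penalty in this sketch, and a larger weight between two groups pulls their scaled norms closer together.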

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Arthritis, Rheumatoid / genetics
  • Arthritis, Rheumatoid / immunology
  • Biostatistics
  • Data Interpretation, Statistical
  • Databases, Genetic / statistics & numerical data
  • Genome-Wide Association Study / statistics & numerical data*
  • HLA Antigens / genetics
  • Humans
  • Linear Models
  • Polymorphism, Single Nucleotide
  • Principal Component Analysis

Substances

  • HLA Antigens