Integrative sparse principal component analysis of gene expression data

Genet Epidemiol. 2017 Dec;41(8):844-865. doi: 10.1002/gepi.22089. Epub 2017 Nov 8.

Abstract

In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular is perhaps principal component analysis (PCA). To generate more reliable and more interpretable results, the sparse PCA (SPCA) technique has been developed. Because gene expression data typically combine a small sample size with high dimensionality, the results generated from a single dataset are often unsatisfactory. In contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multi-dataset techniques, as well as single-dataset analysis. In this study, we conduct integrative analysis by developing the integrative SPCA (iSPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance.
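To make the idea of a group penalty on loadings concrete, the following is a minimal sketch of group soft-thresholding, a standard operator associated with group-lasso-type penalties. It is not the authors' iSPCA algorithm: the abstract does not give the objective function or the estimation procedure, and the contrasted penalties are not shown. The function name `group_soft_threshold`, the layout of `V` (one row per gene, one column per dataset), and the tuning value `lam` are assumptions made for illustration only.

```python
import numpy as np

def group_soft_threshold(V, lam):
    """Shrink each row of V toward zero by its Euclidean norm.

    V   : array of shape (n_genes, n_datasets); row j holds the loadings
          of gene j in each dataset (hypothetical layout).
    lam : non-negative threshold (hypothetical tuning parameter).

    A row whose norm is at most lam is set to zero in every dataset at
    once, so a gene is kept or dropped jointly across datasets.
    """
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    # Guard against division by zero for rows that are already all zero.
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return V * scale

# Toy loadings for three genes in two datasets (made-up numbers).
V = np.array([[3.0, 4.0],   # strong signal in both datasets
              [0.3, 0.4],   # weak signal in both datasets
              [0.0, 0.0]])  # no signal
shrunk = group_soft_threshold(V, lam=1.0)
```

In this toy input, the first gene is shrunk but retained in both datasets, while the second and third genes are zeroed out in both, which is the joint selection behaviour a group penalty is meant to encourage.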

Keywords: contrasted penalization; gene expression data; integrative analysis; sparse PCA.

MeSH terms

  • Algorithms
  • Breast Neoplasms / genetics
  • Breast Neoplasms / pathology
  • Female
  • Gene Expression Regulation, Neoplastic
  • Humans
  • Models, Genetic*
  • Pancreatic Neoplasms / genetics
  • Pancreatic Neoplasms / pathology
  • Principal Component Analysis