Prediction analysis for microbiome sequencing data

Tao Wang; Can Yang; Hongyu Zhao

doi:10.1111/biom.13061

Biometrics. 2019 Sep;75(3):875-884. doi: 10.1111/biom.13061. Epub 2019 Apr 17.

Authors

Tao Wang¹ ² ³, Can Yang⁴, Hongyu Zhao³ ⁵

Affiliations

¹ Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China.
² MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China.
³ SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China.
⁴ Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong.
⁵ Department of Biostatistics, Yale University, New Haven, Connecticut.

PMID: 30994187
DOI: 10.1111/biom.13061

Abstract

One goal of human microbiome studies is to relate host traits to human microbiome compositions. The analysis of microbial community sequencing data presents great statistical challenges, especially when the samples have different library sizes and the data are overdispersed with many zeros. To address these challenges, we introduce a new statistical framework, called predictive analysis in metagenomics via inverse regression (PAMIR), to analyze microbiome sequencing data. Within this framework, an inverse regression model is developed for overdispersed microbiota counts given the trait, and then a prediction rule is constructed by taking advantage of the dimension-reduction structure in the model. An efficient Monte Carlo expectation-maximization algorithm is proposed for maximum likelihood estimation. The method is further generalized to accommodate other types of covariates. We demonstrate the advantages of PAMIR through simulations and two real data examples.

Keywords: expectation-maximization algorithm; log ratios; metagenomic data; model-based dimension reduction; multinomial-logit regression.
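
Illustrative sketch (not the authors' code): the abstract describes an inverse regression of overdispersed counts on a trait, followed by a prediction rule built on a low-dimensional reduction. The Python example below only mimics that general idea under stated assumptions. It assumes a multinomial model whose logits are a baseline plus a trait effect plus Gaussian sample-level noise (one common way to induce overdispersion), and it reduces each sample to a single score by projecting its proportions onto the trait-effect vector. The names alpha, beta, sigma, simulate_counts, and projected_score are hypothetical, the true beta is used instead of an estimate, and no Monte Carlo expectation-maximization step is implemented.

```python
import math
import random


def simulate_counts(y, alpha, beta, sigma, library_sizes, rng):
    """Draw taxa counts given trait values y (assumed model, see lead-in)."""
    n_taxa = len(alpha)
    counts = []
    for y_i, m_i in zip(y, library_sizes):
        # Logit for each taxon: baseline + trait effect + sample-specific noise.
        eta = [a + b * y_i + rng.gauss(0.0, sigma) for a, b in zip(alpha, beta)]
        top = max(eta)  # subtract the maximum for numerical stability
        weights = [math.exp(e - top) for e in eta]
        # Multinomial draw of m_i reads, one read at a time.
        row = [0] * n_taxa
        for taxon in rng.choices(range(n_taxa), weights=weights, k=m_i):
            row[taxon] += 1
        counts.append(row)
    return counts


def projected_score(counts, beta):
    """Reduce each sample to one number: its proportions projected onto beta."""
    scores = []
    for row in counts:
        total = sum(row)  # dividing by library size removes depth differences
        scores.append(sum(b * c / total for b, c in zip(beta, row)))
    return scores


def pearson(u, v):
    """Plain Pearson correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


rng = random.Random(0)
n_samples, n_taxa = 100, 20
y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
alpha = [0.0] * n_taxa                      # equal baseline abundances (assumption)
beta = [1.0] * 5 + [-1.0] * 5 + [0.0] * 10  # 10 trait-associated taxa (assumption)
library_sizes = [rng.randint(200, 2000) for _ in range(n_samples)]

counts = simulate_counts(y, alpha, beta, 0.5, library_sizes, rng)
scores = projected_score(counts, beta)
corr = pearson(scores, y)
print(f"correlation between one-dimensional score and trait: {corr:.2f}")
```

In this toy setting the one-dimensional score tracks the trait even though the samples have very different library sizes, which is the intuition behind predicting from a reduction rather than from all taxa at once. In practice the effect vector would have to be estimated from training data.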

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Bacteria / genetics
Humans
Likelihood Functions
Microbiota / genetics*
Monte Carlo Method
Regression Analysis
Sequence Analysis*