Bayesian modelling of high-throughput sequencing assays with malacoda

Andrew R Ghazi; Xianguo Kong; Ed S Chen; Leonard C Edelstein; Chad A Shaw

doi:10.1371/journal.pcbi.1007504

PLoS Comput Biol. 2020 Jul 21;16(7):e1007504. doi: 10.1371/journal.pcbi.1007504. eCollection 2020 Jul.

Authors

Andrew R Ghazi¹, Xianguo Kong², Ed S Chen³, Leonard C Edelstein², Chad A Shaw³

Affiliations

¹ Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, Texas, United States of America.
² Cardeza Foundation for Hematologic Research, Thomas Jefferson University, Philadelphia, Pennsylvania, United States of America.
³ Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America.

Abstract

Next-generation sequencing (NGS) studies have uncovered an ever-growing catalog of human variation while leaving an enormous gap between observed variation and experimental characterization of variant function. High-throughput screens powered by NGS have greatly increased the rate of variant functionalization, but the development of comprehensive statistical methods to analyze screen data has lagged. In the massively parallel reporter assay (MPRA), short barcodes are counted by sequencing the DNA libraries transfected into cells and the cells' output RNA, in order to simultaneously measure the shifts in transcription induced by thousands of genetic variants. These counts present many statistical challenges, including overdispersion, depth dependence, and uncertain DNA concentrations. So far, the statistical methods used have been rudimentary, employing transformations on count-level data and disregarding experimental and technical structure while failing to quantify uncertainty in the statistical model. We have developed an extensive framework for the analysis of NGS functionalization screens, distributed as an R package called malacoda (available from github.com/andrewGhazi/malacoda). Our software implements a probabilistic, fully Bayesian model of screen data. The model uses the negative binomial distribution with gamma priors to model sequencing counts while accounting for effects from input library preparation and sequencing depth. The method leverages the high-throughput nature of the assay to estimate the priors empirically. External annotations such as ENCODE data or DeepSEA predictions can also be incorporated to obtain more informative priors, a transformative capability for data integration. The package also includes quality control and utility functions, including automated barcode counting and visualization methods. To validate our method, we analyzed several datasets using malacoda and alternative MPRA analysis methods. These data include experiments from the literature, simulated assays, and primary MPRA data. We also used luciferase assays to experimentally validate several hits from our primary data, as well as variants for which the various methods disagree and variants detectable only with the aid of external annotations.
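
As a rough illustration of the model class described above, the following Python sketch scores barcode counts under a negative binomial likelihood with gamma priors, and fits those priors empirically from a pool of per-variant estimates. It is not the malacoda implementation (the package is written in R, and its exact parameterization is not stated in this abstract). Every function name, the mean/size parameterization, the multiplicative depth factor, and the method-of-moments prior fit are assumptions chosen for illustration.

```python
import math
import statistics


def nb_logpmf(k, mean, size):
    """Log-probability of count k under a negative binomial with the given
    mean and size parameter, so that variance = mean + mean**2 / size."""
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )


def gamma_logpdf(x, shape, rate):
    """Log-density of a gamma distribution in the shape/rate parameterization."""
    return (
        shape * math.log(rate) - math.lgamma(shape)
        + (shape - 1) * math.log(x) - rate * x
    )


def fit_gamma_moments(values):
    """Hypothetical empirical prior: match a gamma distribution to the sample
    mean and variance of crude estimates pooled across many variants."""
    m = statistics.mean(values)
    v = statistics.variance(values)
    return m * m / v, m / v  # (shape, rate)


def log_posterior(counts, depth_factors, mean, size, mean_prior, size_prior):
    """Unnormalized log-posterior for one variant's barcode counts.
    Assumption: each sample's expected count is mean * depth factor."""
    log_lik = sum(
        nb_logpmf(k, mean * d, size) for k, d in zip(counts, depth_factors)
    )
    return (
        log_lik
        + gamma_logpdf(mean, *mean_prior)
        + gamma_logpdf(size, *size_prior)
    )


if __name__ == "__main__":
    # Made-up numbers, purely to show the call pattern.
    pooled_means = [80.0, 120.0, 95.0, 150.0, 60.0]
    pooled_sizes = [3.0, 5.0, 4.0, 8.0, 2.5]
    mean_prior = fit_gamma_moments(pooled_means)
    size_prior = fit_gamma_moments(pooled_sizes)
    print(log_posterior([90, 110, 75], [1.0, 1.2, 0.8], 100.0, 4.0,
                        mean_prior, size_prior))
```

A sampler or optimizer would evaluate `log_posterior` over candidate `mean` and `size` values; the sketch stops at the density itself because the abstract does not specify the inference procedure.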

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Bayes Theorem
Computational Biology / methods*
Genetic Variation / genetics
High-Throughput Nucleotide Sequencing / methods*
Humans
Models, Statistical*
Sequence Analysis, DNA / methods*
Software*

Grants and funding

R01 HL128234/HL/NHLBI NIH HHS/United States