Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods

John C Mu; Pegah Tootoonchi Afshar; Marghoob Mohiyuddin; Xi Chen; Jian Li; Narges Bani Asadi; Mark B Gerstein; Wing H Wong; Hugo Y K Lam

doi:10.1038/srep14493

Sci Rep. 2015 Sep 28:5:14493. doi: 10.1038/srep14493.

Authors

John C Mu¹, Pegah Tootoonchi Afshar², Marghoob Mohiyuddin¹, Xi Chen³, Jian Li¹, Narges Bani Asadi¹, Mark B Gerstein⁴, Wing H Wong³ ⁵, Hugo Y K Lam¹

Affiliations

¹ Bina Technologies, Roche Sequencing, Redwood City, CA 94065, USA.
² Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
³ Department of Statistics, Stanford University, Stanford, CA 94305, USA.
⁴ Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA.
⁵ Department of Health Research and Policy, Stanford University, Stanford, CA 94305, USA.

Abstract

A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.
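As an illustration of what benchmarking an SV caller against a gold set can involve, the following minimal sketch scores a set of deletion calls against gold-set deletions. It is not the authors' method: the 50% reciprocal-overlap matching rule, the function names `reciprocal_overlap` and `benchmark`, and the coordinates are all assumptions made for the example. It reports sensitivity and precision, the latter standing in for the specificity mentioned in the abstract.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open coordinates.
Interval = Tuple[str, int, int]


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions between a and b."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def benchmark(calls: List[Interval], gold: List[Interval],
              min_overlap: float = 0.5) -> Tuple[float, float]:
    """Return (sensitivity, precision) under an assumed matching threshold."""
    # A gold deletion counts as recovered if any call overlaps it reciprocally.
    matched_gold = sum(
        1 for g in gold
        if any(reciprocal_overlap(g, c) >= min_overlap for c in calls)
    )
    # A call counts as correct if it reciprocally overlaps any gold deletion.
    matched_calls = sum(
        1 for c in calls
        if any(reciprocal_overlap(c, g) >= min_overlap for g in gold)
    )
    sensitivity = matched_gold / len(gold) if gold else 0.0
    precision = matched_calls / len(calls) if calls else 0.0
    return sensitivity, precision


# Made-up coordinates, for illustration only.
gold_deletions = [("chr1", 1000, 2000), ("chr1", 5000, 6000), ("chr2", 100, 400)]
called_deletions = [("chr1", 1100, 2050), ("chr1", 9000, 9500)]

sensitivity, precision = benchmark(called_deletions, gold_deletions)
print(f"sensitivity={sensitivity:.3f} precision={precision:.3f}")
```

In this toy input, one of the three gold deletions is recovered and one of the two calls is correct. Insertion SVs have no reference span, so matching them would need a different rule (for example, breakpoint distance plus length similarity), which this sketch does not attempt.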

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Benchmarking*
Genetic Variation
Genome, Human
Genomics / methods
High-Throughput Nucleotide Sequencing / methods*
High-Throughput Nucleotide Sequencing / standards
Humans
