Variant discovery and individual genotypes for 1092 individuals from the 1000
Genomes Project Phase 1 data release. This accession is a snapshot, as of June 26, 2013, of
the directory
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets
Individual-level genotype information is contained in a separate .vcf file for each
chromosome; the variant site information (first eight columns of .vcf format) is also
collected in a single sites-only .vcf file. The .vcf version 4.1 file format is described
at
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
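As an illustration of the relationship between the per-chromosome genotype files and the sites-only file, the following minimal Python sketch reads the eight fixed VCF 4.1 columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) from a tab-delimited record and drops the FORMAT and per-sample genotype columns. The helper names `Site`, `parse_site` and `to_sites_only` are hypothetical and chosen for this example only; this is not the procedure the project used to build its sites-only file.

```python
from typing import NamedTuple


class Site(NamedTuple):
    """The eight fixed columns of a VCF 4.1 data line."""
    chrom: str
    pos: int
    var_id: str
    ref: str
    alt: str
    qual: str
    filt: str
    info: str


def parse_site(line: str) -> Site:
    """Parse the first eight tab-separated columns of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least eight columns")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    return Site(chrom, int(pos), var_id, ref, alt, qual, filt, info)


def to_sites_only(line: str) -> str:
    """Return the line without FORMAT and per-sample genotype columns.

    Meta-information lines (starting with '##') are passed through unchanged;
    the '#CHROM' header line and data lines are truncated to eight columns.
    """
    line = line.rstrip("\n")
    if line.startswith("##"):
        return line
    return "\t".join(line.split("\t")[:8])
```

For example, `to_sites_only` applied to a genotype record keeps only the site-level fields, so the same function can be mapped over every line of a per-chromosome file to obtain sites-only output.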
Variant discovery and genotyping are based on both low coverage whole genome sequencing and
deep coverage exon capture sequencing by the 1000 Genomes Project. The Phase 1 low coverage
sequencing was done between 2008 and 2010 using Illumina (86.4%), AB SOLiD (13%) and Roche
LS 454 (0.6%) sequencing technologies, with a raw sequencing data freeze date of Nov 23,
2010. Exon capture sequencing was done in 2010 and early 2011, using either the NimbleGen
SeqCap_EZ_Exome_v2 or Agilent SureSelect_All_Exon_V2 exome capture reagents, with a freeze
date of May 21, 2011. After filtering, variant site discovery yields a total of 38 M SNPs,
1.4 M short indels and 14 K large deletions (longer than 50 bp). The short indels and large
deletions in particular have been stringently filtered to meet the project goal of a false
discovery rate of no more than 5%. All coordinates are relative to the GRCh37 human
genome reference sequence. Site discovery and genotyping for SNPs are based on both low
coverage and exome capture sequence data; short indels use only low coverage data, and
large deletions use only the 946 individuals with Illumina low coverage sequence data.
As a final step, statistical genotype imputation using the linkage disequilibrium
between nearby sites refines the individual genotypes and merges information from all three
variant types. Details of data generation and processing for this data set are given in
supplementary material to the 1000 Genomes Phase 1 paper: 1000 Genomes Project Consortium,
Goncalo R. Abecasis, Adam Auton, Lisa D. Brooks, Mark A. DePristo, Richard M. Durbin,
Robert E. Handsaker, Hyun M. Kang, Gabor T. Marth, Gil A. McVean (2012). An integrated map
of genetic variation for 1,092 human genomes. Nature, v.491, n.7422, pp.56-65, Nov 1, 2012,
PMID: 23128226, doi: 10.1038/nature11632. Further details and supporting data are in other
subdirectories under: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1
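The three variant classes described above (SNPs, short indels, and large deletions longer than 50 bp) can be told apart roughly from the REF and ALT alleles of a record. The sketch below is an assumption-laden illustration, not the project's classification procedure: it assumes a single ALT allele, treats symbolic alleles such as `<DEL>` as a separate case without inspecting the INFO column, and the function name `classify_variant` is hypothetical.

```python
# Threshold taken from the description: large deletions are "longer than 50 bp".
LARGE_DELETION_MIN_BP = 50


def classify_variant(ref: str, alt: str) -> str:
    """Roughly classify a biallelic record by allele lengths (illustrative only)."""
    # Symbolic alleles such as <DEL> carry their size elsewhere in the record.
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"
    if len(ref) == len(alt):
        # Equal lengths: a single-base substitution is a SNP; anything longer
        # is outside the three classes discussed here.
        return "SNP" if len(ref) == 1 else "other"
    deleted_bp = len(ref) - len(alt)
    if deleted_bp > LARGE_DELETION_MIN_BP:
        return "large deletion"
    return "short indel"
```

A record with REF `AT` and ALT `A`, for instance, is labeled a short indel, while a deletion removing more than 50 bases is labeled a large deletion.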