|
|
| README for 1000 Genomes Pilot 1 SNP calls: |
| Lymphoblastoid cell line DNA for each individual is sequenced on |
| Illumina, Roche 454 or AB SOLiD platforms and mapped to the NCBI |
| version 36 human genome reference sequence using MAQ for Illumina |
| data, ssaha2 for Roche 454 data and Corona_lite 4.01 for AB SOLiD data. |
| The average depth of sequence coverage at HapMap 3 sites is 5.1 x |
| per individual for CEU, 2.8 x for CHB+JPT, and 3.7 x for YRI. |
| Illumina and Roche 454 base call quality scores are recalibrated |
| using the GATK software and PCR duplicate reads are removed with |
| Picard MarkDuplicates. Neither base call quality recalibration |
| nor duplicate removal was done for AB SOLiD data. Complete BAM |
| files containing all mapped sequence reads are available from |
| ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/data/ |
| NA*/alignment/ in files: |
| NAxxxxx.SLX.maq.SRP000031.2009_08.bam - for Illumina Solexa data |
| NAxxxxx.454.ssaha.SRP000031.2009_10.bam - for Roche 454 data |
| NAxxxxx.SOLID.corona.SRP000031.2009_08.bam - for AB SOLiD data |
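| The naming scheme above can be turned into a full URL with a small |
| helper. This is only a sketch: the function and dictionary names are |
| hypothetical, it assumes the NA* directory is named after the sample |
| ID, and not every individual has data from every platform. |
```python
# Base directory and file-name suffixes as given in the README text.
BASE_URL = "ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/data/"
BAM_SUFFIX = {
    "Illumina": "SLX.maq.SRP000031.2009_08.bam",
    "454": "454.ssaha.SRP000031.2009_10.bam",
    "SOLiD": "SOLID.corona.SRP000031.2009_08.bam",
}


def bam_url(sample_id, platform):
    """Build the expected BAM URL for one sample and platform.

    Assumption: the per-sample directory matched by NA* is the sample ID.
    """
    return f"{BASE_URL}{sample_id}/alignment/{sample_id}.{BAM_SUFFIX[platform]}"


print(bam_url("NA12004", "454"))
```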
| For each population, three primary call sets were made: by Jared |
| Maguire and colleagues at the Broad Institute using GATK (BI), by |
| Yun Li and Goncalo Abecasis at the University of Michigan using |
| MACH (UMich), and by Quang Le and Richard Durbin at the Sanger |
| Institute using QCALL (SI). The final release SNP loci consist of |
| all sites called in at least two of the three primary sets that also |
| pass the following filters: |
| (1) the call is not removed by local realignment of reads around the |
| call using the GATK local realignment tool. This filter removes a |
| relatively small number of spurious calls caused by alignment |
| problems around indels. |
| (2) the summed depth of all reads covering the call is less than twice |
| the average summed depth at HapMap 3 sites -- specifically, less |
| than or equal to 625 for CEU, 445 for YRI and 330 for CHB+JPT. |
| This eliminates a small number of calls at high copy number sites. |
| (3) no more than 20% of all Illumina reads covering the site have a |
| MAQ mapping quality score of 0. This filter removes calls at sites |
| in more repetitive sequence regions. The use of a threshold based on |
| Illumina read mapping quality scores does not indicate that Illumina |
| reads are the primary source of errors -- the score is simply used as |
| a proxy for the ability to accurately map reads from any technology |
| to the region. (The other short read aligners used, Corona_lite and |
| ssaha2, do not provide mapping quality scores.) |
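| The two-of-three rule and filters 2 and 3 can be sketched in Python as |
| follows. This is a minimal illustration, not the project's pipeline: |
| all function and argument names are hypothetical, filter 1 (local |
| realignment) is not modelled, and letting a site with no Illumina |
| reads pass filter 3 is an assumption. |
```python
# Filter 2 depth caps per population, as stated in the text above.
MAX_SUMMED_DEPTH = {"CEU": 625, "YRI": 445, "CHB+JPT": 330}
# Filter 3: at most this percentage of Illumina reads may have mapping quality 0.
MAX_MAPQ0_PERCENT = 20


def called_in_two_of_three(bi, umich, si):
    """True if the site is in at least two of the three primary call sets."""
    return sum((bool(bi), bool(umich), bool(si))) >= 2


def passes_depth_filter(population, summed_depth):
    """Filter 2: summed read depth must not exceed the population cap."""
    return summed_depth <= MAX_SUMMED_DEPTH[population]


def passes_mapq0_filter(n_illumina_reads, n_mapq0_reads):
    """Filter 3, using integer arithmetic to avoid rounding at exactly 20%.

    Assumption: a site with zero Illumina reads passes (0 <= 0).
    """
    return n_mapq0_reads * 100 <= MAX_MAPQ0_PERCENT * n_illumina_reads


def keep_site(population, bi, umich, si, summed_depth,
              n_illumina_reads, n_mapq0_reads):
    """Combine the two-of-three rule with filters 2 and 3."""
    return (called_in_two_of_three(bi, umich, si)
            and passes_depth_filter(population, summed_depth)
            and passes_mapq0_filter(n_illumina_reads, n_mapq0_reads))
```
| For example, keep_site("CEU", True, True, False, 625, 10, 2) is True, |
| while raising the summed depth to 626 makes the same site fail. |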
| Mask files for each population indicate for each base in the genome |
| whether it passes filters 2 and 3. These are available by anonymous |
| ftp at ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/ |
| release/2010_03/pilot1/supporting/*.mask.fa.gz. Larger files giving |
| the read depth at each position are in subdirectory /supporting/depths/. |
| Genotype calls for the sequenced individuals are also the consensus of |
| those from the three methods, i.e. a genotype is called when at least |
| two of the three methods agree on it. Calls for CEU and YRI |
| are phased based on the SI calls, which were made on the trio-phased |
| HapMap 3 haplotypes, except for individuals NA10851, NA12717, NA12004, |
| NA10847, NA12414 for CEU, and individuals NA18523, NA18522, NA19129, |
| NA18502, NA18907, NA18856 for YRI. There are no trio-phased HapMap 3 |
| haplotypes for these individuals. Their genotype calls, and all of the |
| CHB+JPT genotype calls, are phased according to the UMich primary call |
| set. In the rare cases where there is no consensus genotype call, |
| the SI call is used for CEU and YRI and the UMich call for CHB+JPT. |
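| The consensus genotype rule with its fallback can be sketched as below. |
| The function name and the genotype representation (a tuple of allele |
| indices, with phase ignored) are hypothetical, and the sketch assumes |
| that all three methods supply a call for the individual. |
```python
from collections import Counter

# Fallback caller when no two methods agree, as stated in the text above.
FALLBACK_CALLER = {"CEU": "SI", "YRI": "SI", "CHB+JPT": "UMich"}


def consensus_genotype(population, calls):
    """Return the 2/3 consensus genotype for one individual at one site.

    `calls` maps the caller labels "BI", "UMich" and "SI" to a genotype,
    here an unphased tuple of allele indices such as (0, 1).
    """
    genotype, support = Counter(calls.values()).most_common(1)[0]
    if support >= 2:
        return genotype
    # No two methods agree: fall back to the designated primary call set.
    return calls[FALLBACK_CALLER[population]]


agree = {"BI": (0, 1), "UMich": (0, 1), "SI": (0, 0)}
differ = {"BI": (0, 0), "UMich": (0, 1), "SI": (1, 1)}
print(consensus_genotype("CEU", agree))       # (0, 1)
print(consensus_genotype("CEU", differ))      # (1, 1), the SI call
print(consensus_genotype("CHB+JPT", differ))  # (0, 1), the UMich call
```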
| This 2/3 consensus genotype strategy produces a few thousand loci where |
| a SNP is called by two or three methods, but each method assigns the |
| minor allele to different individuals. In this case no non-reference |
| genotype is supported by two methods, so the 2/3 consensus rule |
| correctly calls every individual homozygous reference, and the |
| alternate allele is shown as "." with allele count zero. These sites |
| are included in the .vcf files but are |
| excluded from the Pilot 1 submission to dbSNP. For similar reasons, the |
| third allele at some tri-allelic loci will show an allele count of zero. |
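| A small worked example shows how a site can pass the two-of-three |
| rule and still end up with allele count zero. The individual labels |
| and genotype values are invented for illustration only; genotypes are |
| coded as the number of alternate alleles (0, 1 or 2). |
```python
from collections import Counter

# Hypothetical calls at one site: each method sees the SNP, but each
# places the single heterozygous genotype in a different individual.
calls = {
    "BI":    {"ind1": 1, "ind2": 0, "ind3": 0},
    "UMich": {"ind1": 0, "ind2": 1, "ind3": 0},
    "SI":    {"ind1": 0, "ind2": 0, "ind3": 1},
}


def majority(values):
    """Return the value backed by at least two methods, else None."""
    value, support = Counter(values).most_common(1)[0]
    return value if support >= 2 else None


# Per-individual consensus across the three methods.
consensus = {
    ind: majority([calls[method][ind] for method in calls])
    for ind in calls["BI"]
}
alt_allele_count = sum(consensus.values())

print(consensus)         # {'ind1': 0, 'ind2': 0, 'ind3': 0}
print(alt_allele_count)  # 0
```
| All three methods call the site, yet for every individual two methods |
| say homozygous reference, so the consensus carries no alternate allele. |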
| (28 March 2010 -- Richard Durbin on behalf of the 1000 Genomes Project |
| Analysis group, adapted by Tom Blackwell for dbSNP submission.) |