GENEVA alcohol-dependence project: Quality control report
15 September 2008
Table of Contents
- Project overview
- Quality control process and participants
- Sample number and composition
- Subject characteristics
- Gender and chromosomal anomalies
- Relatedness
- Population structure
- Missing call rates
- Batch effects
- Case status and missing call rate
- Mendelian error detection
- Duplicate error detection
- Sample exclusion and filtering
- SNP filters
- Hardy-Weinberg equilibrium
- Preliminary association tests
- URLs
- Literature Cited
- Appendix A Project participants
- Appendix B: Sample filter vectors and sample-by-chromosome filter matrices
- Tables and Figures
Project overview
The alcohol dependence project combines samples from three different studies: (1) the Collaborative Study on the Genetics of Alcoholism (COGA), (2) the Collaborative Genetic Study of Nicotine Dependence (COGEND) and (3) the Family Study of Cocaine Dependence (FSCD). For the GENEVA GWAS project, cases were defined as individuals with DSM-IV alcohol dependence and potentially other illicit drug dependence; controls were defined as individuals who had been exposed to alcohol (and possibly to other drugs), but had never become addicted to alcohol or other illicit substances (lifetime diagnoses); and a third group of subjects were defined as “other” if they were not alcohol dependent, but met lifetime diagnosis for marijuana or another illicit drug. Samples from all three studies, as well as HapMap controls, were genotyped on the Illumina Human 1M beadchip at the Center for Inherited Disease Research (CIDR) at Johns Hopkins University.
Quality control process and participants
Genotypic data for a total of 4324 DNA samples passed initial quality control at CIDR and were released to the GENEVA Coordinating Center (CC) at the University of Washington, the NCBI dbGaP team and the alcohol-dependence project team at Washington University. These data were further analyzed by all four groups and the results were discussed in weekly conference calls, which also included NHGRI personnel. Key participants in this process and their institutional affiliations are given in Appendix A. This document summarizes the findings of all participants.
Analysis tools varied by group and included primarily Illumina BeadStudio and SAS at CIDR, the R statistical package at the University of Washington, and PLINK at NCBI and Washington University. If not otherwise noted, analyses described below were done using R and the “ncdf” library to access data stored in netCDF files (see URL below).
Sample number and composition
The total sample set of 4324 contains 135 HapMap genotyping controls and 4189 project samples.
The 135 HapMap samples derive from 48 different subjects belonging to 16 trios (10 CEU and 6 YOR). Among the 48 HapMap subjects, 44 were genotyped from 2 to 5 times.
Among the 4189 project samples, 8 (representing 7 subjects) failed quality control and were removed from the data set. The 4181 passed samples derive from 4121 subjects, of which 60 were genotyped twice. Most of the subjects are unrelated, but 214 belong to 104 families (98 of size 2 and 6 of size 3).
Subject characteristics
Among the 4121 subjects, 1425 are from COGA, 1420 from COGEND and 1276 from FSCD. The distribution of case status across studies is given in Table 1.
Overall, the composition in terms of self-identified ethnicity is 2772 white (67%), 1340 black, and 9 in other categories. Subjects also self-identified Hispanic (141) or non-Hispanic (3980) ancestry. The gender composition is 2239 (54%) female and 1882 male. The ethnicity and gender composition varies by case-control status, as shown in Table 2.
The study subjects were recruited from 8 different study sites, including 7 states and the District of Columbia. The majority of subjects (62%) were recruited in Missouri.
Gender and chromosomal anomalies
One approach to gender identity checking is to look at the distribution of the intensities of SNP probes on the X and Y chromosomes, as shown in Figure 1. All samples annotated as male have a Y chromosome intensity greater than all samples annotated as female, suggesting that the gender annotations are correct. However, several unusual samples are delineated by the dashed lines in Figure 1. These samples are quite distinct from the majority of males and females, since the standard errors of the mean intensity values are very small due to the large sample sizes (40,097 X and 2,283 Y probes; see Figure 1 caption). Most of the samples that deviate from the main clusters of males or females are derived from lymphoblastoid cell lines, which may be prone to chromosome aneuploidy. However, some blood samples also show evidence of aneuploidy. For example, two samples annotated as male have a Y chromosome intensity typical of a male and an X chromosome intensity typical of a female, suggesting that they have an XXY karyotype. One blood-derived sample annotated as male has an unusually high Y chromosome intensity suggesting an XYY karyotype. Several cell line samples annotated as female have low X chromosome intensities typical of males and many others have intermediate X intensities, suggesting XX/XO mosaicism.
A plot of X chromosome heterozygosity versus mean X chromosome intensity for each sample is shown in Figure 2. Among the two presumptive XXY males, one has a relatively high X chromosome heterozygosity (suggesting perhaps an origin involving nondisjunction at the first meiotic division) and the other has a very low X heterozygosity (suggesting second division nondisjunction). All of the samples with very low X chromosome intensity have low heterozygosity typical of males, but those with intermediate intensity are quite variable in terms of heterozygosity.
Another approach to identifying chromosomal anomalies (such as aneuploidy and mosaic cell populations) is to examine the polar coordinate angle of heterozygotes. Figure 3 shows that trisomic cells (or a mixture of disomic and monosomic cells) can result in two different positions for heterozygotes at different loci. Illumina BeadStudio software provides a normalized version of the polar coordinate angle (θ), which they call “BAlleleFreq”. Figure 4a shows a normal chromosomal scan of BAlleleFreq for chromosome 1 in a sample from a cell line. The scan has bands at approximately 0, 0.5 and 1. The bands at 0 and 1 represent homozygotes and the band at 0.5 represents heterozygotes. Figure 4b shows a scan of the X chromosome for the same sample. In this case, there are two intermediate bands indicating that heterozygotes are either of the AAB or BBA type, as expected for a trisomic or mosaic sample. This sample is annotated as female and it has a mean X chromosome intensity of 0.93, which is at the lower limits of the distribution for most female samples, suggesting that the cell line is an XX/XO mosaic.
To identify aneuploid or mosaic samples systematically, we calculated, for each sample, the variance of BAlleleFreq values for SNPs that were not called as homozygous (i.e. either heterozygous or missing). We then examined chromosome scans (of the type shown in Figure 4) for the highest n samples of each sex in terms of the BAlleleFreq variance and the m highest and m lowest samples in terms of intensity. For the X chromosome, n=100 and m=50; for the autosomes, n=m=10.
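The screening step described above can be sketched in a few lines. This is a minimal illustration in Python, not the R code used for the project; the function names, the genotype call labels ("AA", "AB", "BB", None for missing) and the toy data are assumptions made for the example.

```python
from statistics import pvariance

def baf_variance(baf, calls):
    """Variance of BAlleleFreq over SNPs that were not called homozygous.

    baf   : list of BAlleleFreq values (floats in [0, 1]), one per SNP
    calls : list of genotype calls, "AA", "AB", "BB" or None (missing)
    """
    kept = [b for b, c in zip(baf, calls) if c not in ("AA", "BB")]
    return pvariance(kept) if len(kept) > 1 else float("nan")

def top_n_by_variance(samples, n):
    """Return the n sample ids with the highest BAlleleFreq variance."""
    scored = {sid: baf_variance(baf, calls) for sid, (baf, calls) in samples.items()}
    valid = {sid: v for sid, v in scored.items() if v == v}  # drop NaN scores
    return sorted(valid, key=valid.get, reverse=True)[:n]

# Toy data (hypothetical): one sample with a single heterozygote band near 0.5
# and one with split intermediate bands near 1/3 and 2/3.
samples = {
    "normal": ([0.0, 0.50, 0.51, 0.49, 1.0], ["AA", "AB", "AB", "AB", "BB"]),
    "split": ([0.0, 0.33, 0.67, 0.34, 1.0], ["AA", "AB", None, "AB", "BB"]),
}
print(top_n_by_variance(samples, 1))
```

A sample with split intermediate bands has a larger variance among its non-homozygous SNPs than a sample with a single band at 0.5, which is why ranking on this variance brings candidate anomalies to the top for visual review.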
Table 3 summarizes the number of chromosome anomalies detected as split intermediate bands in the BAlleleFreq chromosome scan and/or as outliers for X and/or Y chromosome intensity. More than 90% of the anomalies were detected in cell line samples, whereas only 34% of all samples are from cell lines. The X chromosome is more likely to have an anomaly than any of the autosomes. A total of 75 subjects each have one or two anomalies and 16 different chromosomes are involved.
Two examples of autosomal anomalies are shown in Figure 5. One involves all of chromosome 9 in a cell line sample and the other involves about one third of chromosome 11 in a blood sample.
Chromosomal anomalies are summarized in the file “chrom_anomalies.csv”. This table includes samples with a split intermediate BAlleleFreq band, outliers for X and/or Y chromosome intensity, and male samples with a Y chromosome heterozygosity greater than 5 standard deviations above the mean for all males. (The latter category is excluded from the summary in Table 3.)
Relatedness
The relatedness between each pair of subjects was evaluated by estimation of three coefficients corresponding to the probability that two (Z2), one (Z1) or zero (Z0) pairs of alleles are identical-by-descent (IBD). The kinship coefficient for a pair of subjects is Z2/2 + Z1/4. The estimation was done using a method of moments approach (using PLINK software) (Purcell et al. 2007), which is a computationally efficient approximation, and also using a maximum likelihood procedure, which is more computationally demanding (Weir et al. 2006). Table 4 shows the expected coefficients for some common relationships.
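As a worked illustration of the kinship formula above, the following Python sketch evaluates Z2/2 + Z1/4 for the textbook IBD probabilities of some common relationships in an outbred pedigree. The relationship labels and the dictionary are assumptions made for the example and are not copied from Table 4.

```python
def kinship(z0, z1, z2):
    """Kinship coefficient from IBD probabilities: Z2/2 + Z1/4."""
    if abs(z0 + z1 + z2 - 1.0) > 1e-6:
        raise ValueError("Z0 + Z1 + Z2 must sum to 1")
    return z2 / 2 + z1 / 4

# Textbook expected (Z0, Z1, Z2) values, assuming no inbreeding
EXPECTED = {
    "duplicate or MZ twin": (0.0, 0.0, 1.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half-sib-like": (0.5, 0.5, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}

for name, z in EXPECTED.items():
    print(f"{name}: kinship = {kinship(*z):.4f}")
```

Parent-offspring pairs and full siblings share the same kinship coefficient (0.25), so the individual coefficients, and Z1 in particular, are needed to tell them apart.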
Figure 6 is a plot of estimates of Z1 versus Z0 for all pairs of subjects with a kinship coefficient estimate > 0.05. These estimates were obtained by analyzing all samples (including HapMap controls) together using the method of moments and PLINK software with 80k SNPs. The results are very similar when analyzing each ethnic group separately (data not shown) and when using a maximum likelihood method with 10k SNPs (Figure 7). In Figure 6, the subject pairs are annotated with their expected relationship, based on pedigree records. We inferred genetic relatedness from the IBD coefficient estimates according to the vertical dashed lines in Figure 6. Pairs in the left-most panel with Z1>0.8 are considered parent-offspring, while those with Z1<0.1 are duplicate samples. Pairs in the second panel are considered full siblings and those in the third panel are considered to have a half-sib-like relationship (i.e. half-siblings, avuncular or grandparent-grandchild). In some cases, the inferred relatedness is not consistent with the expected relationship. The records of these subjects were examined and, in all cases, the inferred relationship was considered plausible. Therefore, the pedigree records for these pairs were changed to be consistent with their genetic relatedness.
Population structure
Population structure was evaluated using principal components (with all autosomal SNPs having a call rate >95%), essentially as described by Patterson et al. (2006). To supplement the 48 HapMap controls that were genotyped together with the study subjects, we included additional HapMap individuals that were genotyped on the Illumina 1M chip separately (see URL below). The total numbers of HapMap individuals included in this analysis are 57 CEU, 55 YOR, 13 CHB and 17 JPT.
Figure 8 shows a plot of the first two principal components (PCs) calculated with all unduplicated samples. PC1 separates the self-identified black and white subjects very well, while PC2 separates the Asian HapMap samples and the self-identified Hispanic subjects from the others. One Asian study subject falls in with the Asian HapMap samples and another sample falls close to the Asian group (PC2<-0.1). These two subjects were designated as population group outliers. Figure 9 shows pairwise plots of the first 4 principal components. PC3 and PC4 do not improve separation among the major ethnic groups and account for very small fractions of the total variance (<0.1%).
The data set used to calculate the PCs in Figure 8 includes some related subjects (total sample size of 4263). To evaluate the effect of relatives, one representative from each family was chosen based on phenotypic informativeness (reducing sample size to 4093) and the PCs were re-calculated. The first five PCs are highly correlated (r>=0.98), as shown in Figure 10 for PCs 1 and 2, whereas PCs 5-20 are not highly correlated (not shown). We are providing the first 20 PCs calculated from all unduplicated samples so that they can be used as covariates in association analyses that include and account for relatedness. These data are in the file “Principal_components.csv”.
We also calculated the PCs for each self-identified ethnic group separately. One group consists of 1254 unrelated study subjects (no HapMap controls) who are self-identified as black and non-Hispanic. A plot of PC1 vs. PC2 for the black subpopulation is shown in Figure 11, with color-coding by the study site in which the subjects were recruited. Although differentiation by study site is not apparent in the figure, analysis of variance shows that study site has a significant effect on PC1 (p=8e-06), but not on PC2. We also analyzed 2611 unrelated, self-identified white and non-Hispanic study subjects. Although the PC plot is dominated by a few outliers (not shown), study site has a significant effect on both PC1 and PC2. Study site also has a significant effect on each of the first four PCs in the analysis described above (Figures 8 and 9), which includes all unduplicated samples. To ensure the privacy of subjects, study site data will not be provided on dbGaP.
Missing call rates
Perhaps the most important measure of genotypic data quality is the missing call rate. Two rates were calculated for each sample and for each SNP in the following way (and provided in files “SNP_analysis_results.csv” and “Sample_analysis_results.csv”). (1) “missing.n1” is the missing call rate per SNP over all samples. (2) “missing.e1” is the missing call rate per sample for all SNPs with “missing.n1”<100%. (3) “missing.n2” is the missing call rate per SNP over all samples with “missing.e1”<5%. In this case, all samples have “missing.e1”<5%, so “missing.n1”=“missing.n2”. (4) “missing.e2” is the missing call rate per sample over all SNPs with “missing.n2”<5%.
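The four-step calculation above can be sketched as follows. This is an illustrative Python version operating on a small in-memory matrix, not the project's R/netCDF code; the function names, the layout (rows are SNPs, columns are samples, None marks a missing call) and the toy data are assumptions.

```python
def _rate(flags):
    """Fraction of True values; NaN when there is nothing to count."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")

def missing_rates(geno, sample_cut=0.05, snp_cut=0.05):
    """geno[i][j] is the call for SNP i in sample j; None marks a missing call."""
    snps, samples = range(len(geno)), range(len(geno[0]))
    # (1) per-SNP rate over all samples
    n1 = [_rate(geno[i][j] is None for j in samples) for i in snps]
    # (2) per-sample rate over SNPs that are not missing in every sample
    snps1 = [i for i in snps if n1[i] < 1.0]
    e1 = [_rate(geno[i][j] is None for i in snps1) for j in samples]
    # (3) per-SNP rate over samples passing the sample cutoff
    good = [j for j in samples if e1[j] < sample_cut]
    n2 = [_rate(geno[i][j] is None for j in good) for i in snps]
    # (4) per-sample rate over SNPs passing the SNP cutoff
    snps2 = [i for i in snps if n2[i] < snp_cut]
    e2 = [_rate(geno[i][j] is None for i in snps2) for j in samples]
    return {"missing.n1": n1, "missing.e1": e1, "missing.n2": n2, "missing.e2": e2}

# Toy matrix: 4 SNPs x 3 samples. The cutoffs are loosened to 0.5 only
# because the example is tiny; the report uses 5%.
geno = [
    [0, 1, 2],
    [None, None, None],
    [0, None, 1],
    [1, 1, 2],
]
for name, values in missing_rates(geno, sample_cut=0.5, snp_cut=0.5).items():
    print(name, values)
```

In the toy matrix every sample passes the sample cutoff, so the two per-SNP rates coincide, which mirrors the situation reported for this data set.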
Figure 12 shows a histogram of the missing call rate per sample (“missing.e2”) for all 4324 of the samples originally released. 95% of all samples have a missing call rate <1% and all samples have a rate <3%. (The same is true for “missing.e1”, which is only slightly larger than “missing.e2”.)
The Illumina 1M chip has a total of 1,072,820 probes of which 23,812 are “intensity-only”, leaving 1,049,008 probes as SNP assays. Among the 1,049,008 SNP assays, 95% have a missing call rate <1.4% and the median is 0.05%.
Batch effects
To look for genotyping batch effects on the missing call rate per sample, we analyzed “missing.e2” for autosomal SNPs. Each genotyping batch is generally a group of samples that occupy 3 columns on a 96-well plate, which are processed together through the genotyping chemistry. The mean batch size of released samples is 20. We analyzed log10(missing.e2) because it is much closer to being normally distributed than the untransformed rate (although still significantly non-normal). Figure 13 shows the distribution of the mean of log10(missing.e2) for each batch (n=213 batches). The batch effect is very highly significant in analysis of variance (p=2e-195). Two batches have somewhat higher missing rates than the others. After removing these two batches, the batch effect is still very highly significant (p=1e-167). Because of the non-normal distribution of log10(missing.e2), these p-values are only approximations, but clearly show the importance of genotyping batch effects. The batch with highest mean missing rate consists of only 3 released samples, because all of the other samples in that batch failed quality control at CIDR. Because of this association and the high mean missing rate, the three samples in that batch were subsequently also labeled as failures (and not included in the final dataset posted on dbGaP).
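For readers who want to reproduce this kind of test, a one-way analysis of variance on the log-transformed rate can be sketched as below. This is an illustration in plain Python, not the R analysis used for the report; the function name and the toy values are assumptions, and the sketch returns only the F statistic and its degrees of freedom (a p-value would require an F distribution routine).

```python
import math
from collections import defaultdict

def anova_f(values, groups):
    """One-way ANOVA: return (F statistic, df between, df within)."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    n, k = len(values), len(by_group)
    grand_mean = sum(values) / n
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    ss_between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in by_group.items())
    ss_within = sum((x - means[g]) ** 2 for g, v in by_group.items() for x in v)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, k - 1, n - k

# Hypothetical per-sample missing rates in two batches. log10 is undefined
# for a rate of exactly zero, so such samples would need separate handling.
missing_e2 = [0.001, 0.002, 0.0015, 0.010, 0.020, 0.015]
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]
log_rate = [math.log10(r) for r in missing_e2]
print(anova_f(log_rate, batch))
```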
Another way to detect genotyping batch effects is to assess the difference in allelic frequency between each batch and a pool of the other batches. We calculated a 1 d.f. chi-squared test statistic for each SNP and each batch and then summed these statistics over SNPs and normalized. The normalized score is a measure of how different each batch is from the other batches. It can be affected not only by laboratory processing, but also by the biological characteristics of the samples within the batch. The characteristic most likely to affect the distinctiveness of a batch is its ethnic composition relative to the mean composition across batches. Figure 14 shows a plot of genotyping batch ethnic composition versus the normalized chi-squared score. As expected, this plot shows that the difference is strongly dependent on ethnic composition. There are no outliers that would indicate a strong batch effect once ethnicity is taken into account.
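A sketch of the allele-frequency comparison follows. The 2x2 Pearson chi-squared statistic is standard, but the report does not specify how the summed statistics were normalized; dividing by the number of SNPs is an assumption made here purely for illustration, as are the function names and data layout.

```python
def chisq_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def batch_score(allele_counts, batch):
    """Mean per-SNP chi-squared of one batch against the pool of all others.

    allele_counts[snp][batch_id] = (count of A alleles, count of B alleles)
    """
    total = 0.0
    for per_batch in allele_counts:
        a_in, b_in = per_batch[batch]
        a_out = sum(a for k, (a, _) in per_batch.items() if k != batch)
        b_out = sum(b for k, (_, b) in per_batch.items() if k != batch)
        total += chisq_2x2(a_in, b_in, a_out, b_out)
    return total / len(allele_counts)  # assumed normalization: mean per SNP

# Hypothetical allele counts for two SNPs in two batches
counts = [
    {"b1": (20, 0), "b2": (0, 20)},   # batches differ completely
    {"b1": (10, 10), "b2": (10, 10)}, # batches identical
]
print(batch_score(counts, "b1"))
```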
Case status and missing call rate
A missing call rate difference between cases and controls can lead to spurious results in association studies because missingness may be nonrandom. Figure 15a shows the distribution of missing call rates for samples belonging to the three types of status in this project – case, control and “other” (dependent on drug other than alcohol). The effect of status type on log10 of missing rate is significant (p=0.01), which is due to the “other” type having a higher rate than either cases or controls. When “other” is removed from the analysis, the difference between cases and controls is not significant (p=0.25). The higher missing rate of “other” may be due to confounding with the type of tissue from which the DNA was extracted, as shown in Table 5. Figure 15b shows that cell line samples have a significantly higher missing call rate than whole blood samples (p=9e-14). The effect of tissue accounts for the difference in missing rate between “other” and the remaining two types, since the “other” samples are predominantly from cell lines.
Mendelian error detection
Genotyping errors can be detected in parent-offspring pairs and trios as genotype combinations that are inconsistent with Mendelian inheritance. This project includes 16 trios among the HapMap controls and 6 parent-offspring pairs among the study subjects. Many of the HapMap individuals were genotyped two or more times. To maximize the ability to detect errors, we examined all possible parent-offspring trios of replicate samples within a family.
For each SNP, the Mendelian error rate was calculated as the number of errors detected divided by the number of families in which the offspring and at least one parent are non-missing (and recorded as “mend.err” in the “SNP_analysis_results.csv” file). Among the 1,040,106 SNPs with a possibility for error detection, 99.1% had no errors and the mean error rate is 0.04%. We also counted the number of families with at least one error. The distribution is shown in Figure 16. There is a large difference in the number of SNPs with one versus two families having at least one error (7674 and 900) and we recommend filtering out SNPs that have two or more families with at least one error (which is a total of 1251 SNPs).
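The per-SNP calculation can be illustrated with a small Python sketch for autosomal SNPs. Genotypes are coded here as the number of A alleles (0, 1 or 2), with None for a missing call; this coding, the function names and the example families are assumptions, and sex-chromosome SNPs would need different rules.

```python
def transmissible(genotype):
    """Numbers of A alleles (0 or 1) that a parent can pass to a child.

    A missing parent (None) is treated as able to transmit either allele.
    """
    return {0: {0}, 1: {0, 1}, 2: {1}, None: {0, 1}}[genotype]

def mendel_check(child, mother, father):
    """True if consistent, False if a Mendelian error, None if not testable."""
    if child is None or (mother is None and father is None):
        return None
    possible = {m + f for m in transmissible(mother) for f in transmissible(father)}
    return child in possible

def snp_mendel_error_rate(families):
    """families: list of (child, mother, father) genotypes at one SNP."""
    results = [mendel_check(*fam) for fam in families]
    testable = [r for r in results if r is not None]
    errors = sum(1 for r in testable if not r)
    return errors / len(testable) if testable else float("nan")

# Hypothetical genotypes at one SNP for three families
families = [(1, 0, 0), (1, 1, 1), (None, 0, 0)]
print(snp_mendel_error_rate(families))
```

With only one parent genotyped, fewer inconsistent combinations exist (only opposite homozygotes in parent and child), so pairs reveal fewer errors than complete trios.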
The Mendelian error rate also can be expressed on a per family basis as the number of errors detected across all SNPs divided by the number of possibilities for error detection. These rates are summarized in Figure 17, which shows that the rates for trios appear to be larger than those for pairs (as expected because having both parents allows more errors to be detected), and the rates for self-identified black subjects may be greater than for whites (perhaps due to the greater SNP diversity in the black subpopulation), although the number of families is small.
Duplicate error detection
In this project, 60 study subjects were genotyped in duplicate and 44 HapMap subjects were genotyped from 2 to 5 times each. For each subject, all pairs of samples were examined for genotype discordance.
For each SNP, the discordance rate was calculated as the number of discordant genotype calls divided by the number of opportunities to detect discordance (recorded as “discord.dup.rate” in the file “SNP_analysis_results.csv”). Among the 1,040,106 SNPs with opportunity for detection, 99.7% had no discordance and the mean rate is 0.02%. We also counted the number of subjects (out of 104) with at least one discordance detected. The distribution is shown in Figure 18. There is a large difference in the number of SNPs with one versus two subjects having at least one discordance (20,182 and 2380) and we recommend filtering out SNPs that have two or more subjects with at least one discordance (which is a total of 3397 SNPs).
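The per-SNP discordance calculation can be sketched as follows, comparing all pairs of replicate samples within each subject. The function names and toy genotypes are assumptions for illustration; genotypes may be any comparable code, with None for a missing call.

```python
from itertools import combinations

def pair_discordance(replicates):
    """Count discordant pairs and comparable pairs among one subject's replicates."""
    discordant = opportunities = 0
    for a, b in combinations(replicates, 2):
        if a is None or b is None:
            continue  # no opportunity to detect discordance
        opportunities += 1
        discordant += a != b
    return discordant, opportunities

def snp_discordance(subjects):
    """Return (discordance rate, number of subjects with >= 1 discordance)."""
    total_d = total_o = subjects_with = 0
    for replicates in subjects:
        d, o = pair_discordance(replicates)
        total_d += d
        total_o += o
        subjects_with += d > 0
    rate = total_d / total_o if total_o else float("nan")
    return rate, subjects_with

# Hypothetical calls at one SNP: three subjects with 2, 3 and 2 replicates
subjects = [[0, 0], [1, 2, 1], [None, 1]]
print(snp_discordance(subjects))
```

The second returned value corresponds to the count used for the suggested filter (SNPs with two or more subjects showing at least one discordance).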
The discordance rate can also be expressed on a per subject basis as the number of discordant calls detected across all SNPs divided by the number of possibilities for error detection (i.e. both calls non-missing). The mean discordance rate is 0.02% and the distribution by self-identified ethnicity is shown in Figure 19. In an apparent reversal of the Mendelian error rate, the discordance rate for self-identified blacks is slightly lower than for self-identified whites.
Sample exclusion and filtering
Through the quality control process, 8 of 4324 samples with genotypic data were classified as failures and will be excluded from the dataset posted on dbGaP. These samples include 4 with poor genotypic data quality and 4 with a questionable connection between genotype and phenotype (which includes consent status). This leaves a total of 4316 samples.
Among the 4316 remaining samples, we suggest application of the following filters prior to most types of data analysis: (A) Filter out SNPs for each sample-by-chromosome combination with a chromosomal anomaly (as described above) and/or with a missing call rate per chromosome >= 5%. (B) Filter out SNPs on all chromosomes for each sample with an overall missing call rate >=2% and each sample considered an outlier to all major population groups. For specific analyses, such as Hardy-Weinberg testing, additional filters are suggested, such as filtering out all samples except one sample per subject, one subject per family, one ethnic group and one type of case status (usually controls). To facilitate the application of these filter suggestions, we are providing a set of sample-by-chromosome matrices and sample filter vectors. Most of the matrices are logical, with values of TRUE/FALSE to indicate retaining/eliminating each sample-by-chromosome combination. See Appendix B for details. We also are providing PLINK code for applying these filters to produce analysis-ready data sets.
Table 6 summarizes the suggested whole-sample quality filters (B above), which remove 29 out of 4316 samples (0.7%). Table 7 summarizes the chromosome-specific sample filters (A above), showing that the percent of potential genotype calls lost is 0.04% over all autosomes and about 1% for the sex chromosomes.
SNP filters
Table 8 summarizes a sequence of SNP filters suggested for removing assays of low quality or informativeness prior to viewing Hardy-Weinberg and association test results. The filters above the HWE row were used for the HWE QQ plots discussed below. A TRUE/FALSE vector for selecting these SNPs is provided in the HWE results file (“Hardy_Weinberg_tests.csv”). For the association test results discussed below, all of the filters in Table 8 were applied, although the minor allele frequency (MAF) filter suggested varies with the sample size. For all cases and controls (1899 and 1946, respectively) we suggest MAF=0.005; for the white subpopulation (1165 cases, 1379 controls) we suggest MAF=0.01 and for the black subpopulation (628 cases, 426 controls) we suggest MAF=0.02. A TRUE/FALSE vector for selecting SNPs is provided in each of the association test files (“Logistic_reg_all.csv”, “Logistic_reg_black_subpop.csv” and “Logistic_reg_white_subpop.csv”). Setting the filter thresholds is an attempt to balance the benefit of removing low quality SNPs against the risk of removing some high quality SNPs of interest, and many of the thresholds are somewhat arbitrary. We strongly recommend visually reviewing cluster plots for all SNPs of interest.
Hardy-Weinberg equilibrium
We tested Hardy-Weinberg equilibrium (HWE) separately in 476 African-American and 1379 European control samples that are unduplicated, unrelated and passing quality filters (i.e. “scmat.sd2.afr.cntl.02.csv” and “scmat.sd2.eur.cntl.02.csv” described in Appendix B). In this case, the ethnic groups are defined by principal components, but the results are very similar when the groups are defined by self-identified ethnicity. The HWE exact test (implemented by PLINK software) was used. The results are summarized by the QQ plots in Figures 20 and 21 for autosomal and X chromosome SNPs, respectively.
The deviation between observed and expected p-values occurs at approximately 0.01 for autosomal SNPs in both ethnic groups and between 0.001 and 0.0001 for X-chromosome SNPs in females. This difference is not due to the sample size difference, since doing the tests on females only for autosomal SNPs gives essentially the same QQ plots. Nor is it due to the smaller number of X chromosome SNPs, since random samples of the same number of autosomal SNPs show essentially the same difference in QQ plots. At this point, we can only speculate on possible reasons for the X-autosome difference. Some possibilities are: (a) greater stringency in development of X chromosome SNP assays, since they are generally more difficult to call because of male-female differences, (b) less chance of the calling algorithm missing a rare homozygous class (because of hemizygous males) and (c) fewer copy number variations on the X that might interfere with assay performance.
The QQ plot deviation at p-value=0.01 for autosomal SNPs suggests that roughly 1% of SNPs have significant deviation from HWE. If these deviations are due largely to population structure, we expect them to occur mainly in the direction of heterozygote deficiency. To examine this issue, we looked at the distribution of an estimate of the inbreeding coefficient calculated as 1-(number of observed heterozygotes / number of expected heterozygotes), which is positive with heterozygote deficiency and negative with heterozygote excess. Figure 22 shows that the distribution is fairly symmetrical. The mean of the distribution is 0.0010 for European-Americans and 0.0004 for African-Americans. These results suggest that population structure is not the predominant cause of deviations from HWE. We also find that the HWE deviations are uniformly distributed across the autosomes, except for a concentration of SNPs near the HLA region on chromosome 6 (Figure 23). Therefore, it seems likely that a large fraction of the deviations are due to genotyping artifacts.
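The inbreeding coefficient estimate used above can be computed from genotype counts as in this sketch. The expected number of heterozygotes is taken here as 2pqN with p estimated from the same sample, which is an assumption (the report does not state whether a small-sample correction was applied); the function name is also illustrative.

```python
def inbreeding_f(n_aa, n_ab, n_bb):
    """Estimate F = 1 - (observed heterozygotes / expected heterozygotes)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return float("nan")
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    expected_het = 2 * p * (1 - p) * n   # Hardy-Weinberg expectation
    if expected_het == 0:
        return float("nan")              # monomorphic SNP
    return 1 - n_ab / expected_het

print(inbreeding_f(25, 50, 25))   # Hardy-Weinberg proportions
print(inbreeding_f(50, 0, 50))    # complete heterozygote deficiency
print(inbreeding_f(0, 100, 0))    # complete heterozygote excess
```

Positive values indicate heterozygote deficiency and negative values indicate heterozygote excess, so a distribution centered near zero and roughly symmetrical argues against population structure as the main cause of the deviations.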
Some genotyping artifacts are obvious from observing cluster plots of SNPs with extreme deviations from HWE. Two “phenotypes” are common: (a) One homozygous class is frequent, but called as heterozygous (along with the presumably true heterozygotes) because the two clusters are not clearly separated. (b) Apparent null alleles (possibly caused by second site polymorphisms that interfere with the assay) generate multiple genotypic classes that are called incorrectly. Some extreme examples are shown in Figure 24.
Although QQ plots show deviation of observed from expected p-values at about 0.01, we suggest using a filter threshold of p=0.0001, because examination of cluster plots suggests that a large fraction of assays with p>0.0001 have good clustering and genotype calling. The threshold is rather subjective, but we are reluctant to recommend a higher threshold that would eliminate many perfectly good SNPs.
Preliminary association tests
Association tests were done on three subsets of the data. The first subset contains all study samples classified as either case or control, which were further filtered to retain one sample per subject, one subject per family and samples passing quality control (i.e. “scmat.cscntl.02.csv” described in Appendix B). This subset includes subjects of both racial groups (black and white) and both Hispanic status types. The sample size is 1899 cases and 1946 controls. These data were analyzed in R with two logistic regression models in which alcohol-dependent case status is the dependent variable. In model (a) the independent variables include one SNP (genotypes coded as 0, 1 or 2), gender, study (COGA, COGEND or FSCD), and two principal components (PC1, PC2) to control for population structure. Model (b) is the same as model (a) except that SNP is omitted from the independent variables. Two test statistics were computed: a likelihood ratio test (likelihood of model a / likelihood of model b) and the Wald test. The QQ plot for the likelihood ratio test is shown in Figure 25a. At least two or three SNPs appear to achieve genome-wide significance. Figure 26 shows that the top four hits each have good cluster plots.
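As an illustration of how a likelihood ratio is turned into a p-value, the sketch below takes the maximized log-likelihoods of the two nested models and uses the usual large-sample result that twice the log of the likelihood ratio follows a chi-squared distribution with 1 degree of freedom (one extra parameter for the additively coded SNP). The function name and the example log-likelihood values are hypothetical, and fitting the logistic models themselves is outside the scope of the sketch.

```python
import math

def lrt_1df(loglik_full, loglik_reduced):
    """Likelihood ratio test for one extra parameter.

    Returns (statistic, p-value), where statistic = 2 * (llA - llB) and the
    p-value is the upper tail of a chi-squared distribution with 1 d.f.
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # For 1 d.f., P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods for model (a) with the SNP and model (b) without
stat, p = lrt_1df(-2600.0, -2615.0)
print(f"LRT statistic = {stat:.2f}, p = {p:.3g}")
```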
We also analyzed each ethnic group separately, for which European- and African-American subpopulations were defined by principal components, although the results are very similar for self-identified racial groups. The two models used for estimating the likelihood ratio test are the same as above, except omitting the principal component covariates. The genomic control factor (Devlin and Roeder, 1999) for all unrelated subjects analyzed together is 1.04, compared with 1.03 and 1.02 for the European- and African-American subgroups, respectively. These results suggest that the first two principal component covariates do not fully account for the effects of population structure in the mixed race group.
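A genomic control factor is commonly estimated as the median of the observed 1 d.f. chi-squared test statistics divided by the median of the chi-squared distribution with 1 degree of freedom (about 0.455). The sketch below uses that median-based estimator; whether the report used exactly this form is an assumption, and the function name and toy statistics are illustrative.

```python
from statistics import median

# Median of the chi-squared distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.454936

def genomic_control_lambda(chisq_stats):
    """Median-based genomic control factor from 1 d.f. test statistics."""
    return median(chisq_stats) / CHI2_1DF_MEDIAN

# Hypothetical test statistics for a handful of SNPs
stats = [0.10, 0.30, 0.47, 0.90, 2.50]
print(round(genomic_control_lambda(stats), 3))
```

Values near 1 indicate little inflation of the test statistics; values above 1 suggest residual population structure or other systematic effects.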
In performing association tests in R for X-linked SNPs, male genotypes were coded as 0 and 2 (for BY and AY), whereas female genotypes were coded as 0, 1 and 2 (for BB, BA and AA). This differs from association tests in PLINK, where males are coded as 0 and 1 (for BY and AY), whereas female genotypes are coded as 0, 1 and 2 (for BB, BA and AA). The former coding seems appropriate to reflect the fact that, with X inactivation in females, the number of active alleles in homozygous females equals that in hemizygous males. However, some prefer the latter coding, which reflects the number of “A” alleles in the genotype. Wald test results for both methods are provided (R coding of 0,2 for males and PLINK coding of 0,1 for males).
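The two coding schemes can be made concrete with a short sketch. The function name, the genotype labels and the "R"/"PLINK" scheme argument are assumptions introduced for the illustration.

```python
def code_x_genotype(genotype, sex, male_coding="R"):
    """Numeric code for an X-linked genotype.

    genotype    : "AA", "AB"/"BA" or "BB" for females; "AY" or "BY" for males
    sex         : "F" or "M"
    male_coding : "R" codes males as 0/2, "PLINK" codes males as 0/1
    """
    if sex == "F":
        return {"BB": 0, "AB": 1, "BA": 1, "AA": 2}[genotype]
    if male_coding not in ("R", "PLINK"):
        raise ValueError("male_coding must be 'R' or 'PLINK'")
    a_value = 2 if male_coding == "R" else 1
    return {"BY": 0, "AY": a_value}[genotype]

for scheme in ("R", "PLINK"):
    codes = [code_x_genotype(g, "M", scheme) for g in ("BY", "AY")]
    print(scheme, "male codes for BY, AY:", codes)
```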
URLs
netCDF data files: http://www.unidata.ucar.edu/software/netcdf/
Illumina HapMap genotypes: ftp://ftp.illumina.com/Whole%20Genome%20Genotyping%20Files/Human1M_product_files/Human1M_FullCall_Reports
Literature Cited
Devlin, B. and K. Roeder. 1999. Genomic control for association studies. Biometrics 55: 997-1004. [PubMed ID: 11315092]
Patterson, N., A.L. Price, and D. Reich. 2006. Population structure and eigenanalysis. PLoS Genet 2: e190. [PubMed ID: 17194218]
Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M.A. Ferreira, D. Bender, J. Maller, P. Sklar, P.I. de Bakker, M.J. Daly, and P.C. Sham. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559-575. [PubMed ID: 17701901]
Weir, B.S., A.D. Anderson, and A.B. Hepler. 2006. Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet 7: 771-780. [PubMed ID: 16983373]
Appendix A Project participants
Department of Psychiatry, Washington University:
Laura Bierut, Sherri Fisher, John Rice, Nancy Saccone
CIDR, Johns Hopkins University:
Kim Doheny, Kurt Hetrick, Elizabeth Pugh
dbGaP, NCBI:
Mike Feolo, Justin Paschall
GENEVA program, NHGRI:
Emily Harris, Teri Manolio
Department of Biostatistics, University of Washington:
Tushar Bhangale, Fred Boehm, Ryan Kyle, Cathy Laurie, Thomas Lumley, Ian Painter, Ken Rice, Bruce Weir, Xiuwen Zheng
Appendix B: Sample filter vectors and sample-by-chromosome filter matrices
Sample filter vectors include the following:
- “unrelated” designates a set of samples that includes one sample per subject and one subject per family, and excludes two population group outliers (in “Sample_annotation.csv” file)
- “case.status” designates samples as case, control or other (in “Sample_annotation.csv” file)
- “missing.e2” is the missing call rate per sample (after removing SNPs with “missing.n1”>=5%). This vector is in the “Sample_analysis_results.csv” file.
- “pop.sd2” assigns each sample to a population subgroup based on principal components, where samples within 2 standard deviations of self-identified race and ethnicity have non-missing values of 'eur.non' (European ancestry, non-Hispanic), 'afr.non' (African ancestry, non-Hispanic), or 'his' (Hispanic and not belonging to eur.non or afr.non). This vector is in the “Sample_analysis_results.csv” file.
Sample-by-chromosome matrices include the following. Each matrix is 4316 samples x 26 chromosomes (1-22 autosomes, X, pseudo-autosomal, Y, mitochondrion). Most of these are logical matrices with values of TRUE/FALSE to indicate retaining/eliminating each sample-by-chromosome combination.
- “scmat.anom.csv” is a logical matrix designating chromosomal anomalies
- “scmat.miss.rate.csv” is a matrix of missing call rates
- “scmat.cscntl.02.csv” is a logical matrix with element [i,j] = FALSE when “scmat.anom.csv”[i,j] = FALSE and/or “scmat.miss.rate.csv”[i,j] >= 0.05; and the ith row (i.e. sample) [i,] = FALSE when “missing.e2” >= 0.02 and/or “unrelated” is not equal to 1 and/or “case.status” is other than “CASE” or “CNTL”. This matrix was used for association testing with principal components as covariates to account for population structure.
- “scmat.sd2.afr.cscntl.02.csv” is a logical matrix with element [i,j] = FALSE when “scmat.anom.csv”[i,j] = FALSE and/or “scmat.miss.rate.csv”[i,j] >= 0.05; and the ith row (i.e. sample) [i,] = FALSE when “missing.e2” >= 0.02 and/or “unrelated” is not equal to 1 and/or “case.status” is other than “CASE” or “CNTL” and/or “pop.sd2” is not “afr.non”. This matrix was used for association testing within the subpopulation having predominantly African-American ancestry.
- “scmat.sd2.eur.cscntl.02.csv” is a logical matrix with element [i,j] = FALSE when “scmat.anom.csv”[i,j] = FALSE and/or “scmat.miss.rate.csv”[i,j] >= 0.05; and the ith row (i.e. sample) [i,] = FALSE when “missing.e2” >= 0.02 and/or “unrelated” is not equal to 1 and/or “case.status” is other than “CASE” or “CNTL” and/or “pop.sd2” is not “eur.non”. This matrix was used for association testing within the subpopulation having predominantly European-American ancestry.
- “scmat.sd2.afr.cntl.02.csv” is a logical matrix with element [i,j] = FALSE when “scmat.anom.csv”[i,j] = FALSE and/or “scmat.miss.rate.csv”[i,j] >= 0.05; and the ith row (i.e. sample) [i,] = FALSE when “missing.e2” >= 0.02 and/or “unrelated” is not equal to 1 and/or “case.status” is other than “CNTL” and/or “pop.sd2” is not “afr.non”. This matrix was used for Hardy-Weinberg equilibrium testing within the subpopulation having predominantly African-American ancestry.
- “scmat.sd2.eur.cntl.02.csv” is a logical matrix with element [i,j] = FALSE when “scmat.anom.csv”[i,j] = FALSE and/or “scmat.miss.rate.csv”[i,j] >= 0.05; and the ith row (i.e. sample) [i,] = FALSE when “missing.e2” >= 0.02 and/or “unrelated” is not equal to 1 and/or “case.status” is other than “CNTL” and/or “pop.sd2” is not “eur.non”. This matrix was used for Hardy-Weinberg equilibrium testing within the subpopulation having predominantly European-American ancestry.
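The construction rules for these logical matrices can be sketched in Python as follows. This is an illustration under stated assumptions, not the code that produced the files: the function name `build_filter`, its argument names and the list-of-lists representation are hypothetical, and the vectors and matrices are assumed to have been read from the CSV files already.

```python
def build_filter(anom_keep, chrom_miss_rate, missing_e2, unrelated, case_status,
                 allowed_status=("CASE", "CNTL"), pop_sd2=None, required_pop=None):
    """Build a sample-by-chromosome logical filter (TRUE = retain).

    anom_keep[i][j]       : False where a chromosomal anomaly was flagged
    chrom_miss_rate[i][j] : missing call rate of sample i on chromosome j
    missing_e2[i]         : per-sample missing call rate
    unrelated[i]          : 1 if the sample belongs to the unrelated set
    case_status[i]        : status label of the sample
    pop_sd2 / required_pop: optional population subgroup restriction
    """
    rows = []
    for i in range(len(missing_e2)):
        # Whole-sample conditions: failing any one sets the entire row to False.
        sample_ok = (missing_e2[i] < 0.02
                     and unrelated[i] == 1
                     and case_status[i] in allowed_status)
        if required_pop is not None:
            sample_ok = sample_ok and pop_sd2[i] == required_pop
        # Chromosome-specific conditions are applied element by element.
        rows.append([sample_ok and anom_keep[i][j] and chrom_miss_rate[i][j] < 0.05
                     for j in range(len(anom_keep[i]))])
    return rows


# Toy data: 3 samples x 2 chromosomes (values invented for illustration).
anom_keep = [[True, True], [True, False], [True, True]]
miss_rate = [[0.01, 0.06], [0.0, 0.0], [0.0, 0.0]]
missing_e2 = [0.001, 0.001, 0.03]
unrelated = [1, 1, 1]
status = ["CASE", "CNTL", "CASE"]
pops = ["afr.non", "eur.non", "afr.non"]

all_cscntl = build_filter(anom_keep, miss_rate, missing_e2, unrelated, status)
afr_cscntl = build_filter(anom_keep, miss_rate, missing_e2, unrelated, status,
                          pop_sd2=pops, required_pop="afr.non")
print(all_cscntl)
print(afr_cscntl)
```

Separating the whole-sample conditions from the chromosome-specific ones mirrors the wording of the list: a sample failing a whole-sample condition loses all 26 chromosomes, whereas an anomaly or a high chromosome-specific missing call rate removes only that one sample-by-chromosome combination.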
Tables and Figures
Case status in the three studies.
Case status | COGA | COGEND | FSCD | TOTAL |
---|---|---|---|---|
CASE | 910 | 457 | 577 | 1,944 |
CONTROL | 515 | 945 | 505 | 1,965 |
OTHER | 0 | 18 | 194 | 212 |
TOTAL | 1,425 | 1,420 | 1,276 | 4,121 |
Gender and ethnic composition by case status.
Characteristic | CASE | CONTROL | OTHER |
---|---|---|---|
N | 1,944 | 1,965 | 212 |
% White | 64.4 | 73.7 | 34.0 |
% Female | 39.2 | 68.8 | 59.0 |
The number of anomalies found by chromosome and tissue type.
Chromosome | Cell Line | Whole Blood |
---|---|---|
Autosome | 30 | 3 |
X chromosome | 38 | 3 |
Y chromosome | 5 | 2 |
Total | 73 | 8 |
Percent of total | 90.1% | 9.9% |
IBD coefficients for common relationships under the assumption of no inbreeding.
Z2 | Z1 | Z0 | Kinship | Relationship |
---|---|---|---|---|
1.00 | 0.00 | 0.00 | 0.5 | MZ twin or duplicate |
0.00 | 1.00 | 0.00 | 0.25 | parent-offspring |
0.25 | 0.50 | 0.25 | 0.25 | full siblings |
0.00 | 0.50 | 0.50 | 0.125 | half siblings |
0.00 | 0.25 | 0.75 | 0.0625 | first cousins |
0.00 | 0.00 | 1.00 | 0 | unrelated |
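The Kinship column in this table is consistent with the standard relation between the kinship coefficient and the IBD coefficients in the absence of inbreeding, kinship = Z2/2 + Z1/4. The short Python sketch below checks that relation against each row; the function name `kinship_from_ibd` is hypothetical.

```python
def kinship_from_ibd(z2: float, z1: float) -> float:
    """Kinship coefficient from the probabilities of sharing 2 and 1 alleles IBD,
    assuming no inbreeding."""
    return z2 / 2 + z1 / 4


# (Z2, Z1, Z0, relationship) rows copied from the table above.
table = [
    (1.00, 0.00, 0.00, "MZ twin or duplicate"),
    (0.00, 1.00, 0.00, "parent-offspring"),
    (0.25, 0.50, 0.25, "full siblings"),
    (0.00, 0.50, 0.50, "half siblings"),
    (0.00, 0.25, 0.75, "cousins"),
    (0.00, 0.00, 1.00, "unrelated"),
]
for z2, z1, z0, label in table:
    print(f"{label}: kinship = {kinship_from_ibd(z2, z1)}")
```

Parent-offspring pairs and full siblings have the same kinship coefficient (0.25), so the individual IBD coefficients are needed to tell them apart.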
Number of samples derived from cell line versus whole blood, according to case status.
Tissue type | CASE | CONTROL | OTHER |
---|---|---|---|
Cell Line | 664 | 493 | 164 |
Whole Blood | 1,280 | 1,472 | 48 |
Whole sample filter summary.
No. of samples | Samples lost | Filter |
---|---|---|
4,316 | 0 | none |
4,289 | 27 | missing call rate per sample >=2% |
4,287 | 2 | population group outliers (PC2<-0.1) |
4,287 | 29 | Total |
99.3% | 0.7% | Percentage |
Percent of potential genotype calls lost due to chromosome-specific sample filters
Chromosome | Chromosome anomalies | Missing call rate per chromosome >=5% | Both |
---|---|---|---|
Autosomes | 0.04% | 0.01% | 0.04% |
X | 0.97% | 0.42% | 1.04% |
Y | 0.46% | 0.72% | 0.81% |
XY | 0.44% | 1.60% | 1.92% |
M | 0.00% | 1.02% | 1.02% |
Summary of SNP filters. The total number of SNP assays attempted is 1,049,008, which is the total number of probes on the chip (1,072,820) minus the number of intensity-only probes. The number of SNPs kept and lost at each step is relative to the previous step.
SNPs kept | SNPs lost | remove SNPs with: |
---|---|---|
1,049,008 | 0 | SNP assays attempted |
1,040,106 | 8,902 | missing call rate = 100% |
1,008,351 | 31,755 | MAF = 0 in all samples |
979,551 | 28,800 | missing call rate >= 2% |
979,551 | 0 | missing call rate >=5% in one or both sexes |
978,716 | 835 | > 1 family with Mendelian error(s) |
977,873 | 843 | > 1 subject with discordant call(s) |
977,860 | 13 | sex difference in allele frequency >=0.2 |
977,860 | 0 | sex difference in heterozygosity >= 0.3 (autosomes and XY only) |
974,029 | 3,831 | Hardy-Weinberg p-value < 1e-4 in either ethnic group |
948,658 | 25,371 | MAF<0.005 in study subjects |
948,658 | 100,350 | Overall |
90.4% | 9.6% | Percentage of assays attempted |
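Because each row of this table is relative to the previous step, the "SNPs kept" column can be reproduced by subtracting the losses cumulatively. The following Python sketch does this with the loss counts copied from the table; the variable and function names are illustrative only.

```python
ATTEMPTED = 1_049_008  # SNP assays attempted

# (filter description, SNPs lost at that step), in the order applied in the table.
LOSSES = [
    ("missing call rate = 100%", 8_902),
    ("MAF = 0 in all samples", 31_755),
    ("missing call rate >= 2%", 28_800),
    ("missing call rate >= 5% in one or both sexes", 0),
    ("> 1 family with Mendelian error(s)", 835),
    ("> 1 subject with discordant call(s)", 843),
    ("sex difference in allele frequency >= 0.2", 13),
    ("sex difference in heterozygosity >= 0.3", 0),
    ("Hardy-Weinberg p-value < 1e-4 in either ethnic group", 3_831),
    ("MAF < 0.005 in study subjects", 25_371),
]


def apply_filters(attempted, losses):
    """Return (kept, lost, label) after each sequential filter."""
    kept = attempted
    rows = []
    for label, lost in losses:
        kept -= lost
        rows.append((kept, lost, label))
    return rows


rows = apply_filters(ATTEMPTED, LOSSES)
final_kept = rows[-1][0]
total_lost = ATTEMPTED - final_kept
pct_kept = round(100 * final_kept / ATTEMPTED, 1)

for kept, lost, label in rows:
    print(f"{kept:>9,} | {lost:>6,} | {label}")
print(f"{final_kept:,} kept, {total_lost:,} lost, {pct_kept}% of assays attempted")
```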
Mean intensity of X chromosome probes versus mean intensity of Y chromosome probes for each sample. Intensity for each probe is the sum of the normalized intensity for each of the two channels. Sample sizes are 40,097 probes for the X chromosome and 2,283 probes for the Y chromosome. The standard error of the mean intensity for each sample ranges from 0.002 to 0.004 for the X chromosome and 0.007 to 0.018 for the Y chromosome.
Schematic diagram showing how trisomic cells (or a mixture of disomic and monosomic cells) can result in two different polar coordinate angle (θ) positions for heterozygotes at different loci. “A” and “B” represent two alleles at one locus, where the former is tagged with Cy3 and the latter with Cy5. Loci with allele B on the duplicated chromosome have ABB heterozygotes and those with allele A on the duplicated chromosome have AAB heterozygotes.
Scan of BAlleleFreq (normalized polar coordinate angle) across chromosome 1 for female cell line sample I (different sample and subject from those in Figure 5).
Scan of BAlleleFreq (normalized polar coordinate angle) across the X chromosome for female cell line sample I (different sample and subject from those in Figure 5).
Scan of BAlleleFreq (normalized polar coordinate angle) across chromosome 9 for female cell line sample J (different sample and subject from Figures 4 and 5b).
Scan of BAlleleFreq (normalized polar coordinate angle) across chromosome 11 for female blood sample K (different sample and subject from Figures 4 and 5a).
Identity-by-descent coefficients for all pairs of subjects having a kinship coefficient estimate > 0.05 (estimated with the method of moments using PLINK software on all subjects, including HapMap controls, and using ~80k autosomal SNPs).
Estimates of the kinship coefficient for all pairs of samples with values > 0.05. In both cases, the subpopulation of black subjects was used. The PLINK estimate (method of moments) used 80k SNPs and the maximum likelihood estimate used 10k SNPs. The red line has a slope of 1 and an intercept of 0.
Principal components 1 and 2 calculated for all unduplicated samples using all autosomal SNPs having a missing call rate <5% (n=994,059). Self-identified ethnicity is indicated by the color and symbol coding. The percent of variation accounted for by each PC is given on the axis label.
Pairwise plots of principal components 1-4. As in Figure 8, these were calculated for all unduplicated samples using all autosomal SNPs having a missing call rate <5% (n=994,059). See Figure 8 for legend. The fraction of variance accounted for by each principal component is given.
Comparison of PC1 calculated with all unduplicated samples (as in Figure 8, which includes some related subjects) and PC1 calculated with a set of unrelated subjects. The correlation is 0.999999951. The red line has an intercept of 0 and slope of 1.
Comparison of PC2 calculated with all unduplicated samples (as in Figure 8, which includes some related subjects) and PC2 calculated with a set of unrelated subjects. The correlation is 0.999973123. The red line has an intercept of 0 and slope of 1.
PC1 versus PC2 for 1254 self-identified black and non-Hispanic study subjects. Color coding is for each of the 8 study sites in which subjects were recruited and the symbols distinguish each of the three studies (square = COGA, triangle = COGEND, x = FSCD).
Histogram of the distribution of the missing call rate per sample (“missing.e2”) for all 4324 of the samples originally released.
Boxplot showing the distribution of the mean autosomal missing call rate (“missing.e2”) for 213 genotyping batches.
Plot of genotyping batch ethnic composition versus a normalized chi-squared statistic measuring the allelic frequency differences between each batch and a pool of the other batches. The overall fraction of self-identified white samples is 0.67.
Distribution of log10 of the autosomal missing rate per sample (“missing.e2”) for the three classes of subjects.
Distribution of log10 of the autosomal missing rate per sample (“missing.e2”) for samples derived from two tissue types.
The number of SNPs versus the number of families (out of a total of 22 parent-offspring pairs or trios) with at least one Mendelian error detected. The number of SNPs with no errors (offscale) is 99.1% (1,031,181 out of 1,040,106).
The number of SNPs versus the number of subjects (out of a total of 104) with at least one genotype call discordance detected. The number of SNPs with no discordant calls (offscale) is 97.7% (1,016,527 out of 1,040,106).
The discordance rate distribution for 205 duplicate pairs of samples involving 104 different subjects, classified by self-identified ethnicity (48 black and 56 white).
QQ plots for the exact test of Hardy-Weinberg equilibrium on autosomal SNPs. SNPs included in the plots were filtered as in Table 8 (rows preceding “Hardy-Weinberg p-value”). The red line has a slope of 1 and an intercept of 0.
QQ plots for the exact test of Hardy-Weinberg equilibrium on X chromosome SNPs in females. SNPs included in the plots were filtered as in Table 8 (rows preceding “Hardy-Weinberg p-value”). The red line has a slope of 1 and an intercept of 0.
Distribution of estimated inbreeding coefficient within the African-American and European-American ethnic groups. The mean of the distribution is 0.0010 for European-Americans and 0.0004 for African-Americans.
Estimated inbreeding coefficient for SNPs with Hardy-Weinberg Equilibrium test p-value < 0.01 versus position on the chromosome.
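For orientation, a per-SNP inbreeding coefficient is commonly estimated as one minus the ratio of observed to expected heterozygosity. The Python sketch below assumes that estimator; the captions here do not state which estimator was used for these figures, and the function name is hypothetical.

```python
def inbreeding_coefficient(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Estimate f = 1 - H_obs / H_exp from genotype counts at one SNP.

    H_exp = 2p(1-p) is the heterozygosity expected under Hardy-Weinberg
    equilibrium, with p the observed frequency of the A allele.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)
    h_exp = 2 * p * (1 - p)
    if h_exp == 0:
        raise ValueError("monomorphic SNP: f is undefined")
    h_obs = n_ab / n
    return 1 - h_obs / h_exp


print(inbreeding_coefficient(25, 50, 25))  # genotype counts at HWE proportions
print(inbreeding_coefficient(50, 0, 50))   # complete heterozygote deficit
print(inbreeding_coefficient(0, 100, 0))   # complete heterozygote excess
```

Under this estimator, positive values indicate a deficit of heterozygotes relative to Hardy-Weinberg expectations and negative values indicate an excess.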
Examples of cluster plots for SNPs that have extreme deviations from Hardy-Weinberg equilibrium. Red, green and blue represent AA, AB and BB genotypes, respectively, and the “x” symbol represents missing genotype calls. The upper left and lower right panels may represent SNPs with null alleles. The upper right and lower left panels may represent SNPs in which one homozygous class is missing, presumably because of poor cluster definition and separation.
QQ plot for likelihood ratio test from logistic regression models (see text for description). The SNPs included in the plot were filtered as described in Table 8. The genomic control factor for (a) all unrelated subjects is 1.04, (b) European-Americans is 1.03 and (c) African-Americans is 1.01.
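A genomic control factor is often computed as the median of the observed test statistics divided by the median of the chi-squared distribution expected under the null hypothesis (Devlin and Roeder 1999). The Python sketch below assumes that median-based estimator and 1-degree-of-freedom test statistics; the caption does not state how the factors reported above were calculated, and the function name is hypothetical.

```python
from statistics import NormalDist, median

# Median of the chi-squared distribution with 1 degree of freedom.
# A chi-squared(1) variable is the square of a standard normal variable,
# so its median is the squared 75th percentile of the standard normal (~0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(chi2_stats):
    """Median-based genomic control factor for 1-df chi-squared statistics."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Toy statistics (invented for illustration) whose median equals the null median.
null_like = [0.1, CHI2_1DF_MEDIAN, 2.0]
inflated = [1.04 * s for s in null_like]
print(genomic_control_lambda(null_like))
print(genomic_control_lambda(inflated))
```

A value close to 1 suggests little inflation of the test statistics from residual population structure or other systematic effects.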