Multifactor dimensionality reduction analysis identifies specific nucleotide patterns promoting genetic polymorphisms

BioData Min. 2009 Mar 30;2(1):2. doi: 10.1186/1756-0381-2-2.

Abstract

Background: The fidelity of DNA replication serves as the nidus for both genetic evolution and genomic instability fostering disease. Single nucleotide polymorphisms (SNPs) constitute greater than 80% of the genetic variation between individuals. A new theory regarding DNA replication fidelity has emerged in which selectivity is governed by base-pair geometry through interactions between the selected nucleotide, the complementary strand, and the polymerase active site. We hypothesize that specific nucleotide combinations in the flanking regions of SNP fragments are associated with mutation.

Results: We modeled the relationship between DNA sequence and observed polymorphisms using the novel multifactor dimensionality reduction (MDR) approach. MDR was originally developed to detect synergistic interactions between multiple SNPs that are predictive of disease susceptibility. We initially assembled data from the Broad Institute as a pilot test for the hypothesis that flanking region patterns associate with mutagenesis (n = 2194). We then confirmed and expanded our inquiry with human SNPs within coding regions and their flanking sequences collected from the National Center for Biotechnology Information (NCBI) database (n = 29967) and a control set of coding-region sequences not associated with SNP sites, randomly selected from the NCBI database (n = 29967). We discovered seven flanking region pattern associations in the Broad dataset that reached a minimum significance level of p ≤ 0.05. Significant models (p << 0.001) were detected for each SNP type examined in the larger NCBI dataset. Importantly, the flanking region models were elongated or truncated depending on the nucleotide change. Additionally, nucleotide distributions differed significantly at motif sites relative to the type of variation observed. The MDR approach effectively discerned specific sites within the flanking regions of observed SNPs and their respective identities, supporting the collective contribution of these sites to SNP genesis.
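
To make the core MDR step concrete, the following is a minimal sketch and not the authors' implementation. It assumes equal-length flanking sequences, treats SNP-associated sequences as cases and control sequences as controls, and scores each candidate set of flanking positions by balanced accuracy on the full data (published MDR analyses typically add cross-validation and permutation testing, which are omitted here). The names `mdr_score` and `best_model` and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def mdr_score(cases, controls, positions):
    """Balanced accuracy of the high/low-risk labeling for one set of positions."""
    def cell(seq):
        # A multifactor "cell" is the tuple of nucleotides at the chosen positions.
        return tuple(seq[p] for p in positions)

    case_counts = Counter(cell(s) for s in cases)
    ctrl_counts = Counter(cell(s) for s in controls)
    threshold = len(cases) / len(controls)  # overall case:control ratio

    # A cell is "high risk" when its case:control ratio exceeds the overall ratio.
    high_risk = {
        c for c in set(case_counts) | set(ctrl_counts)
        if case_counts[c] > threshold * ctrl_counts[c]
    }

    sensitivity = sum(case_counts[c] for c in high_risk) / len(cases)
    specificity = sum(
        n for c, n in ctrl_counts.items() if c not in high_risk
    ) / len(controls)
    return (sensitivity + specificity) / 2


def best_model(cases, controls, n_positions, order):
    """Exhaustively search all position sets of the given size."""
    return max(
        combinations(range(n_positions), order),
        key=lambda pos: mdr_score(cases, controls, pos),
    )


# Toy data (invented): cases differ from controls only through the joint
# state of positions 0 and 2, so no single position is informative alone.
cases = ["AAG", "AGG", "GAA", "GGA"]
controls = ["AAA", "AGA", "GAG", "GGG"]

print(mdr_score(cases, controls, (0,)))    # single position
print(mdr_score(cases, controls, (0, 2)))  # interacting pair
print(best_model(cases, controls, 3, 2))
```

In this toy example a single position scores no better than chance, while the pair of positions 0 and 2 separates the two groups, which is the kind of nonlinear, multi-site pattern MDR is designed to detect.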

Conclusion: The present study represents the first use of this computational methodology for modeling nonlinear patterns in molecular genetics. MDR was able to identify distinct nucleotide patterning around sites of mutations dependent upon the observed nucleotide change. We discovered one flanking region set that included five nucleotides clustered around a specific type of SNP site. Based on the strongly associated patterns identified in this study, it may become possible to scan genomic databases for such clustering of nucleotides in order to predict likely sites of future SNPs, and even the type of polymorphism most likely to occur.
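
As a rough illustration of the scanning idea, the sketch below reports every position in a sequence whose flanking nucleotides match a given pattern. It is only a sketch under stated assumptions: the function name `scan_for_pattern`, the offset-to-nucleotide representation, and the example pattern are all hypothetical and are not patterns reported in this study.

```python
def scan_for_pattern(sequence, pattern):
    """Return indices whose flanking bases match every offset -> base entry.

    `pattern` maps an offset relative to the candidate site (negative is
    upstream, positive is downstream) to the required nucleotide.
    """
    upstream = -min(min(pattern), 0)   # bases needed before the site
    downstream = max(max(pattern), 0)  # bases needed after the site
    hits = []
    for i in range(upstream, len(sequence) - downstream):
        if all(sequence[i + off] == base for off, base in pattern.items()):
            hits.append(i)
    return hits


# Hypothetical pattern: C at -2, G at -1, and T at +1 around the candidate site.
example_pattern = {-2: "C", -1: "G", 1: "T"}
print(scan_for_pattern("ACGATCCGCTA", example_pattern))
```

Positions returned by such a scan would be candidates only; whether they correspond to likely SNP sites would depend on the strength of the underlying association.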