An efficient algorithm for optimizing whole genome alignment with noise

Prudence W H Wong; T W Lam; N Lu; H F Ting; S M Yiu

doi:10.1093/bioinformatics/bth308

Bioinformatics. 2004 Nov 1;20(16):2676-84. doi: 10.1093/bioinformatics/bth308. Epub 2004 May 14.

Authors

Prudence W H Wong¹, T W Lam, N Lu, H F Ting, S M Yiu

Affiliation

¹ Department of Computer Science, University of Hong Kong, Hong Kong. whwong@cs.hku.hk

PMID: 15145812
DOI: 10.1093/bioinformatics/bth308

Abstract

Motivation: This paper is concerned with algorithms for aligning two whole genomes so as to identify regions that possibly contain conserved genes. Motivated by existing heuristic-based software tools, we initiate the study of an optimization problem that attempts to uncover conserved genes with a global concern. Another interesting feature in our formulation is the tolerance of noise, which also complicates the optimization problem. A brute-force approach takes time exponential in the noise level.

Results: We show how an insight into the optimization structure can lead to a drastic improvement in the time and space requirements [precisely, to O(k²n²) and O(k²n), respectively, where n is the size of the input and k is the noise level]. The reduced space requirement allows us to implement the new algorithm, called MaxMinCluster, on a PC. It is exciting to see that when tested with different real data sets, MaxMinCluster consistently uncovers a high percentage of conserved genes that have been published by GenBank. Its performance indeed compares favorably with that of MUMmer (perhaps the most popular software tool for uncovering conserved genes on a whole-genome scale).

Availability: The source code is available from the website http://www.csis.hku.hk/~colly/maxmincluster/. Detailed proofs of the propositions can also be found there.

Publication types

Comparative Study
Evaluation Study

MeSH terms

Algorithms*
Chromosome Mapping / methods*
Cluster Analysis
Conserved Sequence / genetics
Pattern Recognition, Automated / methods
Sequence Alignment / methods*
Sequence Analysis, DNA / methods*
Sequence Analysis, Protein / methods*
Sequence Homology, Amino Acid
Sequence Homology, Nucleic Acid
Stochastic Processes