Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms

W R Pearson

doi:10.1016/0888-7543(91)90071-l

Genomics. 1991 Nov;11(3):635-50. doi: 10.1016/0888-7543(91)90071-l.

Author

W R Pearson¹

Affiliation

¹ Department of Biochemistry, University of Virginia, Charlottesville 22908.

PMID: 1774068

Abstract

The sensitivity and selectivity of the FASTA and the Smith-Waterman protein sequence comparison algorithms were evaluated using the superfamily classification provided in the National Biomedical Research Foundation/Protein Identification Resource (PIR) protein sequence database. Sequences from each of the 34 superfamilies in the PIR database with 20 or more members were compared against the protein sequence database. The similarity scores of the related and unrelated sequences were determined using either the FASTA program or the Smith-Waterman local similarity algorithm. These two sets of similarity scores were used to evaluate the ability of the two comparison algorithms to identify distantly related protein sequences. The FASTA program using the ktup = 2 sensitivity setting performed as well as the Smith-Waterman algorithm for 19 of the 34 superfamilies. Increasing the sensitivity by setting ktup = 1 allowed FASTA to perform as well as Smith-Waterman on an additional 7 superfamilies. The rigorous Smith-Waterman method performed better than FASTA with ktup = 1 on 8 superfamilies, including the globins, immunoglobulin variable regions, calmodulins, and plastocyanins. Several strategies for improving the sensitivity of FASTA were examined. The greatest improvement in sensitivity was achieved by optimizing a band around the best initial region found for every library sequence. For every superfamily except the globins and immunoglobulin variable regions, this strategy was as sensitive as a full Smith-Waterman. For some sequences, additional sensitivity was achieved by including conserved but nonidentical residues in the lookup table used to identify the initial region.
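The two ideas the abstract contrasts, a full Smith-Waterman local similarity score and a FASTA-style ktup lookup followed by optimization within a band around the best initial region, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or from the FASTA program: the function names `sw_score` and `best_diagonal` are hypothetical, the scoring uses a simple match/mismatch scheme with a linear gap penalty rather than a substitution matrix, and the "best initial region" is approximated by counting ktup word hits per diagonal.

```python
from collections import defaultdict


def sw_score(a, b, match=2, mismatch=-1, gap=-2, band=None):
    """Best local alignment score between a and b (linear gap penalty).

    band=(center, half_width) restricts the computation to cells whose
    diagonal (j - i) lies within half_width of center. Cells outside the
    band keep a score of 0, so they can only act as alignment start points.
    Scoring values are illustrative assumptions, not the paper's settings.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if band is not None and abs((j - i) - band[0]) > band[1]:
                continue  # outside the band: leave the cell at 0
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def best_diagonal(query, target, ktup=2):
    """Diagonal (j - i) with the most shared words of length ktup.

    A simplified stand-in for finding the best initial region: the query
    words are stored in a lookup table, and each matching word in the
    target votes for the diagonal it lies on. Returns None if no word
    is shared.
    """
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    hits = defaultdict(int)
    for j in range(len(target) - ktup + 1):
        for i in table.get(target[j:j + ktup], ()):
            hits[j - i] += 1
    if not hits:
        return None
    return max(hits, key=hits.get)


def banded_search_score(query, target, ktup=2, half_width=8):
    """Lookup step followed by banded optimization around the best diagonal."""
    center = best_diagonal(query, target, ktup)
    if center is None:
        return 0  # no shared word, so no band is optimized in this sketch
    return sw_score(query, target, band=(center, half_width))
```

In this sketch, `banded_search_score` can never exceed `sw_score` with `band=None` for the same pair, since the band only removes cells from consideration. Lowering `ktup` from 2 to 1 makes the lookup step more permissive, which mirrors the sensitivity setting described in the abstract, at the cost of more word hits to process.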

Publication types

Comparative Study
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms*
Amino Acid Sequence*
Databases, Factual*
Gene Library
Information Storage and Retrieval*
Molecular Sequence Data
Proteins / classification*
Sensitivity and Specificity
Sequence Alignment
Software

Substances

Proteins

Grants and funding

LM04969/LM/NLM NIH HHS/United States