Using BLAST for identifying gene and protein names in journal articles

Gene. 2000 Dec 23;259(1-2):245-52. doi: 10.1016/s0378-1119(00)00431-5.

Abstract

We describe a system that automatically identifies gene and protein names in journal articles, an important and non-trivial first step in extracting knowledge about protein and gene actions. Our system uses a database of gene and protein names and is based on BLAST [Altschul et al., Nucleic Acids Res. 25 (1997) 3389-3402], a popular tool for DNA and protein sequence comparison. The method maps sequences of text characters into sequences of nucleotides that can be processed by BLAST. We demonstrate that this approach is feasible: the system matches gene and protein names, including names that are not part of the system database, with a recall of 78.8% and a precision of 71.7%. An analysis of the results suggests techniques that could improve performance further.
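
The abstract does not give the actual character-to-nucleotide mapping, so the following is only a minimal sketch of the general idea under stated assumptions: each text character is assigned a fixed-length code of three nucleotides (4^3 = 64 possible codes), text is lower-cased before encoding, and characters outside a small alphabet share one reserved code. The names `ALPHABET`, `CODE_LENGTH`, `encode`, and `decode` are hypothetical and chosen for illustration; the published system may use a different scheme.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
CODE_LENGTH = 3  # assumption: 3 nucleotides per character, giving 64 codes

# Assumption: a small alphabet of characters common in gene/protein names.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 -/.,()"

# All 64 three-letter nucleotide codes in lexicographic order: AAA, AAC, ...
CODES = ["".join(p) for p in product(NUCLEOTIDES, repeat=CODE_LENGTH)]

# Pair each alphabet character with one code; the last code is reserved
# for any character that is not in the alphabet.
CHAR_TO_CODE = dict(zip(ALPHABET, CODES))
CODE_TO_CHAR = {code: ch for ch, code in CHAR_TO_CODE.items()}
UNKNOWN_CODE = CODES[-1]


def encode(text):
    """Translate text into a nucleotide string, one code per character."""
    return "".join(CHAR_TO_CODE.get(ch, UNKNOWN_CODE) for ch in text.lower())


def decode(sequence):
    """Translate a nucleotide string back into text ('?' for unknown codes)."""
    usable = len(sequence) - len(sequence) % CODE_LENGTH
    return "".join(
        CODE_TO_CHAR.get(sequence[i:i + CODE_LENGTH], "?")
        for i in range(0, usable, CODE_LENGTH)
    )
```

Under this sketch, a name such as "p53 protein" and a sentence from an article would both be encoded with `encode`, and the resulting nucleotide strings could then be written out as sequences for a BLAST database and query. A fixed code length keeps character boundaries aligned, so a match in nucleotide space can be mapped back to a character span with `decode`.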

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Base Sequence
  • Databases as Topic
  • Genes*
  • Information Storage and Retrieval / methods*
  • Molecular Sequence Data
  • Proteins*
  • Sequence Alignment
  • Sequence Homology, Nucleic Acid
  • Software

Substances

  • Proteins