GNormPlus: An Integrative Approach for Tagging Gene, Gene Family and Protein Domain

Authors: Chih-Hsuan Wei, Hung-Yu Kao and Zhiyong Lu (PI)

Research highlights

GNormPlus is an end-to-end system that handles both name detection and identifier assignment for genes and proteins in biomedical literature, covering gene/protein mentions, family names and domain names. It also integrates several text-mining tools (GenNorm, SR4GN, SimConcept, Ab3P and CRF++), which among other things lets it resolve composite gene names. On two public benchmark datasets, we show that GNormPlus compares favorably with other state-of-the-art methods.

Method overview

Our approach has two main steps: mention recognition and concept normalization. In the mention recognition step, we developed a new module based on CRF++ and combined it with our earlier species recognition system, SR4GN, to recognize gene and species names and match genes to species. In the concept normalization step, we applied our earlier system, GenNorm, together with a composite mention simplification tool (SimConcept) and an abbreviation resolution tool (Ab3P) to improve performance.
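The data flow of the two steps can be illustrated with a minimal sketch. This is not the GNormPlus implementation: the regex recognizer stands in for the CRF++ model, the slash-splitting rule stands in for SimConcept, the species is passed in by the caller instead of being inferred by SR4GN, abbreviation resolution (Ab3P) is left out, and the two-entry lexicon and all function names are hypothetical.

```python
import re
from dataclasses import dataclass

# Toy lexicon (illustrative only): (NCBI Taxonomy ID, gene symbol) -> NCBI Gene ID.
TOY_LEXICON = {
    ("9606", "BRCA1"): "672",
    ("9606", "BRCA2"): "675",
}


@dataclass
class Mention:
    start: int
    end: int
    text: str


def recognize_mentions(text):
    """Step 1 stand-in: a regex for upper-case symbols, not a trained CRF."""
    pattern = re.compile(r"\b[A-Z][A-Z0-9]+(?:/[0-9]+)?")
    return [Mention(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def simplify_composite(mention_text):
    """Stand-in for composite mention simplification: 'BRCA1/2' -> ['BRCA1', 'BRCA2']."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)/(\d+)", mention_text)
    if match is None:
        return [mention_text]
    prefix, first, second = match.groups()
    return [prefix + first, prefix + second]


def normalize(mention, species_id):
    """Step 2 stand-in: exact lexicon lookup for each simplified name."""
    ids = []
    for name in simplify_composite(mention.text):
        gene_id = TOY_LEXICON.get((species_id, name))
        if gene_id is not None:
            ids.append(gene_id)
    return ids


def tag_genes(text, species_id):
    """Run recognition, then normalization; keep only mentions that resolve to an ID."""
    results = []
    for mention in recognize_mentions(text):
        ids = normalize(mention, species_id)
        if ids:
            results.append((mention.text, mention.start, mention.end, ids))
    return results
```

Calling `tag_genes("Mutations in BRCA1/2 increase cancer risk.", "9606")` yields one composite mention that resolves to two identifiers, which is the kind of case the composite simplification step is meant to handle.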

Results

We ran two evaluations. The first is a species-specific experiment in which only human genes are considered: GNormPlus was evaluated on the BioCreative II GN test set and compared with several previously reported systems, including our earlier system, GenNorm (Table 1). The second evaluates multi-species gene normalization on the BioCreative III GN task data set (Table 2). GNormPlus performs competitively in both evaluations, with the highest F-measure among the listed tools in each table (86.7% and 50.1%).

Open source tools   Precision   Recall   F-measure
GNormPlus           87.1%       86.4%    86.7%
GenNorm             78.9%       81.4%    80.1%
GNAT                90.7%       82.4%    86.4%

Table 1. The evaluation of human species gene normalization on the BioCreative II GN test set.

Open source tools   TAP-5   TAP-10   TAP-20   F-measure
GNormPlus           33.3%   36.7%    36.7%    50.1%
GenNorm             32.8%   35.5%    35.5%    46.9%
GeneTuKit           29.7%   31.4%    32.5%    -

Table 2. The evaluation of multiple species gene normalization on the BioCreative III GN test set.
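
As a quick consistency check on Table 1, the F-measure column can be recomputed from the precision and recall columns. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall), which this page does not state explicitly; the function and variable names are illustrative. Table 2 cannot be checked the same way because it reports no precision or recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-measure %) copied from Table 1.
TABLE_1 = {
    "GNormPlus": (87.1, 86.4, 86.7),
    "GenNorm": (78.9, 81.4, 80.1),
    "GNAT": (90.7, 82.4, 86.4),
}


def recomputed_f_measures(table):
    """Recompute F-measure per tool, rounded to one decimal like the table."""
    return {tool: round(f_measure(p, r), 1) for tool, (p, r, _) in table.items()}


if __name__ == "__main__":
    for tool, value in recomputed_f_measures(TABLE_1).items():
        print(f"{tool}: recomputed {value}%, reported {TABLE_1[tool][2]}%")
```

Under that assumption, the recomputed values agree with the reported column for all three tools after rounding to one decimal place.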

Downloads

GNormPlus Software in Java or Perl
GNormPlus Corpus
GNormPlus-tagged PubMed results in PubTator
GNormPlus RESTful API

Please cite

  • Wei C-H, Kao H-Y, Lu Z. GNormPlus: An Integrative Approach for Tagging Gene, Gene Family and Protein Domain. BioMed Research International Journal, Text Mining for Translational Bioinformatics special issue, Article ID 918710; DOI: dx.doi.org/10.1155/2015/918710 (2015)