The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.
Corpus Characteristics
- 793 PubMed abstracts
- 6,892 disease mentions
- 790 unique disease concepts
  - Medical Subject Headings (MeSH®)
  - Online Mendelian Inheritance in Man (OMIM®)
- 91% of the mentions map to a single disease concept
- Divided into training, development, and testing sets
Corpus Annotation
- Fourteen annotators
- Two annotators per document (randomly paired)
- Three annotation phases
- Checked for corpus-wide consistency of annotations
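
The page says only that annotators were "randomly paired", so the exact assignment procedure is not known. As a rough illustration, the sketch below assumes each document independently receives two distinct annotators drawn uniformly at random. The function name `assign_annotator_pairs`, the annotator labels, and the document IDs are hypothetical placeholders, not part of the corpus release.

```python
import random


def assign_annotator_pairs(doc_ids, annotators, seed=0):
    """Give every document a randomly drawn pair of distinct annotators.

    Assumption: pairs are sampled independently per document; the actual
    procedure used for the corpus may have balanced workloads differently.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    return {doc_id: tuple(rng.sample(annotators, 2)) for doc_id in doc_ids}


# Hypothetical labels: 14 annotators and 793 abstracts, matching the counts above.
annotators = [f"annotator_{i:02d}" for i in range(1, 15)]
doc_ids = [f"doc_{i:03d}" for i in range(1, 794)]

pairs = assign_annotator_pairs(doc_ids, annotators)
```

Because `random.sample` draws without replacement, no document is assigned the same annotator twice.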

An improved corpus of disease mentions in PubMed citations (ACL-WEB link)

NCBI Disease Corpus: A Resource for Disease Name Recognition and Normalization (PubMed link)

Disease Name Normalization with Pairwise Learning to Rank (PubMed link)

Fig. 1. Illustration of the annotation process, in which 12 annotators worked in pairs on 793 PubMed abstracts for disease name recognition, covering all the sentences in every PubMed citation.
Revised: August 27, 2013.