Rezarta Islamaj Dogan's Research Page

The NCBI Disease Corpus

At a glance

The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.

Corpus Characteristics

793 PubMed abstracts
6,892 disease mentions
790 unique disease concepts
- Medical Subject Headings (MeSH®)
- Online Mendelian Inheritance in Man (OMIM®)
91% of the mentions map to a single disease concept
Divided into training, development, and test sets
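
As a rough sanity check on the figures above, the following minimal sketch derives a few ratios from the published counts. The function name `corpus_ratios` and the dictionary keys are illustrative choices, not part of the corpus distribution. Because the 91% figure is rounded, the derived count of single-concept mentions is only approximate.

```python
# Figures quoted in the corpus characteristics above.
ABSTRACTS = 793
MENTIONS = 6892
UNIQUE_CONCEPTS = 790
SINGLE_CONCEPT_SHARE = 0.91  # reported as 91%, so derived counts are approximate


def corpus_ratios(abstracts=ABSTRACTS, mentions=MENTIONS,
                  concepts=UNIQUE_CONCEPTS, single_share=SINGLE_CONCEPT_SHARE):
    """Return simple ratios derived from the published corpus counts.

    This is a hypothetical helper for illustration only; it does not read
    the corpus files themselves.
    """
    return {
        "mentions_per_abstract": mentions / abstracts,
        "mentions_per_concept": mentions / concepts,
        "approx_single_concept_mentions": round(mentions * single_share),
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.2f}")
```

On these numbers, each abstract carries a little under nine disease mentions on average.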

Corpus Annotation

Fourteen annotators
Two annotators per document (randomly paired)
Three annotation phases
Checked for corpus-wide consistency of annotations
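
The random pairing described above can be illustrated with a short sketch. This page does not spell out the exact assignment procedure, so the code below rests on an assumption: each document independently receives two distinct annotators drawn uniformly at random. The function name `assign_random_pairs` and the annotator labels are hypothetical.

```python
import random
from collections import Counter


def assign_random_pairs(doc_ids, annotators, seed=None):
    """Assign two distinct, randomly chosen annotators to each document.

    Assumption: pairs are drawn independently per document. The actual
    corpus may have been paired differently (for example, per batch or
    per annotation phase).
    """
    rng = random.Random(seed)
    # random.sample draws without replacement, so the two annotators differ.
    return {doc_id: tuple(rng.sample(annotators, 2)) for doc_id in doc_ids}


if __name__ == "__main__":
    annotators = [f"annotator_{i}" for i in range(1, 15)]  # fourteen placeholder labels
    assignment = assign_random_pairs(range(793), annotators, seed=0)

    # Count how many documents each annotator received.
    workload = Counter(a for pair in assignment.values() for a in pair)
    for name, count in sorted(workload.items()):
        print(name, count)
```

Independent sampling keeps the sketch simple but does not balance workload exactly. A scheme that rotates through shuffled pairs would even it out, at the cost of a little more bookkeeping.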

Publications

An improved corpus of disease mentions in PubMed citations (ACL-WEB link)

NCBI Disease Corpus: A Resource for Disease Name Recognition and Normalization (PubMed link)

Disease Name Normalization with Pairwise Learning to Rank (PubMed link)

Fig. 1. Illustration of the annotation process, in which 12 annotators worked in pairs on 793 PubMed abstracts, annotating disease names in every sentence of each PubMed citation.

We welcome your feedback:

Rezarta Islamaj Doğan

Robert Leaman

Zhiyong Lu

Revised: August 27, 2013.