Ontology-guided feature engineering for clinical text classification

J Biomed Inform. 2012 Oct;45(5):992-8. doi: 10.1016/j.jbi.2012.04.010. Epub 2012 May 9.

Abstract

In this study, we present novel feature engineering techniques that leverage the biomedical domain knowledge encoded in the Unified Medical Language System (UMLS) to improve machine-learning-based clinical text classification. Critical steps in clinical text classification include identifying the features and passages relevant to the classification task, and representing clinical text so that documents of different classes can be discriminated. We developed information-theoretic techniques that use the taxonomic structure of the UMLS to improve feature ranking, and a semantic similarity measure that projects clinical text into a feature space that improves classification. We evaluated these methods on the 2008 Informatics for Integrating Biology and the Bedside (i2b2) obesity challenge. The methods improve upon the results of the challenge's top machine-learning-based system and may improve the performance of other machine-learning-based clinical text classification systems. All tools developed as part of this study have been released as open source and are available at http://code.google.com/p/ytex.
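To make the idea of taxonomy-aware feature ranking concrete, here is a minimal sketch in Python. It is not the method from the paper, whose details the abstract does not give. It only illustrates one plausible reading: score each concept by the information gain of the feature "this concept or any of its descendants appears in the document". The toy hierarchy, concept names, documents, labels, and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(present, labels):
    """Information gain of a binary feature with respect to the labels."""
    with_f = [y for p, y in zip(present, labels) if p]
    without_f = [y for p, y in zip(present, labels) if not p]
    n = len(labels)
    conditional = ((len(with_f) / n) * entropy(with_f)
                   + (len(without_f) / n) * entropy(without_f))
    return entropy(labels) - conditional


def descendants(concept, children):
    """Return the concept plus everything below it in the is-a hierarchy."""
    seen = {concept}
    stack = [concept]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def rank_concepts(docs, labels, children):
    """Rank concepts by the information gain of 'concept or descendant present'."""
    concepts = set(children)
    concepts.update(c for kids in children.values() for c in kids)
    concepts.update(c for doc in docs for c in doc)
    scores = {}
    for concept in concepts:
        family = descendants(concept, children)
        present = [bool(family & doc) for doc in docs]
        scores[concept] = information_gain(present, labels)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: a two-level hierarchy and four labeled documents,
# each represented as the set of concepts found in its text.
children = {"cardiovascular_disease": ["hypertension", "heart_failure"]}
docs = [{"hypertension"}, {"heart_failure"}, {"asthma"}, set()]
labels = ["Y", "Y", "N", "N"]

ranking = rank_concepts(docs, labels, children)
for concept, score in ranking:
    print(f"{concept}: {score:.3f}")
```

In this toy data the parent concept separates the two classes better than either child alone, because it pools the evidence from both children. That pooling is the intuition behind using a taxonomy during feature ranking.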

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Cardiovascular Diseases
  • Data Mining
  • Databases as Topic / classification
  • Humans
  • Medical Informatics Applications
  • Models, Theoretical
  • Natural Language Processing*
  • Obesity
  • Semantics
  • Unified Medical Language System