Using symbolic knowledge in the UMLS to disambiguate words in small datasets with a Naïve Bayes classifier

Gondy Leroy, Thomas C. Rindflesch

Research output: Contribution to journal, Article, peer-review

4 Scopus citations


Current approaches to word sense disambiguation use and combine various machine-learning techniques. Most refer to characteristics of the ambiguous word and surrounding words and are based on hundreds of examples. Unfortunately, developing large training sets is time-consuming. We investigate the use of symbolic knowledge to augment machine-learning techniques for small datasets. UMLS semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. A naïve Bayes classifier was trained for 15 words with 100 examples for each. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in eight experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. In a follow-up evaluation, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators.
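The evaluation setup described in the abstract (a naïve Bayes classifier over semantic-type features, a most-frequent-sense baseline, and accuracy under 10-fold cross-validation) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the fold assignment, the Laplace smoothing, and the tiny `TOY` dataset for an ambiguous word such as "cold" are all assumptions made for the example. The UMLS semantic-type abbreviations (`dsyn`, `sosy`, `phsu`, `qnco`, `tmco`, `bpoc`) serve only as illustrative feature labels.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Fit a multinomial naive Bayes model.

    examples: list of (set_of_semantic_types, sense) pairs (hypothetical format).
    """
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        feat_counts[sense].update(feats)
        vocab.update(feats)
    return {"senses": sense_counts, "feats": feat_counts,
            "vocab": vocab, "n": len(examples)}


def predict_nb(model, feats):
    """Return the sense with the highest log-posterior (Laplace smoothing)."""
    best_sense, best_lp = None, -math.inf
    for sense, count in sorted(model["senses"].items()):
        lp = math.log(count / model["n"])  # log prior
        total = sum(model["feats"][sense].values())
        for f in feats:
            if f not in model["vocab"]:
                continue  # ignore features never seen in training
            lp += math.log((model["feats"][sense][f] + 1)
                           / (total + len(model["vocab"])))
        if lp > best_lp:
            best_sense, best_lp = sense, lp
    return best_sense


def nb_fit(train):
    """Train naive Bayes and return a prediction function."""
    model = train_nb(train)
    return lambda feats: predict_nb(model, feats)


def baseline_fit(train):
    """Most-frequent-sense baseline: always predict the majority sense."""
    majority = Counter(sense for _, sense in train).most_common(1)[0][0]
    return lambda feats: majority


def cross_validate(examples, k, fit):
    """Accuracy under k-fold cross-validation (fold = index modulo k)."""
    correct = 0
    for fold in range(k):
        train = [ex for i, ex in enumerate(examples) if i % k != fold]
        test = [ex for i, ex in enumerate(examples) if i % k == fold]
        predict = fit(train)
        correct += sum(predict(feats) == sense for feats, sense in test)
    return correct / len(examples)


# Invented toy data: semantic types found in a sentence, paired with a sense.
TOY = (
    [({"dsyn", "sosy"}, "illness")] * 6
    + [({"dsyn", "phsu"}, "illness")] * 6
    + [({"qnco", "tmco"}, "temperature")] * 4
    + [({"qnco", "bpoc"}, "temperature")] * 4
)

if __name__ == "__main__":
    print(f"naive Bayes accuracy: {cross_validate(TOY, 10, nb_fit):.2f}")
    print(f"baseline accuracy:    {cross_validate(TOY, 10, baseline_fit):.2f}")
```

The toy features separate the two senses cleanly, so the numbers this sketch prints say nothing about the results reported in the abstract. It shows only how a classifier's accuracy and the baseline's accuracy can be compared on the same folds.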

Original language: English (US)
Pages (from-to): 381-385
Number of pages: 5
Journal: Studies in health technology and informatics
State: Published - 2004


  • Artificial intelligence
  • UMLS
  • Unified Medical Language System
  • machine learning
  • naïve Bayes
  • small datasets
  • symbolic knowledge
  • word sense disambiguation

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Informatics
  • Health Information Management

