Using Lexical Chains to Identify Text Difficulty: A Corpus Statistics and Classification Study

Partha Mukherjee, Gondy Leroy, David Kauchak

Research output: Contribution to journal › Article › peer-review

14 Scopus citations

Abstract

Our goal is data-driven discovery of features for text simplification. In this paper, we investigate three types of lexical chains: exact, synonymous, and semantic. A lexical chain links semantically related words in a document. We examine their potential in two ways: a document-level corpus statistics study (914 texts) to estimate their overall capacity to differentiate between easy and difficult text, and a classification task (11 000 sentences) to determine the usefulness of the features at the sentence level for simplification. For the corpus statistics study, we tested five document-level features for each chain type: total number of chains, average chain length, average chain span, number of crossing chains, and number of chains longer than half the document length. We found significant differences between easy and difficult text for average chain length and the average number of crossing chains. For the sentence classification study, we compared the lexical chain features to standard bag-of-words features on a range of classifiers: logistic regression, naïve Bayes, decision trees, linear and RBF kernel SVM, and random forest. The lexical chain features performed significantly better than the bag-of-words baseline across all classifiers, with the best classifier achieving an accuracy of approximately 90% (compared to 78% for bag-of-words). Overall, we find that several lexical chain features provide specific information useful for identifying difficult sentences, beyond what is available from standard lexical features.
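To make the five document-level features concrete, the following is a minimal sketch for the simplest chain type (exact chains) only. It is not the authors' implementation, and several definitions are assumptions made here for illustration: a chain is every occurrence of the same non-stopword that appears at least twice, chain length is the number of occurrences, chain span is the token distance from first to last occurrence, a chain counts as "crossing" if its span overlaps the span of at least one other chain, and a "long" chain is one whose span exceeds half the document's token count. The function names and the small stopword list are hypothetical. Synonymous and semantic chains would additionally require a lexical resource and are not covered.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in",
             "is", "are", "it", "for", "with", "on"}


def exact_chains(text):
    """Return ({word: [token positions]}, token_count) for repeated words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok not in STOPWORDS:
            positions[tok].append(i)
    # Assumption: a word must occur at least twice to form a chain.
    chains = {w: p for w, p in positions.items() if len(p) >= 2}
    return chains, len(tokens)


def chain_features(text):
    """Compute five document-level features under the assumed definitions."""
    chains, n_tokens = exact_chains(text)
    if not chains:
        return {"num_chains": 0, "avg_length": 0.0, "avg_span": 0.0,
                "num_crossing": 0, "num_long": 0}

    # Span of a chain: (first position, last position).
    spans = {w: (p[0], p[-1]) for w, p in chains.items()}
    lengths = [len(p) for p in chains.values()]
    span_sizes = [end - start for start, end in spans.values()]

    # Assumption: a chain is "crossing" if its span overlaps another's span.
    num_crossing = sum(
        1
        for w, (s, e) in spans.items()
        if any(s <= e2 and s2 <= e
               for w2, (s2, e2) in spans.items() if w2 != w)
    )

    # Assumption: "long" means the span exceeds half the document length.
    num_long = sum(1 for size in span_sizes if size > n_tokens / 2)

    return {
        "num_chains": len(chains),
        "avg_length": sum(lengths) / len(lengths),
        "avg_span": sum(span_sizes) / len(span_sizes),
        "num_crossing": num_crossing,
        "num_long": num_long,
    }


if __name__ == "__main__":
    sample = ("The heart pumps blood. Blood carries oxygen. "
              "The heart needs oxygen.")
    print(chain_features(sample))
```

In the sample text, "heart", "blood", and "oxygen" each form a chain of length 2. A feature dictionary like this one could be computed per document and compared between easy and difficult texts, or computed per sentence context and passed to a classifier, which is the general shape of the two studies described above.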

Original language: English (US)
Article number: 8565884
Pages (from-to): 2164-2173
Number of pages: 10
Journal: IEEE Journal of Biomedical and Health Informatics
Volume: 23
Issue number: 5
State: Published - Sep 2019

Keywords

  • Health informatics
  • SVM
  • classification
  • decision trees
  • logistic regression
  • natural language processing
  • naïve Bayes
  • random forest
  • readability
  • text difficulty
  • text simplification

ASJC Scopus subject areas

  • Health Information Management
  • Health Informatics
  • Electrical and Electronic Engineering
  • Computer Science Applications
