TY - GEN
T1 - Combining NLP with evidence-based methods to find text metrics related to perceived and actual text difficulty
AU - Leroy, Gondy
AU - Endicott, James E.
PY - 2012
Y1 - 2012
AB - Measuring text difficulty is prevalent in health informatics since it is useful for information personalization and optimization. Unfortunately, it is uncertain how best to compute difficulty so that it relates to reader understanding. We aim to create computational, evidence-based metrics of perceived and actual text difficulty. We start with a corpus analysis to identify candidate metrics, which are further tested in user studies. Our corpus contains blogs and journal articles (N=1,073) representing easy and difficult text. Using natural language processing, we calculated base grammatical and semantic metrics, constructed new composite metrics (noun phrase complexity and semantic familiarity), and measured the commonly used Flesch-Kincaid grade level. The metrics differed significantly between document types. Nouns were more prevalent but less familiar in difficult text; verbs and function words were more prevalent in easy text. Noun phrase complexity was lower, semantic familiarity was higher, and grade levels were lower in easy text. All metrics were then tested for their relation to perceived and actual difficulty using follow-up analyses of two earlier user studies. Base metrics and noun phrase complexity correlated significantly with perceived difficulty and could help explain actual difficulty.
KW - Actual difficulty
KW - Health informatics
KW - Natural language processing
KW - Perceived difficulty
KW - Readability
UR - http://www.scopus.com/inward/record.url?scp=84857730515&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84857730515&partnerID=8YFLogxK
U2 - 10.1145/2110363.2110452
DO - 10.1145/2110363.2110452
M3 - Conference contribution
AN - SCOPUS:84857730515
SN - 9781450307819
T3 - IHI'12 - Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium
SP - 749
EP - 753
BT - IHI'12 - Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium
T2 - 2nd ACM SIGHIT International Health Informatics Symposium, IHI'12
Y2 - 28 January 2012 through 30 January 2012
ER -