TY - GEN
T1 - Text simplification tools
T2 - 47th Hawaii International Conference on System Sciences, HICSS 2014
AU - Kauchak, David
AU - Mouradi, Obay
AU - Pentoney, Christopher
AU - Leroy, Gondy
PY - 2014
Y1 - 2014
N2 - Although providing understandable information is a critical component in healthcare, few tools exist to help clinicians identify difficult sections in text. We systematically examine sixteen features for predicting the difficulty of health texts using six different machine learning algorithms. Three represent new features not previously examined: medical concept density; specificity (calculated using word-level depth in MeSH); and ambiguity (calculated using the number of UMLS Metathesaurus concepts associated with a word). We examine these features for a binary prediction task on 118,000 simple and difficult sentences from a sentence-aligned corpus. Using all features, random forests is the most accurate with 84% accuracy. Model analysis of the six models and a complementary ablation study shows that the specificity and ambiguity features are the strongest predictors (24% combined impact on accuracy). Notably, a training size study showed that even with a 1% sample (1,062 sentences) an accuracy of 80% can be achieved.
UR - http://www.scopus.com/inward/record.url?scp=84902295430&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84902295430&partnerID=8YFLogxK
U2 - 10.1109/HICSS.2014.330
DO - 10.1109/HICSS.2014.330
M3 - Conference contribution
AN - SCOPUS:84902295430
SN - 9781479925049
T3 - Proceedings of the Annual Hawaii International Conference on System Sciences
SP - 2616
EP - 2625
BT - Proceedings of the 47th Annual Hawaii International Conference on System Sciences, HICSS 2014
PB - IEEE Computer Society
Y2 - 6 January 2014 through 9 January 2014
ER -