TY - JOUR
T1 - Measuring text difficulty using parse-tree frequency
AU - Kauchak, David
AU - Leroy, Gondy
AU - Hogue, Alan
N1 - Funding Information:
Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM011975. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Publisher Copyright:
© 2017 ASIS&T
PY - 2017/9
Y1 - 2017/9
N2 - Text simplification often relies on dated, unproven readability formulas. As an alternative and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3rd level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and corpus analysis. For the user study, we selected 20 sentences randomly from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N = 6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, perceived as easier, and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.
AB - Text simplification often relies on dated, unproven readability formulas. As an alternative and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3rd level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and corpus analysis. For the user study, we selected 20 sentences randomly from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N = 6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, perceived as easier, and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.
UR - http://www.scopus.com/inward/record.url?scp=85020958561&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85020958561&partnerID=8YFLogxK
U2 - 10.1002/asi.23855
DO - 10.1002/asi.23855
M3 - Article
AN - SCOPUS:85020958561
SN - 2330-1635
VL - 68
SP - 2088
EP - 2100
JO - Journal of the Association for Information Science and Technology
JF - Journal of the Association for Information Science and Technology
IS - 9
ER -