TY - GEN
T1 - Validity Assessment of Legal Will Statements as Natural Language Inference
AU - Kwak, Alice Saebom
AU - Israelsen, Jacob O.
AU - Morrison, Clayton T.
AU - Bambauer, Derek E.
AU - Surdeanu, Mihai
N1 - Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - This work introduces a natural language inference (NLI) dataset that focuses on the validity of statements in legal wills. This dataset is unique because: (a) each entailment decision requires three inputs: the statement from the will, the law, and the conditions that hold at the time of the testator's death; and (b) the included texts are longer than those in current NLI datasets. We trained eight neural NLI models on this dataset. All the models achieve more than 80% macro F1 and accuracy, which indicates that neural approaches can handle this task reasonably well. However, group accuracy, a stricter evaluation measure that treats a group of positive and negative examples generated from the same statement as a single unit, is in the mid 80s at best, which suggests that the models' understanding of the task remains superficial. Further ablative analyses and explanation experiments indicate that all three text segments are used for prediction, but some decisions rely on semantically irrelevant tokens. This indicates that overfitting on these longer texts likely occurs and that additional research is required to solve this task.
AB - This work introduces a natural language inference (NLI) dataset that focuses on the validity of statements in legal wills. This dataset is unique because: (a) each entailment decision requires three inputs: the statement from the will, the law, and the conditions that hold at the time of the testator's death; and (b) the included texts are longer than those in current NLI datasets. We trained eight neural NLI models on this dataset. All the models achieve more than 80% macro F1 and accuracy, which indicates that neural approaches can handle this task reasonably well. However, group accuracy, a stricter evaluation measure that treats a group of positive and negative examples generated from the same statement as a single unit, is in the mid 80s at best, which suggests that the models' understanding of the task remains superficial. Further ablative analyses and explanation experiments indicate that all three text segments are used for prediction, but some decisions rely on semantically irrelevant tokens. This indicates that overfitting on these longer texts likely occurs and that additional research is required to solve this task.
UR - https://www.scopus.com/pages/publications/85149896906
U2 - 10.18653/v1/2022.findings-emnlp.438
DO - 10.18653/v1/2022.findings-emnlp.438
M3 - Conference contribution
AN - SCOPUS:85149896906
T3 - Findings of the Association for Computational Linguistics: EMNLP 2022
SP - 6076
EP - 6085
BT - Findings of the Association for Computational Linguistics: EMNLP 2022
A2 - Goldberg, Yoav
A2 - Kozareva, Zornitsa
A2 - Zhang, Yue
PB - Association for Computational Linguistics (ACL)
T2 - 2022 Findings of the Association for Computational Linguistics: EMNLP 2022
Y2 - 7 December 2022 through 11 December 2022
ER -