TY - GEN
T1 - Cheap and Good? Simple and Effective Data Augmentation for Low Resource Machine Reading
AU - Van, Hoang
AU - Yadav, Vikas
AU - Surdeanu, Mihai
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/7/11
Y1 - 2021/7/11
N2 - We propose a simple and effective strategy for data augmentation for low-resource machine reading comprehension (MRC). Our approach first pretrains the answer extraction components of a MRC system on the augmented data that contains approximate context of the correct answers, before training it on the exact answer spans. The approximate context helps the QA method components in narrowing the location of the answers. We demonstrate that our simple strategy substantially improves both document retrieval and answer extraction performance by providing larger context of the answers and additional training data. In particular, our method significantly improves the performance of BERT based retriever (15.12%), and answer extractor (4.33% F1) on TechQA, a complex, low-resource MRC task. Further, our data augmentation strategy yields significant improvements of up to 3.9% exact match (EM) and 2.7% F1 for answer extraction on PolicyQA, another practical but moderate sized QA dataset that also contains long answer spans.
AB - We propose a simple and effective strategy for data augmentation for low-resource machine reading comprehension (MRC). Our approach first pretrains the answer extraction components of a MRC system on the augmented data that contains approximate context of the correct answers, before training it on the exact answer spans. The approximate context helps the QA method components in narrowing the location of the answers. We demonstrate that our simple strategy substantially improves both document retrieval and answer extraction performance by providing larger context of the answers and additional training data. In particular, our method significantly improves the performance of BERT based retriever (15.12%), and answer extractor (4.33% F1) on TechQA, a complex, low-resource MRC task. Further, our data augmentation strategy yields significant improvements of up to 3.9% exact match (EM) and 2.7% F1 for answer extraction on PolicyQA, another practical but moderate sized QA dataset that also contains long answer spans.
KW - data augmentation
KW - document retrieval
KW - question answering
UR - http://www.scopus.com/inward/record.url?scp=85111703738&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85111703738&partnerID=8YFLogxK
U2 - 10.1145/3404835.3463099
DO - 10.1145/3404835.3463099
M3 - Conference contribution
AN - SCOPUS:85111703738
T3 - SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 2116
EP - 2120
BT - SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021
Y2 - 11 July 2021 through 15 July 2021
ER -