TY - GEN
T1 - Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords
T2 - 8th Workshop on Representation Learning for NLP, RepL4NLP 2023, co-located with ACL 2023
AU - Golchin, Shahriar
AU - Surdeanu, Mihai
AU - Tavabi, Nazgol
AU - Kiapour, Ata
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7–15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
UR - http://www.scopus.com/inward/record.url?scp=85174509382&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85174509382&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85174509382
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 13
EP - 21
BT - ACL 2023 - 8th Workshop on Representation Learning for NLP, RepL4NLP 2023 - Proceedings of the Workshop
A2 - Can, Burcu
A2 - Mozes, Maximilian
A2 - Cahyawijaya, Samuel
A2 - Saphra, Naomi
A2 - Kassner, Nora
A2 - Ravfogel, Shauli
A2 - Ravichander, Abhilasha
A2 - Zhao, Chen
A2 - Augenstein, Isabelle
A2 - Rogers, Anna
A2 - Cho, Kyunghyun
A2 - Grefenstette, Edward
A2 - Voita, Lena
PB - Association for Computational Linguistics (ACL)
Y2 - 13 July 2023
ER -
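
A minimal sketch of the masking idea summarized in the abstract: KeyBERT extracts in-domain keywords, and those keywords, rather than randomly chosen tokens, are selected for masking before MLM pre-training continues on in-domain text. This is not the authors' released code; the example sentence, top_n value, and the substring-based matching of wordpieces to keywords are assumptions made for illustration.

# Sketch of keyword-based masking for domain-adaptive MLM pre-training.
# Hypothetical choices: the example sentence, top_n=5, and the crude
# substring matching; the paper's exact configuration may differ.
from keybert import KeyBERT
from transformers import AutoTokenizer

kw_model = KeyBERT()  # default sentence-transformers backbone
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

doc = ("The patient was administered metformin to control hyperglycemia "
       "and monitored for lactic acidosis.")

# Extract a compact set of in-domain keywords (Grootendorst, 2020).
keywords = [kw for kw, _score in kw_model.extract_keywords(
    doc, keyphrase_ngram_range=(1, 1), stop_words="english", top_n=5)]

# Mask wordpieces that fall inside a keyword span, instead of masking randomly.
enc = tokenizer(doc, return_offsets_mapping=True)
masked_ids = []
for token_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
    piece = doc[start:end].lower()  # special tokens map to empty spans
    if piece and any(piece in kw for kw in keywords):
        masked_ids.append(tokenizer.mask_token_id)
    else:
        masked_ids.append(token_id)

print("keywords:", keywords)
print(tokenizer.decode(masked_ids))

The masked sequence would then feed standard masked-language-model training, with prediction targets at the masked keyword positions.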