Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7–15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
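The sketch below illustrates the general idea described in the abstract: extract in-domain keywords with KeyBERT, then mask those keywords (rather than random tokens) when building masked-language-modeling examples. It assumes a Hugging Face tokenizer, and the specific choices (top_n, masking every keyword occurrence, naive subword matching) are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of keyword-guided masking for domain-adaptive MLM pre-training.
# KeyBERT is used for keyword extraction, as in the paper; all other details
# (hyperparameters, subword handling) are assumptions for illustration only.
from keybert import KeyBERT
from transformers import AutoTokenizer


def extract_domain_keywords(docs, top_n=10):
    """Collect a set of in-domain keywords across a corpus using KeyBERT."""
    kw_model = KeyBERT()
    keywords = set()
    for doc in docs:
        for word, _score in kw_model.extract_keywords(
            doc, keyphrase_ngram_range=(1, 1), stop_words="english", top_n=top_n
        ):
            keywords.add(word.lower())
    return keywords


def mask_keywords(text, keywords, tokenizer):
    """Replace tokens belonging to in-domain keywords with [MASK] for MLM."""
    enc = tokenizer(text, add_special_tokens=True)
    input_ids = enc["input_ids"]
    labels = [-100] * len(input_ids)  # -100 positions are ignored by the MLM loss
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    for i, tok in enumerate(tokens):
        # Naive match on the (de-prefixed) WordPiece token; real subword handling
        # would mask whole words, which is an assumption left out of this sketch.
        if tok.lstrip("#").lower() in keywords:
            labels[i] = input_ids[i]
            input_ids[i] = tokenizer.mask_token_id
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    corpus = ["The MRI showed a partial tear of the anterior cruciate ligament."]
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    domain_keywords = extract_domain_keywords(corpus, top_n=5)
    print(mask_keywords(corpus[0], domain_keywords, tokenizer))
```

The resulting input_ids/labels pairs can be fed to a standard masked-language-modeling objective, so the only change relative to ordinary domain-adaptive pre-training is which positions get masked.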

Original language: English (US)
Title of host publication: ACL 2023 - 8th Workshop on Representation Learning for NLP, RepL4NLP 2023 - Proceedings of the Workshop
Editors: Burcu Can, Maximilian Mozes, Samuel Cahyawijaya, Naomi Saphra, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Chen Zhao, Isabelle Augenstein, Anna Rogers, Kyunghyun Cho, Edward Grefenstette, Lena Voita
Publisher: Association for Computational Linguistics (ACL)
Pages: 13-21
Number of pages: 9
ISBN (Electronic): 9781959429777
State: Published - 2023
Event: 8th Workshop on Representation Learning for NLP, RepL4NLP 2023, co-located with ACL 2023 - Toronto, Canada
Duration: Jul 13 2023 → …

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: 8th Workshop on Representation Learning for NLP, RepL4NLP 2023, co-located with ACL 2023
Country/Territory: Canada
City: Toronto
Period: 7/13/23 → …

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

