EntityBERT: Entity-centric Masking Strategy for Model Pretraining for the Clinical Domain

Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Bethard, Guergana Savova

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process. We show that such a model achieves superior results on clinical extraction tasks by comparing our entity-centric masking strategy with classic random masking on three clinical NLP tasks: cross-domain negation detection (Wu et al., 2014), document time relation (DocTimeRel) classification (Lin et al., 2020b), and temporal relation extraction (Wright-Bettner et al., 2020). We also evaluate our models on the PubMedQA (Jin et al., 2019) dataset to measure the models’ performance on a non-entity-centric task in the biomedical domain. The language addressed in this work is English.

Original language: English (US)
Title of host publication: Proceedings of the 20th Workshop on Biomedical Language Processing, BioNLP 2021
Editors: Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii
Publisher: Association for Computational Linguistics (ACL)
Pages: 191-201
Number of pages: 11
ISBN (Electronic): 9781954085404
State: Published - 2021
Externally published: Yes
Event: 20th Workshop on Biomedical Language Processing, BioNLP 2021 - Virtual, Online
Duration: Jun 11 2021 → …

Publication series

Name: Proceedings of the 20th Workshop on Biomedical Language Processing, BioNLP 2021

Conference

Conference: 20th Workshop on Biomedical Language Processing, BioNLP 2021
City: Virtual, Online
Period: 6/11/21 → …

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Information Systems
  • Software
  • Biomedical Engineering
  • Health Informatics
