TY - GEN
T1 - EntityBERT
T2 - 20th Workshop on Biomedical Language Processing, BioNLP 2021
AU - Lin, Chen
AU - Miller, Timothy
AU - Dligach, Dmitriy
AU - Bethard, Steven
AU - Savova, Guergana
N1 - Funding Information:
The study was funded by R01LM10090, U24CA248010 and UG3CA243120 from the United States National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors would like to thank the anonymous reviewers for their valuable suggestions and criticism. The authors would also like to acknowledge Boston Children's Hospital's High-Performance Computing Resources BCH HPC Cluster Enkefalos 2 (E2) made available for conducting the research reported in this publication. Software used in the project was installed and configured by BioGrids (Morin et al., 2013).
Publisher Copyright:
© 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
N2 - Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process. We show that such a model achieves superior results on clinical extraction tasks by comparing our entity-centric masking strategy with classic random masking on three clinical NLP tasks: cross-domain negation detection (Wu et al., 2014), document time relation (DocTimeRel) classification (Lin et al., 2020b), and temporal relation extraction (Wright-Bettner et al., 2020). We also evaluate our models on the PubMedQA (Jin et al., 2019) dataset to measure the models’ performance on a non-entity-centric task in the biomedical domain. The language addressed in this work is English.
AB - Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process. We show that such a model achieves superior results on clinical extraction tasks by comparing our entity-centric masking strategy with classic random masking on three clinical NLP tasks: cross-domain negation detection (Wu et al., 2014), document time relation (DocTimeRel) classification (Lin et al., 2020b), and temporal relation extraction (Wright-Bettner et al., 2020). We also evaluate our models on the PubMedQA (Jin et al., 2019) dataset to measure the models’ performance on a non-entity-centric task in the biomedical domain. The language addressed in this work is English.
UR - http://www.scopus.com/inward/record.url?scp=85123932639&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123932639&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85123932639
T3 - Proceedings of the 20th Workshop on Biomedical Language Processing, BioNLP 2021
SP - 191
EP - 201
BT - Proceedings of the 20th Workshop on Biomedical Language Processing, BioNLP 2021
A2 - Demner-Fushman, Dina
A2 - Cohen, Kevin Bretonnel
A2 - Ananiadou, Sophia
A2 - Tsujii, Junichi
PB - Association for Computational Linguistics (ACL)
Y2 - 11 June 2021
ER -