TY - GEN
T1 - Triplet-Trained Vector Space and Sieve-Based Search Improve Biomedical Concept Normalization
AU - Xu, Dongfang
AU - Bethard, Steven
N1 - Funding Information:
Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM012918. The computations were done on systems supported by the National Science Foundation under Grant No. 1228509. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Publisher Copyright:
© 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
AB - Concept normalization, the task of linking textual mentions of concepts to concepts in an ontology, is critical for mining and analyzing biomedical texts. We propose a vector-space model for concept normalization, where mentions and concepts are encoded via transformer networks that are trained via a triplet objective with online hard triplet mining. The transformer networks refine existing pre-trained models, and the online triplet mining makes training efficient even with hundreds of thousands of concepts by sampling training triples within each mini-batch. We introduce a variety of strategies for searching with the trained vector-space model, including approaches that incorporate domain-specific synonyms at search time with no model retraining. Across five datasets, our models that are trained only once on their corresponding ontologies are within 3 points of state-of-the-art models that are retrained for each new domain. Our models can also be trained for each domain, achieving new state-of-the-art on multiple datasets.
UR - http://www.scopus.com/inward/record.url?scp=85123939640&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123939640&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85123939640
T3 - Proceedings of the 20th Workshop on Biomedical Language Processing, BioNLP 2021
SP - 11
EP - 22
BT - Proceedings of the 20th Workshop on Biomedical Language Processing, BioNLP 2021
A2 - Demner-Fushman, Dina
A2 - Cohen, Kevin Bretonnel
A2 - Ananiadou, Sophia
A2 - Tsujii, Junichi
PB - Association for Computational Linguistics (ACL)
T2 - 20th Workshop on Biomedical Language Processing, BioNLP 2021
Y2 - 11 June 2021
ER -
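
Note: the abstract above describes three technical ingredients: transformer encoders that map mentions and concepts into a shared vector space, a triplet objective with online hard triplet mining computed within each mini-batch, and nearest-neighbor search over that space that can fold in domain-specific synonyms at query time without retraining. The Python sketch below illustrates that general recipe only; it is not the authors' implementation, and the encoder checkpoint (bert-base-uncased), the margin, the mean pooling, and the toy ontology with its concept identifiers are all illustrative assumptions.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed checkpoint; the paper fine-tunes existing pre-trained encoders
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    # Encode a list of strings into L2-normalized vectors (mean-pooled last hidden layer).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return F.normalize(pooled, dim=-1)

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    # Online hard triplet mining inside one mini-batch (batch-hard variant):
    # for every anchor, use its farthest in-batch positive and closest in-batch negative,
    # so triplets are formed from the batch itself rather than enumerated over the ontology.
    dist = torch.cdist(embeddings, embeddings)              # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # same-concept mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    hardest_pos = (dist * (same & ~eye).float()).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

# Search sketch: index concept synonyms and link a mention to the nearest one.
ontology = {                                    # toy ontology; identifiers illustrative only
    "C0027051": ["myocardial infarction", "heart attack"],
    "C0020538": ["hypertensive disease", "high blood pressure"],
}
names, cuis = zip(*[(name, cui) for cui, syns in ontology.items() for name in syns])
with torch.no_grad():
    index = embed(list(names))                  # concept-name matrix, one row per synonym
    query = embed(["acute heart attack"])       # mention from text
scores = query @ index.T                        # cosine similarity (rows are unit-normalized)
print(cuis[scores.argmax().item()])             # -> predicted concept identifier

Because mentions and concept names live in one shared vector space, incorporating new domain-specific synonyms only requires embedding them and appending rows to the index, which is how the abstract's search-time synonym strategy avoids any model retraining.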