TY - JOUR
T1 - Unified Medical Language System resources improve sieve-based generation and Bidirectional Encoder Representations from Transformers (BERT)–based ranking for concept normalization
AU - Xu, Dongfang
AU - Gopale, Manoj
AU - Zhang, Jiacheng
AU - Brown, Kris
AU - Begoli, Edmon
AU - Bethard, Steven
N1 - Funding Information:
This work was supported in part by National Institutes of Health grant R01LM012918 from the National Library of Medicine (Site PI: SB). Part of the computations were done in systems supported by the National Science Foundation under Grant No. 1228509. This work has been in part coauthored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, National Science Foundation, UT-Battelle, or the Department of Energy.
Publisher Copyright:
© The Author(s) 2020.
PY - 2020/10/1
Y1 - 2020/10/1
N2 - Objective: Concept normalization, the task of linking phrases in text to concepts in an ontology, is useful for many downstream tasks, such as relation extraction and information retrieval. We present a generate-and-rank concept normalization system based on our participation in the 2019 National NLP Clinical Challenges Shared Task Track 3 Concept Normalization. Materials and Methods: The shared task provided 13 609 concept mentions drawn from 100 discharge summaries. We first design a sieve-based system that uses Lucene indices over the training data, Unified Medical Language System (UMLS) preferred terms, and UMLS synonyms to generate a list of possible concepts for each mention. We then design a listwise classifier based on the BERT (Bidirectional Encoder Representations from Transformers) neural network to rank the candidate concepts, integrating UMLS semantic types through a regularizer. Results: Our generate-and-rank system was third of 33 in the competition, outperforming the candidate generator alone (81.66% vs 79.44%) and the previous state of the art (76.35%). During postevaluation, the model’s accuracy was increased to 83.56% via improvements to how training data are generated from UMLS and incorporation of our UMLS semantic type regularizer. Discussion: Analysis of the model shows that prioritizing UMLS preferred terms yields better performance, that the UMLS semantic type regularizer results in qualitatively better concept predictions, and that the model performs well even on concepts not seen during training. Conclusions: Our generate-and-rank framework for UMLS concept normalization integrates key UMLS features like preferred terms and semantic types with a neural network–based ranking model to accurately link phrases in text to UMLS concepts.
AB - Objective: Concept normalization, the task of linking phrases in text to concepts in an ontology, is useful for many downstream tasks, such as relation extraction and information retrieval. We present a generate-and-rank concept normalization system based on our participation in the 2019 National NLP Clinical Challenges Shared Task Track 3 Concept Normalization. Materials and Methods: The shared task provided 13 609 concept mentions drawn from 100 discharge summaries. We first design a sieve-based system that uses Lucene indices over the training data, Unified Medical Language System (UMLS) preferred terms, and UMLS synonyms to generate a list of possible concepts for each mention. We then design a listwise classifier based on the BERT (Bidirectional Encoder Representations from Transformers) neural network to rank the candidate concepts, integrating UMLS semantic types through a regularizer. Results: Our generate-and-rank system was third of 33 in the competition, outperforming the candidate generator alone (81.66% vs 79.44%) and the previous state of the art (76.35%). During postevaluation, the model’s accuracy was increased to 83.56% via improvements to how training data are generated from UMLS and incorporation of our UMLS semantic type regularizer. Discussion: Analysis of the model shows that prioritizing UMLS preferred terms yields better performance, that the UMLS semantic type regularizer results in qualitatively better concept predictions, and that the model performs well even on concepts not seen during training. Conclusions: Our generate-and-rank framework for UMLS concept normalization integrates key UMLS features like preferred terms and semantic types with a neural network–based ranking model to accurately link phrases in text to UMLS concepts.
KW - Concept normalization
KW - Deep learning
KW - Generate-and-rank
KW - Natural language processing
KW - Unified Medical Language System
UR - http://www.scopus.com/inward/record.url?scp=85093539184&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093539184&partnerID=8YFLogxK
U2 - 10.1093/jamia/ocaa080
DO - 10.1093/jamia/ocaa080
M3 - Article
C2 - 32719838
AN - SCOPUS:85093539184
SN - 1067-5027
VL - 27
SP - 1510
EP - 1519
JO - Journal of the American Medical Informatics Association
JF - Journal of the American Medical Informatics Association
IS - 10
ER -