TY - JOUR
T1 - Resolving "orphaned" non-specific structures using machine learning and natural language processing methods
AU - Xu, Dongfang
AU - Chong, Steven S.
AU - Rodenhausen, Thomas
AU - Cui, Hong
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation under Grant No. NSF DBI-1147266. The authors also thank anonymous reviewers for their constructive suggestions that helped to improve this paper.
Publisher Copyright:
© Xu D et al.
PY - 2018
Y1 - 2018
N2 - Scholarly publications of biodiversity literature contain a vast amount of information in human-readable format. The detailed morphological descriptions in these publications contain rich information that can be extracted to facilitate analysis and computational biology research. However, the idiosyncrasies of morphological descriptions still pose a number of challenges to machines. In this work, we demonstrate the use of two different approaches to resolve meronym (i.e. part-of) relations between anatomical parts and their anchor organs: a syntactic rule-based approach and an SVM-based (support vector machine) method. Both methods made use of domain ontologies. We compared the two approaches with two other baseline methods, and the evaluation results show that the syntactic methods (92.1% F1 score) outperformed the SVM methods (80.7% F1 score) and that the part-of ontologies were valuable knowledge sources for the task. It is notable that the mistakes made by the two approaches rarely overlapped. Additional tests will be conducted on the development version of the Explorer of Taxon Concepts toolkit before we make the functionality publicly available. Meanwhile, we will investigate and leverage the complementary nature of the two approaches to further drive down the error rate, because in practical applications even a 1% error rate could lead to hundreds of errors.
AB - Scholarly publications of biodiversity literature contain a vast amount of information in human-readable format. The detailed morphological descriptions in these publications contain rich information that can be extracted to facilitate analysis and computational biology research. However, the idiosyncrasies of morphological descriptions still pose a number of challenges to machines. In this work, we demonstrate the use of two different approaches to resolve meronym (i.e. part-of) relations between anatomical parts and their anchor organs: a syntactic rule-based approach and an SVM-based (support vector machine) method. Both methods made use of domain ontologies. We compared the two approaches with two other baseline methods, and the evaluation results show that the syntactic methods (92.1% F1 score) outperformed the SVM methods (80.7% F1 score) and that the part-of ontologies were valuable knowledge sources for the task. It is notable that the mistakes made by the two approaches rarely overlapped. Additional tests will be conducted on the development version of the Explorer of Taxon Concepts toolkit before we make the functionality publicly available. Meanwhile, we will investigate and leverage the complementary nature of the two approaches to further drive down the error rate, because in practical applications even a 1% error rate could lead to hundreds of errors.
KW - Anaphora resolution
KW - Biodiversity literature
KW - Information extraction
KW - Machine learning
KW - Morphological descriptions
KW - Ontology application
KW - Performance evaluation
UR - http://www.scopus.com/inward/record.url?scp=85057456634&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057456634&partnerID=8YFLogxK
U2 - 10.3897/BDJ.6.e26659
DO - 10.3897/BDJ.6.e26659
M3 - Article
AN - SCOPUS:85057456634
SN - 1314-2836
VL - 6
JO - Biodiversity Data Journal
JF - Biodiversity Data Journal
M1 - e26659
ER -