TY - JOUR
T1 - Identifying bacterial biotope entities using sequence labeling
T2 - Performance and feature analysis
AU - Mao, Jin
AU - Cui, Hong
N1 - Funding Information:
This study is supported by the US National Science Foundation under Grant Number DEB-1208567.
Publisher Copyright:
© 2018 ASIS&T
PY - 2018/9
Y1 - 2018/9
AB - Habitat information is important to biodiversity conservation and research. Extracting bacterial biotope entities from scientific publications is important to the large-scale study of the relationships between bacteria and their living environments. To facilitate the further development of robust habitat text mining systems for biodiversity, following the BioNLP task framework, three sequence labeling techniques, CRFs (Conditional Random Fields), MEMM (Maximum Entropy Markov Model), and SVMhmm (Support Vector Machine), and one classifier, SVMmulticlass, are compared on their performance in identifying three types of bacterial biotope entities: bacteria, habitats, and geographical locations. A variety of basic word formation features, syntactic features, and semantic features are exploited, and their effectiveness is compared across the three sequence labeling methods. Experiments on two publicly available BioNLP collections show that, in addition to a WordNet feature, word embedding clusters (although not trained with the task-specific corpus) consistently improve the performance of all methods on all entity types in both collections. Other features produce mixed results. Our results also show that, when trained on limited corpora, Brown clusters yield better performance than word embedding clusters. Further analysis suggests that entity recognition performance can be greatly boosted by improving the accuracy of entity boundary identification.
UR - http://www.scopus.com/inward/record.url?scp=85052494291&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052494291&partnerID=8YFLogxK
U2 - 10.1002/asi.24032
DO - 10.1002/asi.24032
M3 - Article
AN - SCOPUS:85052494291
SN - 2330-1635
VL - 69
SP - 1134
EP - 1147
JO - Journal of the Association for Information Science and Technology
JF - Journal of the Association for Information Science and Technology
IS - 9
ER -