Identifying bacterial biotope entities using sequence labeling: Performance and feature analysis

Jin Mao, Hong Cui

Research output: Contribution to journal › Article › peer-review


Habitat information is important to biodiversity conservation and research. Extracting bacterial biotope entities from scientific publications is important to the large-scale study of the relationships between bacteria and their living environments. To facilitate the further development of robust habitat text mining systems for biodiversity, and following the BioNLP task framework, three sequence labeling techniques, CRFs (Conditional Random Fields), MEMM (Maximum Entropy Markov Model), and SVMhmm (Support Vector Machine), and one classifier, SVMmulticlass, are compared on their performance in identifying three types of bacterial biotope entities: bacteria, habitats, and geographical locations. The effectiveness of a variety of basic word formation features, syntactic features, and semantic features is examined and compared for the three sequence labeling methods. Experiments on two publicly available BioNLP collections show that, in addition to a WordNet feature, word embedding clusters (although not trained on the task-specific corpus) consistently improve performance for all methods on all entity types in both collections. Other features produce mixed results. Our results also show that, when trained on limited corpora, Brown clusters yield better performance than word embedding clusters. Further analysis suggests that entity recognition performance can be greatly boosted by improving the accuracy of entity boundary identification.

Original language: English (US)
Pages (from-to): 1134-1147
Number of pages: 14
Journal: Journal of the Association for Information Science and Technology
Issue number: 9
State: Published - Sep 2018

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Information Systems and Management
  • Library and Information Sciences

