TY - JOUR
T1 - Combine unsupervised learning and heuristic rules to annotate organism morphological descriptions
AU - Cui, Hong
AU - Singaram, Sriramu
AU - Janning, Alyssa
PY - 2011
Y1 - 2011
N2 - Biodiversity literature is a comprehensive compilation of information on living organisms and fossils. Rich factual information on characteristics of organisms is presented in narrative form, hence limiting its repurposing and reuse. Transforming narrative information into atomic forms has been of special concern to informatics researchers and biological researchers alike. Research done previously shows similar results but lacks a detailed, scientific evaluation that would help illuminate the problem and eventually lead to a higher-performance approach. Due to the sublanguage nature of morphological descriptions, it is thought that general-purpose natural language processing (NLP) tools are not effective in this application. A heuristic-based approach has been suggested in the literature. In this paper, we report our experiments with such an approach, where a set of simple, intuitive heuristic rules, informed by results of an unsupervised learning algorithm, is used to segment taxonomic descriptions and identify the organs along with their associated character/value pairs (color=white, shape=ovoid). This model system allows us to investigate the character annotation problem further, study the characteristics of morphological descriptions, identify the areas where the system fails, and suggest ways to address those failures. One such suggestion is to make use of general-purpose syntactic parsers in a controlled manner.
AB - Biodiversity literature is a comprehensive compilation of information on living organisms and fossils. Rich factual information on characteristics of organisms is presented in narrative form, hence limiting its repurposing and reuse. Transforming narrative information into atomic forms has been of special concern to informatics researchers and biological researchers alike. Research done previously shows similar results but lacks a detailed, scientific evaluation that would help illuminate the problem and eventually lead to a higher-performance approach. Due to the sublanguage nature of morphological descriptions, it is thought that general-purpose natural language processing (NLP) tools are not effective in this application. A heuristic-based approach has been suggested in the literature. In this paper, we report our experiments with such an approach, where a set of simple, intuitive heuristic rules, informed by results of an unsupervised learning algorithm, is used to segment taxonomic descriptions and identify the organs along with their associated character/value pairs (color=white, shape=ovoid). This model system allows us to investigate the character annotation problem further, study the characteristics of morphological descriptions, identify the areas where the system fails, and suggest ways to address those failures. One such suggestion is to make use of general-purpose syntactic parsers in a controlled manner.
KW - Character annotation
KW - Character markup
KW - Heuristic rules
KW - Semantic annotation technique
KW - Unsupervised machine learning algorithm
UR - http://www.scopus.com/inward/record.url?scp=84861450874&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84861450874&partnerID=8YFLogxK
U2 - 10.1002/meet.2011.14504801031
DO - 10.1002/meet.2011.14504801031
M3 - Article
AN - SCOPUS:84861450874
SN - 0044-7870
VL - 48
JO - Proceedings of the ASIST Annual Meeting
JF - Proceedings of the ASIST Annual Meeting
ER -