Gold standard corpus, ontologies, and Entity-Quality ontology annotations for evolutionary phenotypes



This data set includes a gold standard corpus of evolutionary phenotype descriptions (character state descriptions drawn from a variety of phylogenetic systematics studies) and their corresponding expert-curated annotations with ontology terms in the form of Entity-Quality (EQ) statements. EQ annotations allow machine reasoning through the semantics encoded in the ontologies from which the terms are drawn. Machine reasoning in turn makes it possible to compute metrics that quantify the semantic similarity between phenotype descriptions as represented by their EQ annotations.

Also included are the ontologies and the EQ annotations, both those generated by human experts and those generated by machine (Semantic Charaparser), that were used to assess Semantic Charaparser's performance relative to inter-curator variation and to the effect of having access to external knowledge. The ontologies include those used as input, the "augmented" ontologies created by human curators in each experiment round, and the merged ontology used to maximize Semantic Charaparser's performance.

The production of the gold standard corpus, the annotation experiments, and the evaluation of the results are described in detail in the following manuscript: Dahdul et al. (2018) Annotation of phenotypes using ontologies: a Gold Standard for the training and evaluation of natural language processing systems. bioRxiv. Submitted to Database.

The analysis code for evaluating the gold standard corpus, along with the input data and ontologies it requires, is available separately: Manda et al. (2018) Code and data for analysis of evolutionary phenotype ontology annotations and gold standard corpus. Zenodo.

Compared with the previous version (v1.0.0), this record adds a file of MD5 checksums for the Gold Standard data files. The data files themselves are unchanged.
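Because this record ships MD5 checksums for the Gold Standard data files, a downloaded copy can be checked for integrity before use. The following is a minimal sketch, not part of the data set itself. It assumes the checksum file uses the common `md5sum` layout (one `<hex digest>  <file name>` pair per line), which the record does not confirm. The function names and the `data_dir` parameter are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(lines):
    """Parse md5sum-style lines into {file name: hex digest}.

    Assumption: each line is "<hex digest>  <file name>"; blank lines and
    lines starting with "#" are skipped.
    """
    expected = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading "*" on the file name.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_checksums(checksum_file, data_dir="."):
    """Map each listed file to True (match), False (mismatch), or None (missing)."""
    with open(checksum_file, encoding="utf-8") as handle:
        expected = parse_checksum_lines(handle)
    results = {}
    for name, digest in expected.items():
        path = Path(data_dir) / name
        results[name] = md5_of_file(path) == digest if path.is_file() else None
    return results
```

Calling `verify_checksums` with the path to the downloaded checksum file and the directory holding the data files returns a dictionary that can be scanned for any `False` or `None` entries. If the checksum file turns out to use a different layout, only `parse_checksum_lines` would need to change.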
Date made available: 2018

