Categorisation of web documents using extraction ontologies

Li Xu, David W. Embley

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Automatically recognising which HTML documents on the Web contain items of interest for a user is non-trivial. As a step toward solving this problem, we propose an approach based on information-extraction ontologies. Given HTML documents, tables, and forms, our document recognition system extracts expected ontological vocabulary (keywords and keyword phrases) and expected ontological instance data (particular values for ontological concepts). We then use machine-learned rules over this extracted information to determine whether an HTML document contains items of interest. Experimental results show that our ontological approach to categorisation works well, having achieved F-measures above 90% for all applications we tried.
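The pipeline described in the abstract (extract expected vocabulary and instance data, then apply rules to the extracted information) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Concept` class, the toy car-advertisement ontology, the regular expressions, and the function names are all hypothetical. In particular, the fixed threshold in `is_of_interest` is a hand-written stand-in for the machine-learned rules the authors use, and the sketch treats a page as plain text rather than handling tables and forms.

```python
import re
from dataclasses import dataclass


@dataclass
class Concept:
    """One ontological concept (hypothetical structure, for illustration)."""
    name: str
    keywords: list        # expected vocabulary: keywords and keyword phrases
    value_patterns: list  # regexes recognising expected instance data


# Hypothetical toy ontology for car advertisements (not from the paper).
CAR_AD_ONTOLOGY = [
    Concept("Price", ["price", "asking"], [r"\$\d{1,3}(?:,\d{3})+|\$\d+"]),
    Concept("Mileage", ["mileage", "miles"],
            [r"\b\d{1,3}(?:,\d{3})+\s*(?:miles|mi)\b"]),
    Concept("Year", ["year"], [r"\b(?:19|20)\d{2}\b"]),
]


def strip_tags(html):
    """Crudely remove HTML tags; a real system would parse the document."""
    return re.sub(r"<[^>]+>", " ", html)


def extract_features(html, ontology):
    """Count keyword hits and instance-value hits for each concept."""
    text = strip_tags(html).lower()
    features = {}
    for concept in ontology:
        keyword_hits = sum(
            len(re.findall(r"\b" + re.escape(k) + r"\b", text))
            for k in concept.keywords
        )
        value_hits = sum(
            len(re.findall(p, text)) for p in concept.value_patterns
        )
        features[concept.name] = (keyword_hits, value_hits)
    return features


def is_of_interest(features, min_concepts=2):
    """Hand-set threshold rule standing in for a machine-learned rule."""
    matched = sum(1 for kw, vals in features.values() if kw + vals > 0)
    return matched >= min_concepts


car_page = ("<html><body><h1>2004 Honda Civic</h1>"
            "<p>Asking price: $5,500. Mileage: 82,000 miles.</p></body></html>")
other_page = "<p>Our bakery opens at 8 am. Fresh bread daily.</p>"

print(is_of_interest(extract_features(car_page, CAR_AD_ONTOLOGY)))
print(is_of_interest(extract_features(other_page, CAR_AD_ONTOLOGY)))
```

In a learned setting, the per-concept counts returned by `extract_features` would serve as the feature vector for a rule learner in place of the fixed `min_concepts` threshold.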

Original language: English (US)
Pages (from-to): 3-20
Number of pages: 18
Journal: International Journal of Metadata, Semantics and Ontologies
Volume: 3
Issue number: 1
State: Published - 2008

Keywords

  • Extraction ontologies
  • Web document categorisation
  • Web document classification

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences
