Locating and reconfiguring records in unstructured multiple-record web documents

David W. Embley, L. Xu

Research output: Contribution to journal › Article › peer-review

1 Scopus citations

Abstract

Record extraction from data-rich, unstructured, multiple-record Web documents works well [9], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [10]), many do not. When some values of textual records are factored out, are split unnaturally across boundaries, are joined unnaturally within boundaries, or are linked by off-page connectors, or when desired records are interspersed with records that are not of interest, it is difficult to automatically cull records and piece values together to form clean, delineated chunks of text that each represent a single record of interest. In this paper we address this problem and propose an algorithm to find and rearrange (if necessary) records in an HTML document. The essential idea is to attempt to maximize a record-recognition heuristic with respect to a given application ontology. Tests we conducted for two widely differing applications show that this technique properly locates and reconfigures records.

Original language: English (US)
Pages (from-to): 256-274
Number of pages: 19
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1997
State: Published - 2001
Externally published: Yes

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
