Record extraction from data-rich, unstructured, multiple-record Web documents works well, but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated), many do not. When some values of textual records are factored out, are split unnaturally across boundaries, are joined unnaturally within boundaries, or are linked by off-page connectors, or when desired records are interspersed with records that are not of interest, it is difficult to automatically cull records and piece values together to form clean, delineated chunks of text that each represent a single record of interest. In this paper we address this problem and propose an algorithm to find and rearrange (if necessary) records in an HTML document. The essential idea is to attempt to maximize a record-recognition heuristic with respect to a given application ontology. Tests we conducted for two widely differing applications show that this technique properly locates and reconfigures records.
|Original language||English (US)|
|Number of pages||19|
|Journal||Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
|State||Published - 2001|
ASJC Scopus subject areas
- Theoretical Computer Science
- Computer Science(all)