Combining schema and instance information for integrating heterogeneous data sources

Huimin Zhao, Sudha Ram

Research output: Contribution to journal, Article, peer-review

30 Scopus citations

Abstract

Determining the correspondences among heterogeneous data sources, which is critical to integration of the data sources, is a complex and resource-consuming task that demands automated support. We propose an iterative procedure for detecting both schema-level and instance-level correspondences among heterogeneous data sources. Cluster analysis techniques are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level correspondences, classification techniques are used to identify matching tuples. Statistical analysis techniques are then applied to a preliminary integrated data set to evaluate the relationships among schema elements more accurately. Any improvement in the schema-level correspondences triggers another iteration of the procedure. We have performed an empirical evaluation using real-world heterogeneous data sources and report in this paper some promising results (i.e., incremental improvement in identified correspondences) that demonstrate the utility of the proposed iterative procedure.
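The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the concrete algorithms, so attribute-name similarity stands in for cluster analysis, a value-agreement threshold stands in for classification, and a per-pair agreement rate stands in for statistical analysis. All function names, thresholds, and sample records below are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def initial_correspondences(attrs_a, attrs_b, threshold=0.6):
    """Stand-in for cluster analysis: pair attributes with similar names."""
    pairs = set()
    for a in attrs_a:
        best = max(attrs_b, key=lambda b: name_similarity(a, b))
        if name_similarity(a, best) >= threshold:
            pairs.add((a, best))
    return pairs


def match_tuples(rows_a, rows_b, pairs, min_agreement=0.5):
    """Stand-in for classification: match rows agreeing on enough paired attributes."""
    if not pairs:
        return []
    matches = []
    for i, ra in enumerate(rows_a):
        for j, rb in enumerate(rows_b):
            agree = sum(ra[a] == rb[b] for a, b in pairs)
            if agree / len(pairs) >= min_agreement:
                matches.append((i, j))
    return matches


def refine_correspondences(rows_a, rows_b, matches, attrs_a, attrs_b, min_rate=0.8):
    """Stand-in for statistical analysis: keep pairs whose values agree across matched rows."""
    if not matches:
        return set()
    pairs = set()
    for a in attrs_a:
        for b in attrs_b:
            hits = sum(rows_a[i][a] == rows_b[j][b] for i, j in matches)
            if hits / len(matches) >= min_rate:
                pairs.add((a, b))
    return pairs


def integrate(rows_a, rows_b, max_iterations=10):
    """Alternate instance matching and schema refinement until the schema pairs stop changing."""
    attrs_a, attrs_b = list(rows_a[0]), list(rows_b[0])
    pairs = initial_correspondences(attrs_a, attrs_b)
    matches = []
    for _ in range(max_iterations):
        matches = match_tuples(rows_a, rows_b, pairs)
        refined = refine_correspondences(rows_a, rows_b, matches, attrs_a, attrs_b)
        if refined == pairs:  # no change at the schema level, so stop iterating
            break
        pairs = refined
    return pairs, matches


# Hypothetical sample records from two sources with differently named attributes.
rows_a = [
    {"name": "Ann", "phone": "555-1", "city": "Tucson"},
    {"name": "Bob", "phone": "555-2", "city": "Phoenix"},
]
rows_b = [
    {"cust_name": "Ann", "tel": "555-1", "town": "Tucson"},
    {"cust_name": "Bob", "tel": "555-2", "town": "Phoenix"},
]
pairs, matches = integrate(rows_a, rows_b)
```

In this toy data, name similarity alone links only one attribute pair at first; the matched tuples then expose further attribute pairs whose values agree, which is the kind of incremental improvement the abstract reports.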

Original language: English (US)
Pages (from-to): 281-303
Number of pages: 23
Journal: Data and Knowledge Engineering
Volume: 61
Issue number: 2
State: Published - May 2007

Keywords

  • Data integration
  • Heterogeneous databases
  • Semantic correspondence

ASJC Scopus subject areas

  • Information Systems and Management
