Clustering schema elements for semantic integration of heterogeneous data sources

Huimin Zhao, Sudha Ram

Research output: Contribution to journal › Article › peer-review

27 Scopus citations

Abstract

Interschema relationship identification (IRI), that is, determining the relationships among schema elements in heterogeneous data sources, is an important step in integrating the data sources. This article proposes a cluster-analysis-based approach to semi-automating the IRI process, which is typically very time-consuming and requires extensive human interaction. The authors apply multiple clustering techniques, including K-means, hierarchical clustering, and a self-organizing map (SOM) neural network, to identify similar schema elements from heterogeneous data sources, based on a combination of features such as naming similarity, document similarity, schema specification, data patterns, and usage patterns. An SOM prototype developed by the authors gives users a visualization tool for displaying clustering results and for incrementally evaluating candidate similar elements.
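
As a rough illustration of the general idea (not the authors' implementation), the sketch below groups schema elements using only one of the features named in the abstract, naming similarity. It assumes character-trigram Jaccard similarity as the naming measure and single-link agglomerative clustering cut at a fixed threshold; the function names, the threshold value, and the sample element names are all hypothetical.

```python
from itertools import combinations


def trigrams(name):
    """Return the set of character trigrams of a padded, lower-cased name."""
    padded = f"  {name.lower().replace('_', ' ')} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def name_similarity(a, b):
    """Jaccard similarity of trigram sets (an assumed naming-similarity measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def cluster_elements(elements, threshold=0.3):
    """Single-link agglomerative clustering cut at `threshold`.

    Cutting a single-link dendrogram at a similarity threshold yields the
    connected components of the graph that links every pair whose similarity
    is at least the threshold, so a union-find structure is sufficient.
    Only the last dotted component (the attribute name) is compared.
    """
    parent = {e: e for e in elements}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(elements, 2):
        if name_similarity(a.split(".")[-1], b.split(".")[-1]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for e in elements:
        groups.setdefault(find(e), []).append(e)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Hypothetical elements from two sources, written as source.table.attribute.
    sample = [
        "hr.employee.salary",
        "payroll.staff.salary",
        "hr.employee.zipcode",
    ]
    for group in cluster_elements(sample):
        print(group)
```

Each resulting group is only a set of candidate correspondences for a human to review, in keeping with the semi-automated process the abstract describes. The other features it lists (document similarity, schema specification, data patterns, usage patterns) would need to be added as further dimensions of the similarity measure.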

Original language: English (US)
Pages (from-to): 88-106
Number of pages: 19
Journal: Journal of Database Management
Volume: 15
Issue number: 4
State: Published - 2004

Keywords

  • Attribute correspondence
  • Cluster analysis
  • Heterogeneous database integration
  • Interschema relationship identification
  • Schema correspondence
  • Self-organizing map

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture
