TY - JOUR
T1 - Combining schema and instance information for integrating heterogeneous data sources
AU - Zhao, Huimin
AU - Ram, Sudha
N1 - Funding Information:
Sudha Ram is Eller Professor of Management Information Systems in the Eller College of Business and Public Administration at the University of Arizona. She received a B.S. degree in Mathematics, Physics and Chemistry from the University of Madras in 1979, a PGDM from the Indian Institute of Management, Calcutta in 1981, and a Ph.D. from the University of Illinois at Urbana-Champaign in 1985. Dr. Ram has published articles in such journals as Communications of the ACM, IEEE Expert, IEEE Transactions on Knowledge and Data Engineering, Information Systems, Information Systems Research, Management Science, and MIS Quarterly. Her research deals with issues related to Enterprise Data Management. Her research has been funded by organizations such as IBM, Intel Corporation, Raytheon, US Army, NIST, NSF, NASA, and the Office of Research and Development of the CIA. Specifically, her research deals with Interoperability among Heterogeneous Database Systems, Semantic Modeling, Bioinformatics and Spatio-Temporal Semantics, Business Rules Modeling, Web Services Discovery and Selection, and Automated Software Tools for Database Design. Dr. Ram serves on the editorial boards of such journals as Decision Support Systems, Information Systems Frontiers, and Journal of Information Technology and Management, and as associate editor for Information Systems Research, Journal of Database Management, and the Journal of Systems and Software. She has chaired several workshops and conferences supported by ACM, IEEE, and AIS. She is a cofounder of the Workshop on Information Technology and Systems (WITS) and serves on the steering committee of many workshops and conferences, including the Entity Relationship Conference (ER). Dr. Ram is a member of ACM, IEEE Computer Society, INFORMS, and the Association for Information Systems (AIS). She is also the director of the Advanced Database Research Group based at the University of Arizona.
PY - 2007/5
Y1 - 2007/5
AB - Determining the correspondences among heterogeneous data sources, which is critical to integrating those sources, is a complex and resource-consuming task that demands automated support. We propose an iterative procedure for detecting both schema-level and instance-level correspondences from heterogeneous data sources. Cluster analysis techniques are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level correspondences, classification techniques are used to identify matching tuples. Statistical analysis techniques are then applied to a preliminary integrated data set to evaluate the relationships among schema elements more accurately. Any improvement in the schema-level correspondences triggers another iteration of the procedure. We have performed an empirical evaluation using real-world heterogeneous data sources and report in this paper some promising results (i.e., incremental improvement in identified correspondences) that demonstrate the utility of the proposed iterative procedure.
KW - Data integration
KW - Heterogeneous databases
KW - Semantic correspondence
UR - http://www.scopus.com/inward/record.url?scp=33947161876&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33947161876&partnerID=8YFLogxK
U2 - 10.1016/j.datak.2006.06.004
DO - 10.1016/j.datak.2006.06.004
M3 - Article
AN - SCOPUS:33947161876
SN - 0169-023X
VL - 61
SP - 281
EP - 303
JO - Data and Knowledge Engineering
JF - Data and Knowledge Engineering
IS - 2
ER -