TY - JOUR
T1 - Entity matching across heterogeneous data sources
T2 - An approach based on constrained cascade generalization
AU - Zhao, Huimin
AU - Ram, Sudha
N1 - Funding Information:
Sudha Ram is McClelland Professor of Management Information Systems in the Eller School of Management at the University of Arizona. She received her MBA from the Indian Institute of Management, Calcutta in 1981 and a Ph.D. from the University of Illinois at Urbana-Champaign in 1985. Dr. Ram has published articles in such journals as Communications of the ACM, IEEE Transactions on Knowledge and Data Engineering, Information Systems, Information Systems Research, Management Science, and MIS Quarterly. Dr. Ram’s research deals with issues related to Enterprise Data Management. Her research has been funded by organizations such as SAP, IBM, Intel Corporation, Raytheon, US Army, NIST, NSF, NASA, and Office of Research and Development of the CIA. Specifically, her research deals with Interoperability among Heterogeneous Database Systems, Semantic Modeling, and BioInformatics and Spatio-Temporal Semantics. Dr. Ram serves as a senior editor for Information Systems Research. She also serves on the editorial boards of such journals as Decision Support Systems, Information Systems Frontiers, and Journal of Information Technology and Management, and as an associate editor for the Journal of Database Management and the Journal of Systems and Software. She has chaired several workshops and conferences supported by ACM, IEEE, and AIS. She is a cofounder of the Workshop on Information Technology and Systems (WITS), serves on the steering committees of many workshops and conferences, and is currently the chair of the steering committee for the Entity Relationship Conference (ER). Dr. Ram is a member of ACM, IEEE Computer Society, INFORMS, and the Association for Information Systems (AIS). She is also the director of the Advanced Database Research Group based at the University of Arizona.
PY - 2008/9
Y1 - 2008/9
N2 - To integrate or link the data stored in heterogeneous data sources, a critical problem is entity matching, i.e., matching records representing semantically corresponding entities in the real world, across the sources. While decision tree techniques have been used to learn entity matching rules, most decision tree learners have an inherent representational bias, that is, they generate univariate trees and restrict the decision boundaries to be axis-orthogonal hyper-planes in the feature space. Cascading other classification methods with decision tree learners can alleviate this bias and potentially increase classification accuracy. In this paper, the authors apply a recently-developed constrained cascade generalization method in entity matching and report on empirical evaluation using real-world data. The evaluation results show that this method outperforms the base classification methods in terms of classification accuracy, especially in the dirtiest case.
KW - Cascade generalization
KW - Decision tree
KW - Entity matching
KW - Heterogeneous databases
KW - Record linkage
UR - http://www.scopus.com/inward/record.url?scp=47849087202&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=47849087202&partnerID=8YFLogxK
U2 - 10.1016/j.datak.2008.04.007
DO - 10.1016/j.datak.2008.04.007
M3 - Article
AN - SCOPUS:47849087202
SN - 0169-023X
VL - 66
SP - 368
EP - 381
JO - Data and Knowledge Engineering
JF - Data and Knowledge Engineering
IS - 3
ER -