TY - GEN
T1 - Which distance metric is right
T2 - 12th SIAM International Conference on Data Mining, SDM 2012
AU - Liu, Chuanren
AU - Hu, Tianming
AU - Ge, Yong
AU - Xiong, Hui
PY - 2012
Y1 - 2012
N2 - It is well known that the distance metric plays an important role in the clustering process. Indeed, many clustering problems can be treated as an optimization problem of a criterion function defined over one distance metric. While many distance metrics have been developed, it is not clear that how these distance metrics can impact on the clustering/optimization process. To that end, in this paper, we study the impact of a set of popular cosine-based distance metrics on K-means clustering. Specifically, by revealing the common order-preserving property, we first show that K-means has exactly the same cluster assignment for these metrics during the E-step. Next, by both theoretical and empirical studies, we prove that the cluster centroid is a good approximator of their respective optimal centers in the M-step. As such, we identify a problem with K-means: it cannot differentiate these metrics. To explore the nature of these metrics, we propose an evolutionary K-means framework that integrates K-means and genetic algorithms. This framework not only enables inspection of arbitrary distance metrics, but also can be used to investigate different formulations of the optimization problem. Finally, this framework is used in extensive experiments on real-world data sets. The results validate our theoretical findings on the characteristics and interrelationships of these metrics. Most importantly, this paper furthers our understanding of the impact of the distance metrics on the optimization process of K-means.
AB - It is well known that the distance metric plays an important role in the clustering process. Indeed, many clustering problems can be treated as an optimization problem of a criterion function defined over one distance metric. While many distance metrics have been developed, it is not clear that how these distance metrics can impact on the clustering/optimization process. To that end, in this paper, we study the impact of a set of popular cosine-based distance metrics on K-means clustering. Specifically, by revealing the common order-preserving property, we first show that K-means has exactly the same cluster assignment for these metrics during the E-step. Next, by both theoretical and empirical studies, we prove that the cluster centroid is a good approximator of their respective optimal centers in the M-step. As such, we identify a problem with K-means: it cannot differentiate these metrics. To explore the nature of these metrics, we propose an evolutionary K-means framework that integrates K-means and genetic algorithms. This framework not only enables inspection of arbitrary distance metrics, but also can be used to investigate different formulations of the optimization problem. Finally, this framework is used in extensive experiments on real-world data sets. The results validate our theoretical findings on the characteristics and interrelationships of these metrics. Most importantly, this paper furthers our understanding of the impact of the distance metrics on the optimization process of K-means.
KW - Distance metric
KW - Document clustering
KW - Genetic algorithm
KW - K-means
UR - http://www.scopus.com/inward/record.url?scp=84880225084&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84880225084&partnerID=8YFLogxK
U2 - 10.1137/1.9781611972825.78
DO - 10.1137/1.9781611972825.78
M3 - Conference contribution
AN - SCOPUS:84880225084
SN - 9781611972320
T3 - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
SP - 907
EP - 918
BT - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
PB - Society for Industrial and Applied Mathematics Publications
Y2 - 26 April 2012 through 28 April 2012
ER -