TY - GEN

T1 - Which distance metric is right

T2 - 12th SIAM International Conference on Data Mining, SDM 2012

AU - Liu, Chuanren

AU - Hu, Tianming

AU - Ge, Yong

AU - Xiong, Hui

PY - 2012

Y1 - 2012

N2 - It is well known that the distance metric plays an important role in the clustering process. Indeed, many clustering problems can be treated as an optimization problem of a criterion function defined over one distance metric. While many distance metrics have been developed, it is not clear that how these distance metrics can impact on the clustering/optimization process. To that end, in this paper, we study the impact of a set of popular cosine-based distance metrics on K-means clustering. Specifically, by revealing the common order-preserving property, we first show that K-means has exactly the same cluster assignment for these metrics during the E-step. Next, by both theoretical and empirical studies, we prove that the cluster centroid is a good approximator of their respective optimal centers in the M-step. As such, we identify a problem with K-means: it cannot differentiate these metrics. To explore the nature of these metrics, we propose an evolutionary K-means framework that integrates K-means and genetic algorithms. This framework not only enables inspection of arbitrary distance metrics, but also can be used to investigate different formulations of the optimization problem. Finally, this framework is used in extensive experiments on real-world data sets. The results validate our theoretical findings on the characteristics and interrelationships of these metrics. Most importantly, this paper furthers our understanding of the impact of the distance metrics on the optimization process of K-means.

AB - It is well known that the distance metric plays an important role in the clustering process. Indeed, many clustering problems can be treated as an optimization problem of a criterion function defined over one distance metric. While many distance metrics have been developed, it is not clear that how these distance metrics can impact on the clustering/optimization process. To that end, in this paper, we study the impact of a set of popular cosine-based distance metrics on K-means clustering. Specifically, by revealing the common order-preserving property, we first show that K-means has exactly the same cluster assignment for these metrics during the E-step. Next, by both theoretical and empirical studies, we prove that the cluster centroid is a good approximator of their respective optimal centers in the M-step. As such, we identify a problem with K-means: it cannot differentiate these metrics. To explore the nature of these metrics, we propose an evolutionary K-means framework that integrates K-means and genetic algorithms. This framework not only enables inspection of arbitrary distance metrics, but also can be used to investigate different formulations of the optimization problem. Finally, this framework is used in extensive experiments on real-world data sets. The results validate our theoretical findings on the characteristics and interrelationships of these metrics. Most importantly, this paper furthers our understanding of the impact of the distance metrics on the optimization process of K-means.

KW - Distance metric

KW - Document clustering

KW - Genetic algorithm

KW - K-means

UR - http://www.scopus.com/inward/record.url?scp=84880225084&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880225084&partnerID=8YFLogxK

U2 - 10.1137/1.9781611972825.78

DO - 10.1137/1.9781611972825.78

M3 - Conference contribution

AN - SCOPUS:84880225084

SN - 9781611972320

T3 - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012

SP - 907

EP - 918

BT - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012

PB - Society for Industrial and Applied Mathematics Publications

Y2 - 26 April 2012 through 28 April 2012

ER -