Abstract
Organizations today continuously capture very large amounts of data, and these volumes will only continue to grow. In this paper we present a new approach to discovering clusters in massive, complex (i.e., multidimensional), continuously arriving data that are far too large to be analyzed as a single dataset. To guarantee acceptable scalability, our approach builds on the existing data mining literature and combines sampling-based techniques, an advanced variation of hierarchical agglomerative clustering, and sample-based cluster reconstruction to provide an approximate clustering solution of very high accuracy. We test the proposed approach empirically and show that it provides excellent clustering performance while delivering significant computational savings.
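The general pattern the abstract describes (cluster a small sample, then reconstruct clusters for the full data from that sample) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify their hierarchical clustering variant or their reconstruction step, so the code below substitutes plain centroid-linkage agglomerative clustering and nearest-centroid assignment. All function names and parameters (`cluster_by_sampling`, `sample_size`, and so on) are hypothetical, and the two-blob dataset is synthetic.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def agglomerate(sample, k):
    """Centroid-linkage agglomerative clustering of `sample` down to k clusters.

    Assumption: a stand-in for the paper's unspecified "advanced variation".
    """
    clusters = [[p] for p in sample]  # start with every point as its own cluster
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(cents[i], cents[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    return clusters


def cluster_by_sampling(data, k, sample_size, seed=0):
    """Cluster a random sample, then label all points by nearest sample centroid.

    Assumption: nearest-centroid assignment stands in for the paper's
    sample-based cluster reconstruction.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    cents = [centroid(c) for c in agglomerate(sample, k)]
    labels = [
        min(range(len(cents)), key=lambda c: math.dist(p, cents[c]))
        for p in data
    ]
    return cents, labels


# Synthetic data: two well-separated 2-D blobs of 500 points each.
rng = random.Random(42)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(500)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(500)]
data = blob_a + blob_b

centroids, labels = cluster_by_sampling(data, k=2, sample_size=40)
```

In this sketch the quadratic-cost pairwise search runs only over the 40 sampled points, while the remaining points each need just `k` distance computations. That is one simple way a sampling-based scheme can trade exactness for computational savings.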
Original language | English (US) |
---|---|
Pages | 121-126 |
Number of pages | 6 |
State | Published - 2008 |
Externally published | Yes |
Event | 2008 Workshop on Information Technologies and Systems, WITS 2008 - Paris, France; Duration: Dec 13 2008 → Dec 14 2008 |
ASJC Scopus subject areas
- Information Systems
- Control and Systems Engineering