TY - GEN
T1 - A hybrid method for estimating the predominant number of clusters in a data set
AU - Alshaqsi, Jamil
AU - Wang, Wenjia
PY - 2012
Y1 - 2012
N2 - In cluster analysis, determining the number of clusters, K, for a given dataset is an important yet very tricky task, because for non-trivial real-world problems there is often no universally accepted right or wrong answer, and the answer also depends on the context and purpose of the cluster study. This paper presents a new hybrid method for automatically estimating the predominant number of clusters. It employs a new similarity measure, calculates the length of constant similarity intervals, L, and takes the longest consistent intervals to represent the most probable numbers of clusters in the given context. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method has been tested on 3 synthetic datasets and 8 real-world benchmark datasets, and compared with some other popular methods. The experimental results show that the proposed method is able to determine the desired number of clusters for all the synthetic datasets and most of the benchmark datasets, and the statistical tests indicate that the method is significantly better.
AB - In cluster analysis, determining the number of clusters, K, for a given dataset is an important yet very tricky task, because for non-trivial real-world problems there is often no universally accepted right or wrong answer, and the answer also depends on the context and purpose of the cluster study. This paper presents a new hybrid method for automatically estimating the predominant number of clusters. It employs a new similarity measure, calculates the length of constant similarity intervals, L, and takes the longest consistent intervals to represent the most probable numbers of clusters in the given context. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method has been tested on 3 synthetic datasets and 8 real-world benchmark datasets, and compared with some other popular methods. The experimental results show that the proposed method is able to determine the desired number of clusters for all the synthetic datasets and most of the benchmark datasets, and the statistical tests indicate that the method is significantly better.
KW - cluster analysis
KW - cluster number
KW - similarity measure
UR - http://www.scopus.com/inward/record.url?scp=84873589657&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84873589657&partnerID=8YFLogxK
U2 - 10.1109/ICMLA.2012.146
DO - 10.1109/ICMLA.2012.146
M3 - Conference contribution
AN - SCOPUS:84873589657
SN - 9780769549132
T3 - Proceedings - 2012 11th International Conference on Machine Learning and Applications, ICMLA 2012
SP - 569
EP - 573
BT - Proceedings - 2012 11th International Conference on Machine Learning and Applications, ICMLA 2012
T2 - 11th IEEE International Conference on Machine Learning and Applications, ICMLA 2012
Y2 - 12 December 2012 through 15 December 2012
ER -