A novel clustering algorithm with a new similarity measure and ensemble methods for mixed data clustering

Research output: Thesis, Doctoral Thesis

Abstract

This thesis addressed four specific issues in clustering: (1) clustering algorithms, (2) similarity measures, (3) the number of clusters, K, and (4) clustering ensemble methods. Following an in-depth review of clustering methods, a new three-staged (3-Staged) clustering algorithm is proposed, with three new key aspects: (1) a new method for automatically estimating the K value, (2) a new similarity measure, and (3) initiating the clustering process with a promising BASE. A BASE is a real sample that acts like a centroid or a medoid in common clustering methods, but it is determined differently in our approach. The new similarity measure is defined specifically to reflect the degree of relative change between data samples and, more importantly, to accommodate both numerical and categorical variables. We have proven mathematically that the proposed similarity measure meets the three properties of a metric. This research also investigated the problem of determining the appropriate number of clusters in a dataset and devised a novel function, integrated into our 3-Staged clustering algorithm, to automatically estimate the most appropriate number of clusters, K. Based on our new 3-Staged clustering algorithm, we developed two new ensemble algorithms. For all experiments, we used publicly available real-world benchmark datasets, as these have been commonly used by other researchers. Experimental results showed that the 3-Staged clustering algorithm performed better than the compared individual methods, including K-means and TwoStep, and also better than some ensemble-based methods such as K-ANMI and ccdByEnsemble. They also showed that the proposed similarity measure is very effective in improving clustering quality. In addition, they showed that our proposed method for estimating the K value identified the correct number of clusters for most of the tested datasets.
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • University of East Anglia
Supervisors/Advisors
  • Wang, Wenjia, Supervisor, External person
  • Rayward Smith, Vic, Supervisor, External person
Award date: Oct 10 2010
Place of Publication: United Kingdom
Edition: 1
Electronic ISBNs: 0000 0004 2700 321X
Publication status: Published - Oct 10 2010

Keywords

  • Clustering
  • similarity measure
  • number of Clusters
  • threshold value
  • ensemble clustering
  • number of K
