Estimating the predominant number of clusters in a dataset

Jamil Al Shaqsi, Wenjia Wang

Research output: Contribution to journal › Article

Abstract

In cluster analysis, finding the number of clusters, K, for a given dataset is an important yet very tricky task, simply because there is no universally accepted right or wrong answer for most real-world problems; it all depends on the context and purpose of the cluster study. Numerous methods have been developed for estimating K, but most are not widely used in practice because of their poor performance. Thus, it is still quite common for many clustering methods that a human user must select a specific value or a range for K before they are used. An inappropriate predetermined K can lead to poor clustering results. This paper presents a new method for estimating the most probable number of clusters automatically. It first calculates the lengths of the constant similarity intervals, L, and then takes the longest ones as representing the most probable numbers of clusters under the given context and the chosen similarity measure. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method was tested on 3 synthetic datasets and 8 real-world benchmark datasets and compared with other popular methods, in particular the TwoStep method implemented in the IBM/SPSS Modeler software package. The experimental results showed that the proposed method is able to find the "desired" predominant number of clusters for all the simulated datasets and most of the benchmark datasets, and statistical tests indicate that our method is significantly better.
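
The interval idea in the abstract can be illustrated with a small sketch. This is only one possible reading, not the authors' algorithm: it assumes single-linkage agglomerative clustering and uses Euclidean distance as a stand-in for the paper's similarity measure. The function names (`single_linkage_merge_distances`, `interval_lengths`, `estimate_k`) and the `k_max` parameter are hypothetical. For each candidate K, the code measures how long the clustering stays at exactly K clusters as the merge threshold grows, and returns the K with the longest such interval.

```python
import math
from itertools import combinations


def single_linkage_merge_distances(points):
    """Return the n-1 single-linkage merge distances in ascending order.

    Assumption: Euclidean distance stands in for the paper's similarity
    measure. The merge distances equal the edge weights of a minimum
    spanning tree, built here with Kruskal's algorithm and union-find.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the set representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges.append(d)
    return merges


def interval_lengths(points, k_max):
    """Map each K in 2..k_max to the length of the threshold interval
    over which the dataset is split into exactly K clusters."""
    merges = single_linkage_merge_distances(points)
    n = len(points)
    lengths = {}
    for k in range(2, min(k_max, n - 1) + 1):
        # K clusters appear after merge number n-k and vanish at merge n-k+1.
        lengths[k] = merges[n - k] - merges[n - k - 1]
    return lengths


def estimate_k(points, k_max=10):
    """Pick the K whose constant interval is longest (hypothetical rule).

    Raises ValueError if there are fewer than 3 points.
    """
    lengths = interval_lengths(points, k_max)
    return max(lengths, key=lengths.get)


# Three well-separated groups of three points each.
demo_points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
print(estimate_k(demo_points))
```

K = 1 is left out because its interval is unbounded; the paper's error function and statistical tests are not reproduced here.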

Original language: English
Pages (from-to): 603-626
Number of pages: 24
Journal: Intelligent Data Analysis
Volume: 17
Issue number: 4
DOI: 10.3233/IDA-130596
Publication status: Published - 2013

Keywords

  • Cluster analysis
  • cluster number
  • cluster validity
  • similarity measure

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence

Cite this

Estimating the predominant number of clusters in a dataset. / Al Shaqsi, Jamil; Wang, Wenjia.

In: Intelligent Data Analysis, Vol. 17, No. 4, 2013, p. 603-626.

@article{ae1e96bcae4143e484c1f01db8afa670,
title = "Estimating the predominant number of clusters in a dataset",
abstract = "In cluster analysis, finding the number of clusters, K, for a given dataset is an important yet very tricky task, simply because there is no universally accepted right or wrong answer for most real-world problems; it all depends on the context and purpose of the cluster study. Numerous methods have been developed for estimating K, but most are not widely used in practice because of their poor performance. Thus, it is still quite common for many clustering methods that a human user must select a specific value or a range for K before they are used. An inappropriate predetermined K can lead to poor clustering results. This paper presents a new method for estimating the most probable number of clusters automatically. It first calculates the lengths of the constant similarity intervals, L, and then takes the longest ones as representing the most probable numbers of clusters under the given context and the chosen similarity measure. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method was tested on 3 synthetic datasets and 8 real-world benchmark datasets and compared with other popular methods, in particular the TwoStep method implemented in the IBM/SPSS Modeler software package. The experimental results showed that the proposed method is able to find the {"}desired{"} predominant number of clusters for all the simulated datasets and most of the benchmark datasets, and statistical tests indicate that our method is significantly better.",
keywords = "Cluster analysis, cluster number, cluster validity, similarity measure",
author = "{Al Shaqsi}, Jamil and Wang, Wenjia",
year = "2013",
doi = "10.3233/IDA-130596",
language = "English",
volume = "17",
pages = "603--626",
journal = "Intelligent Data Analysis",
issn = "1088-467X",
publisher = "IOS Press",
number = "4",

}

TY - JOUR

T1 - Estimating the predominant number of clusters in a dataset

AU - Al Shaqsi, Jamil

AU - Wang, Wenjia

PY - 2013

Y1 - 2013

AB - In cluster analysis, finding the number of clusters, K, for a given dataset is an important yet very tricky task, simply because there is no universally accepted right or wrong answer for most real-world problems; it all depends on the context and purpose of the cluster study. Numerous methods have been developed for estimating K, but most are not widely used in practice because of their poor performance. Thus, it is still quite common for many clustering methods that a human user must select a specific value or a range for K before they are used. An inappropriate predetermined K can lead to poor clustering results. This paper presents a new method for estimating the most probable number of clusters automatically. It first calculates the lengths of the constant similarity intervals, L, and then takes the longest ones as representing the most probable numbers of clusters under the given context and the chosen similarity measure. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method was tested on 3 synthetic datasets and 8 real-world benchmark datasets and compared with other popular methods, in particular the TwoStep method implemented in the IBM/SPSS Modeler software package. The experimental results showed that the proposed method is able to find the "desired" predominant number of clusters for all the simulated datasets and most of the benchmark datasets, and statistical tests indicate that our method is significantly better.

KW - Cluster analysis

KW - cluster number

KW - cluster validity

KW - similarity measure

UR - http://www.scopus.com/inward/record.url?scp=84881420811&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84881420811&partnerID=8YFLogxK

U2 - 10.3233/IDA-130596

DO - 10.3233/IDA-130596

M3 - Article

AN - SCOPUS:84881420811

VL - 17

SP - 603

EP - 626

JO - Intelligent Data Analysis

JF - Intelligent Data Analysis

SN - 1088-467X

IS - 4

ER -