Deciding on Number of Clusters by Multi-Objective Optimization and Validity Analysis
Tansel Özyer and Reda Alhajj
Clustering is an unsupervised process that partitions a given set of objects into groups. The effectiveness of a clustering approach is judged mainly by its ability to produce clusters that maximize both within-cluster similarity and between-cluster dissimilarity. However, clustering algorithms typically expect the number of clusters to be specified beforehand, which requires domain expertise. In this study, we demonstrate the effectiveness of different validity indices in guiding a clustering approach that automatically determines the number of clusters before the actual clustering process starts. This is achieved by first running a multi-objective genetic algorithm on a sample of the given dataset to find a set of alternative solutions over a given range of candidate numbers of clusters. We then apply cluster validity indices to these solutions to find the most appropriate number of clusters. We run the genetic algorithm on a sample rather than on the whole dataset because we want to benefit from its ability to estimate the number of clusters automatically without suffering from its poor performance as the dataset size increases. Finally, we run CURE on the whole dataset, supplying the determined number of clusters as input, to do the actual clustering. The reported test results on two datasets demonstrate the applicability, efficiency and effectiveness of the proposed approach.
Keywords: CURE, clustering, data mining, genetic algorithm, multi-objective optimization, validity analysis.
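The sample-then-validate idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain k-means with farthest-first initialization stands in for the multi-objective genetic algorithm, the mean silhouette width stands in for the validity indices (the abstract does not name which ones are used), and the final CURE step is omitted. All function names and parameters (`kmeans`, `silhouette`, `estimate_k`, `sample_size`) are hypothetical and chosen only for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, rng, iters=50):
    """Plain k-means; an assumed stand-in for the multi-objective GA."""
    # Farthest-first initialization: random first center, then repeatedly
    # pick the point farthest from all centers chosen so far.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette width; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def estimate_k(data, k_range, sample_size, seed=0):
    """Score each candidate k on a random sample; return (best k, all scores)."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    scores = {k: silhouette(sample, kmeans(sample, k, rng)) for k in k_range}
    return max(scores, key=scores.get), scores
```

In a full pipeline following the abstract, the value returned by `estimate_k` would then be passed as the cluster count to a clustering run over the entire dataset (CURE in the paper). Scoring only a sample keeps the expensive search step independent of the full dataset size, which is the design choice the abstract motivates.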