NOCEA: A rule-based evolutionary algorithm for efficient and effective clustering of massive high-dimensional databases

Ioannis A. Sarafis, Phil W. Trinder, A. M S Zalzala

Research output: Contribution to journalArticlepeer-review

23 Citations (Scopus)

Abstract

Clustering is a descriptive data mining task aiming to group the data into homogeneous groups. This paper presents a novel evolutionary algorithm (NOCEA) that efficiently and effectively clusters massive numerical databases. NOCEA evolves individuals of variable-length consisting of disjoint and axis-aligned hyper-rectangular rules with homogeneous data distribution. The antecedent part of the rules includes an interval-like condition for each dimension. A novel quantisation algorithm imposes a regular multi-dimensional grid structure onto the data space to reduce the search combinations. Due to quantisation the boundaries of the intervals are encoded as integer values. The evolutionary search is guided by a simple data coverage maximisation function. The enormous data space is effectively explored by task-specific recombination and mutation operators producing candidate solutions with no overlapping rules. A parsimony generalisation operator shortens the discovered knowledge by replacing adjacent rules with more generic ones. NOCEA employs a special homogeneity operator that enforces quasi-uniform data distribution in the space enclosed by the candidate rules. After convergence the discovered knowledge undergoes simplification to perform subspace clustering, and to assemble the clusters. Results using real-world datasets are included to show that NOCEA has several attractive properties for clustering including: (a) comprehensible output in the form of disjoint and homogeneous rules, (b) the ability to discover clusters of arbitrary shape, density, size, and data coverage, (c) ability to perform effective subspace clustering, (d) near linear scalability with the database size, data and cluster dimensionality, and (e) substantial potential for task parallelism (speedup of 13.8 on 16 processors). A real-world example is a detailed study of the seismicity along the African-Eurasian-Arabian plate boundaries. © 2006 Elsevier B.V. All rights reserved.

Original languageEnglish
Pages (from-to)668-710
Number of pages43
JournalApplied Soft Computing
Volume7
Issue number3
DOIs
Publication statusPublished - Jun 2007

Keywords

  • Clustering
  • Data mining
  • Earthquake analysis
  • Evolutionary algorithms
  • Rule extraction

Fingerprint Dive into the research topics of 'NOCEA: A rule-based evolutionary algorithm for efficient and effective clustering of massive high-dimensional databases'. Together they form a unique fingerprint.

Cite this