Cluster analysis

Cluster analysis is a data analysis technique that partitions a set of objects into groups called clusters, where objects within each cluster exhibit greater similarity to one another than to objects in other clusters. The method belongs to unsupervised machine learning and serves as a fundamental tool in exploratory data analysis across disciplines including statistics, pattern recognition, bioinformatics, and market research^[1].

Origins and historical development

The roots of cluster analysis trace back to anthropology. In 1932, Harold E. Driver and Alfred L. Kroeber published "Quantitative Expression of Cultural Relationships" in the University of California Publications in American Archaeology and Ethnology, where they sought to classify cultures based on shared cultural elements^[2]. Joseph Zubin introduced the method to psychology in 1938. Robert Tryon coined the term "cluster analysis" in his 1939 work "Cluster Analysis: Correlation Profile and Orthometric (factor) Analysis for the Isolation of Unities in Mind and Personality," published by Edwards Brothers.

Raymond Cattell applied cluster analysis to trait theory classification in personality psychology beginning in 1943. The method gained widespread scientific acceptance following Robert Sokal and Peter Sneath's 1963 publication "Principles of Numerical Taxonomy," which motivated global research on clustering methods.

Types of clustering algorithms

More than 100 clustering algorithms have been published to date. They differ substantially in how they define a cluster and in how efficiently they find clusters.

Centroid-based clustering

K-means remains one of the most widely used clustering algorithms. It partitions the data into k clusters by assigning each data point to the nearest centroid, then recomputing each centroid as the mean of its assigned points. The two steps repeat until the assignments stop changing.
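
The assign-and-recompute loop can be illustrated with a minimal sketch in plain Python. The function name `kmeans`, the random choice of initial centroids, the fixed seed, and the toy data are assumptions made for this illustration; practical implementations usually add better initialization and repeated restarts.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k points drawn at random from the data.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data: two visually separate groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the initial centroids, so only the grouping itself is meaningful, not the label values.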

Hierarchical clustering

This approach arranges data into a tree structure (dendrogram) that shows how clusters nest within one another. Two main subtypes exist:

Agglomerative clustering uses a bottom-up approach, starting with individual data points and merging them progressively
Divisive clustering takes a top-down approach, beginning with all data in one cluster and splitting it recursively
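
The bottom-up variant can be sketched in a few lines of plain Python. This is a deliberately naive illustration, assuming single linkage (distance between the closest pair of members) as the merge criterion; the function name `single_linkage` and the toy data are hypothetical. The recorded merge order is the information a dendrogram would display.

```python
import math


def single_linkage(points, num_clusters):
    """Naive agglomerative clustering with single linkage."""
    # Start with every point in its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(points))]
    merges = []  # (cluster_a, cluster_b, distance) in merge order

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters and merge it.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[a], clusters[b], linkage(clusters[a], clusters[b])))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Toy one-dimensional data: two nearby pairs and one distant point.
points = [(0.0,), (0.5,), (4.0,), (4.4,), (10.0,)]
clusters, merges = single_linkage(points, num_clusters=2)
print(clusters)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

Stopping at a chosen number of clusters corresponds to cutting the dendrogram at one level; running the loop down to a single cluster would record the full hierarchy.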

Density-based clustering

Algorithms such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS, and HDBSCAN define clusters as connected dense regions in data space. These methods excel at discovering clusters of arbitrary shape.
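
A minimal sketch of the DBSCAN idea, written in plain Python with a brute-force neighbor search, may help make "connected dense regions" concrete. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a core point) follow common usage but are assumptions here, as is the toy data; a point counts as its own neighbor in this sketch.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point: keep expanding
    return labels


# Two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Unlike centroid-based methods, the number of clusters is not specified in advance, and the isolated point is reported as noise rather than forced into a cluster.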

Distribution-based clustering

Distribution-based methods model each cluster as a statistical distribution. The best-known example is the Gaussian mixture model, which assumes multivariate normal components and is typically fitted with the expectation-maximization (EM) algorithm. More generally, model-based clustering treats the data as arising from a mixture of probability distributions.
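
As an illustration, the sketch below fits a mixture of one-dimensional normal distributions with expectation-maximization. It is a simplified example under stated assumptions: the data are univariate rather than multivariate, the starting parameters are supplied by the caller, the iteration count is fixed, and the names `normal_pdf` and `em_gmm_1d` are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a one-dimensional Gaussian mixture with len(means) components."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the weighted observations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return means, variances, weights, resp


# Toy data drawn by hand around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print([round(mu, 2) for mu in means])
print([round(w, 2) for w in weights])
```

The responsibilities in `resp` are soft assignments; taking the component with the largest responsibility for each observation yields a hard clustering.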

Fuzzy clustering

Traditional methods assign each data point to exactly one cluster. Fuzzy clustering extends this by allowing data points to belong to multiple clusters with varying degrees of membership. This proves useful when clusters overlap or boundaries are ambiguous.
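
Degrees of membership can be made concrete with a sketch of fuzzy c-means, one common fuzzy clustering algorithm (named here as an illustration; the text above does not single out a specific method). The fuzzifier `m`, the caller-supplied starting centers, the fixed iteration count, and the toy data are all assumptions of this example.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    """Fuzzy c-means with fuzzifier m > 1, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        # Membership update: closer centers receive higher membership,
        # and each point's memberships sum to 1.
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Center update: mean of all points weighted by membership ** m.
        new_centers = []
        for j in range(len(centers)):
            w = [u[j] ** m for u in memberships]
            new_centers.append(
                tuple(
                    sum(wi * p[d] for wi, p in zip(w, points)) / sum(w)
                    for d in range(len(points[0]))
                )
            )
        centers = new_centers
    return memberships, centers


# Two groups plus one point halfway between them.
points = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,), (5.0,)]
memberships, centers = fuzzy_c_means(points, centers=[(0.0,), (10.0,)])
for p, u in zip(points, memberships):
    print(p, [round(v, 3) for v in u])
```

In this symmetric toy data set the midpoint at 5.0 receives roughly equal membership in both clusters, while the points at either end belong almost entirely to one.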

Applications

Cluster analysis finds application in numerous domains:

Market segmentation - grouping customers by purchasing behavior or demographics
Social network analysis - identifying communities within networks
Medical imaging - detecting tumors or anatomical structures
Bioinformatics - grouping genes with similar expression patterns
Image compression and computer graphics
Document classification and information retrieval

Challenges and limitations

Several challenges complicate cluster analysis. Defining similarity can be subjective and context-dependent. Determining the optimal number of clusters often relies on domain knowledge or heuristic methods like the elbow method. Many algorithms struggle with scalability when applied to large datasets. High-dimensional data presents problems due to the curse of dimensionality, where distance measures become less meaningful as dimensions increase.
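
The loss of contrast between distances in high dimensions can be observed with a small experiment. The sketch below is an illustration under assumptions: points are drawn uniformly from the unit hypercube, Euclidean distance is used, and the hypothetical helper `relative_contrast` reports how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

When the contrast approaches zero, the nearest and farthest neighbors are almost equally far away, which is why distance-based cluster definitions lose discriminating power in such settings.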


References

Driver, H.E. and Kroeber, A.L. (1932). Quantitative expression of cultural relationships. University of California Publications in American Archaeology and Ethnology, 31, 211-256.
Tryon, R.C. (1939). Cluster Analysis: Correlation Profile and Orthometric (factor) Analysis for the Isolation of Unities in Mind and Personality. Edwards Brothers.
Sokal, R.R. and Sneath, P.H.A. (1963). Principles of Numerical Taxonomy. W.H. Freeman.
Jain, A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.

Footnotes

[1] Jain, A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.
[2] Driver, H.E. and Kroeber, A.L. (1932). Quantitative expression of cultural relationships. University of California Publications in American Archaeology and Ethnology, 31, 211-256.

Author

Sławomir Wawak