Cluster analysis: Difference between revisions

Latest revision as of 19:23, 17 November 2023

Cluster analysis is an exploratory procedure for dividing a data set into groups according to the similarity of its elements. Various criteria and characteristics can be used to determine how similar the individual data points are. A cluster analysis is based on the calculation of a similarity measure and belongs to the unsupervised machine learning methods^[1].

Prerequisites of the cluster analysis

A cluster should be maximally homogeneous within itself and clearly distinguishable from other clusters. A clear demarcation must be ensured. Therefore, the following conditions should be met^[2]^[3]:

Size of the data set: Under certain circumstances, a meaningful result can only be achieved with a sufficiently large data set. Depending on the task, it is therefore necessary to weigh up whether the amount of data is sufficient.
Normalization of the data: If the value ranges of the features differ greatly, the data should be normalized beforehand so that no single feature dominates the similarity measure.
Elimination of outliers: Outliers can strongly distort the results. The data should therefore first be examined for extreme values, and any outliers should be removed before clustering.
Bias: If there are strong correlations within the data, the results can end up heavily biased. The data should therefore be checked for such correlations before clustering.
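The normalization and outlier prerequisites listed above can be illustrated with a minimal sketch. It is one possible preprocessing approach, not a method prescribed by the cited sources: the z-score rule for outliers, the min-max rescaling, the function names, the threshold and the sample data (age, income) are all assumptions made for illustration.

```python
from statistics import mean, stdev


def drop_outliers(rows, z_max=3.0):
    """Keep rows in which every feature lies within z_max sample
    standard deviations of that feature's mean (assumed z-score rule)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [stdev(col) for col in columns]
    kept = []
    for row in rows:
        z_scores = [
            abs(value - m) / s if s > 0 else 0.0
            for value, m, s in zip(row, means, stdevs)
        ]
        if max(z_scores) <= z_max:
            kept.append(row)
    return kept


def min_max_normalize(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical sample: (age in years, yearly income); the last row is an extreme value.
customers = [
    (25, 30000), (32, 42000), (41, 52000),
    (38, 48000), (29, 36000), (35, 900000),
]

# Very small samples limit how large a z-score can become,
# so this toy example uses a lower threshold than the default.
cleaned = drop_outliers(customers, z_max=1.8)
normalized = min_max_normalize(cleaned)
```

Outliers are removed before rescaling here because a single extreme value would otherwise compress all remaining values of that feature into a narrow part of the [0, 1] range.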

Procedure of a cluster analysis

In the first step, the characteristics on which similarity will be judged are determined. Next, you select an algorithm with which to analyze your data, which lays the foundation for the formation of clusters. In the third step, the number of clusters is determined and the respective clusters are formed. Here, the data is assigned on the basis of segmentation criteria. When evaluating the grouping, you should consider not only the number of groups but also whether the identified clusters are of similar size^[4].

Cluster analysis methods

There are numerous algorithms for dividing data into clusters. Which method is most suitable generally depends on the question being investigated. Often, the results of different methods are compared at the end to determine the most suitable one. The best-known methods are^[5]:

K-Means: The k-means method is an iterative algorithm. In each iteration, the cluster centers are recomputed, and the similarity of a data point to a cluster center is measured by the Euclidean distance. Each data point is assigned to the cluster whose center is closest to it. The algorithm is quite simple, but the number of clusters must be specified in advance. A further major drawback is that it is very sensitive to outliers.
Hierarchical Cluster Analysis: This method is based on distance measures. A distinction is made between divisive and agglomerative methods. Divisive methods work top-down: initially all objects of the data set belong to a single cluster, which is then split step by step into more and more clusters. Agglomerative methods follow the opposite, bottom-up approach: each object first forms its own cluster, and clusters are merged step by step until all objects belong to one cluster. Once formed, clusters can no longer be changed. How the data is finally partitioned, however, is left to the user. Besides the high computational cost, this is the largest disadvantage of these methods. On the other hand, the number of clusters does not need to be known beforehand.
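Both methods described above can be sketched in a few lines of Python. This is a simplified illustration rather than a reference implementation. Several details are assumptions: the function names, the sample points, starting k-means from randomly chosen data points, and using single linkage (the distance between the closest members) to compare clusters in the agglomerative variant. The text above does not specify a linkage criterion.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Iterative k-means with Euclidean distance: returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed start: k randomly chosen data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: each center moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage: returns a list of clusters."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > max(n_clusters, 1):
        # Find the two clusters whose closest members are nearest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merged clusters are never split again
    return clusters


# Hypothetical sample: two clearly separated groups of 2-D points.
data = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (8.0, 8.0), (8.2, 7.9), (7.9, 8.3),
]

centers, labels = kmeans(data, k=2)
groups = agglomerative(data, n_clusters=2)
```

In the k-means sketch the number of clusters `k` has to be passed in up front. The agglomerative sketch instead builds the hierarchy by repeated merging, and `n_clusters` only decides at which point the merging stops. The pairwise search in every merge step also shows why the hierarchical approach becomes expensive for large data sets.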

Applications of the cluster analysis

Cluster analysis has become a common means of grouping data in a wide variety of fields^[6]:

Marketing: Analyzing customers and sorting them into the right target groups can be an enormous competitive advantage in marketing. Cluster analyses are used here to identify similar customers from the entire customer base and to develop individual advertising strategies for these customers.
Medicine and psychology: Behavioral patterns or disease patterns can also be grouped into clusters. Suitable therapies can then be developed on this basis.

Footnotes

↑ Everitt, Landau, Leese, Stahl, 2011, pp. 2-8.
↑ Aggarwal, Reddy, 2014, pp. 577-583.
↑ Aggarwal, Reddy, 2014, p. 124.
↑ Tian, Xu, 2015, p. 166.
↑ Aggarwal, Reddy, 2014, pp. 89-105.
↑ Everitt, Landau, Leese, Stahl, 2011, pp. 9-13.

Cluster analysis — recommended articles
Descriptive statistics — Mann-Whitney U test — Control limits — Systematic sampling techniques — Parametric analysis — Two-way ANOVA — Decision tree — CUSUM chart — Multiple regression analysis

References

Aggarwal, C. C., Reddy, C. K. (2014). Data Clustering. Algorithms and Applications, "Chapman & Hall".
Everitt, B. S., Landau, S., Leese, M., Stahl, D. (2011). Cluster Analysis, 5th Edition, "Wiley Series in Probability and Statistics".
Tian, Y., Xu, D. (2015). A Comprehensive Survey of Clustering Algorithms, "Annals of Data Science", 2(2), pp. 165-193.

Author: Max Bachmann


@@ Line 1: / Line 1: @@
-Networks of economic entities, such as industrial enterprises and common production cycles, and organizations that provide support services (banks, consulting and marketing firms, research and educational institutions, insurance companies) form clusters, a '''complex economic system'''.
+'''[[Cluster]] analysis''' is an explorative procedure to divide data sets into groups with regard to their similarity. Various criteria and characteristics can be used for cluster analysis, on the basis of which the similarity of the individual data is determined. A cluster analysis is based on the calculation of a similarity measure and belong to the unsupervised machine learning methods<ref>Everitt, Landau, Leese, Stahl, 2011, pp. 2-8.</ref>.
-Over the past decade, cluster policy has become one of the most important focal points of national policy in developed and developing countries to enhance national and regional competitiveness.
-This idea is spreading in the form of clearly defined policies and other policy initiatives such as regional strategies and activities supporting local production systems.
-Most of today's industrialized economies need the institutional support of firms to become more competitive.
+==Prerequisites of the cluster analysis==
-In Afanasiev M., Korchagina N., and Myasnikova L. (2006) is argued that enterprise consolidation and clustering are currently one of the most effective supports for increasing production efficiency.
+A cluster should be maximally homogeneous within itself and clearly distinguishable from other clusters. A clear demarcation must be ensured. Therefore, the following conditions should be met<ref>Aggarwal, Reddy, 2014, pp. 577-583.</ref><ref>Aggarwal, Reddy, 2014, p. 124.</ref>:
-The global economy has been impacted by trends in the cluster's role expansion.
+* '''Size of the data set:''' Under certain circumstances, a meaningful result can only be achieved with a sufficiently large data set. Depending on the task, it is therefore necessary to weigh up whether the amount of data is sufficient.
-Innovative approaches to creating integrated management forms are required for the modern evolution of economic space across the globe, taking into account factors such as:
+* '''Normalization of the data:''' if there are large differences in the value range of the data, the data should be normalized beforehand.
-* Internal and external regionalisation factors include enhancing regional and national competitiveness.
+* '''Elimination of outliers:''' outliers can strongly distort the results. Thus, the data should first be analyzed and evaluated for possible extreme values and outliers should then be eliminated.
-* An increase in regional investments and innovation.
+* '''Bias:''' If there are strong correlations between the data, the results could end up being heavily biased. This must be avoided.
-* Development of long-term forms of economic and territorial integration.
-* Enhancement of regional and national competitiveness.
-* A rise in globalization processes.
-Therefore, based on the proactive promotion of propellant industries, the cluster principle becomes more relevant in terms of creating clusters of growth poles in the regional economy and increasing the effectiveness of public policy.
+==Procedure of a cluster analysis==
+In a first step, the determination of characteristics or corresponding similarities takes place. Next, you should select an [[algorithm]] that you will use to analyze your data and thus lay the foundation for the formation of clusters. Thirdly, the determination of the number of clusters takes place as well as the formation of the respective clusters. Here, the data is assigned on the basis of segmentation criteria. For the grouping to take place, not only the number of groups must be evaluated, but also a similar cluster size for all your identified clusters<ref>Tian, Xu, 2015, pp. 166.</ref>.
+==Cluster analysis methods==
+There are numerous algorithms for dividing data into clusters. Which [[method]] is most suitable generally depends on the question. Often, the results of different methods are compared at the end to determine the correct method. The best known methods are<ref>Aggarwal, Reddy, 2014, pp. 89-105.</ref>:
+* '''K-Means:''' The k-Means method is an iterative algorithm. With each iteration step, the cluster centers are newly determined and the similarity of individual data points to the cluster center is reflected by the Euclidean distance. A data point is assigned to a cluster if the Euclidean distance to it is the smallest. This machine learning algorithm is quite simple, but the number of clusters must be determined in advance. A major drawback of this algorithm is also that it is very sensitive to outliers.
+* '''Hierarchical Cluster Analysis:''' This machine learning method is based on distance measures. A distinction is made between the divisive clustering methods and the agglomerative methods. The divisive procedures belong to the top-down procedures, in which initially all objects of the data set belong to a cluster. Then, step by step, more and more clusters are formed. The agglomerative methods, on the other hand, follow the opposite approach (bottom-up methods). Each object first forms its own cluster, and they are merged step by step until all objects belong to one cluster. Once formed, clusters can then no longer be changed. However, how to partition depends on the user. This is beside the complex computation the largest disadvantage of these methods. However, it is not necessary to know the number of clusters beforehand.
+==Applications of the cluster analysis==
+Cluster analysis has become a common means of grouping data in a wide variety of fields<ref>Everitt, Landau, Leese, Stahl, 2011, pp. 9-13.</ref>:
+* '''[[Marketing]]:''' Analyzing customers and sorting them into the right target groups can be an enormous [[competitive advantage]] [[in marketing]]. Cluster analyses are used here to identify similar customers from the entire [[customer]] base and to develop individual advertising strategies for these customers.
+* '''Medicine and psychology:''' Behavioral patterns or disease patterns can also be grouped into clusters. Suitable therapies can then be developed on this basis.
-{{a|Francesca Scattolin}}
+==Footnotes==
-[[Category:Economics]]
+<references />
+{{infobox5|list1={{i5link|a=[[Descriptive statistics]]}} &mdash; {{i5link|a=[[Mann-Whitney U test]]}} &mdash; {{i5link|a=[[Control limits]]}} &mdash; {{i5link|a=[[Systematic sampling techniques]]}} &mdash; {{i5link|a=[[Parametric analysis]]}} &mdash; {{i5link|a=[[Two-way ANOVA]]}} &mdash; {{i5link|a=[[Decision tree]]}} &mdash; {{i5link|a=[[CUSUM chart]]}} &mdash; {{i5link|a=[[Multiple regression analysis]]}} }}
+==References==
+* Aggarwal, C. C., Reddy, C. K. (2014). [https://people.cs.vt.edu/~reddy/papers/DCBOOK.pdf ''Data Clustering. Algorithms and Applications''], "Chapman & Hall".
+* Everitt, B. S., Landau, S., Leese, M., Stahl, D. (2011). [https://epdf.tips/cluster-analysis-fifth-edition-wiley-series-in-probability-and-statistics.html ''Cluster Analysis, 5th Edition''], "Wiley Series in Propability and Statistics".
+* Tian, Y., Xu, D. (2015). [https://link.springer.com/content/pdf/10.1007/s40745-015-0040-1.pdf ''A Comprehensive Survey of Clustering Algorithms''], "Annals of Data Science", 2(2), pp. 165-193.
+[[Category: Methods and techniques]]
+{{a|Max Bachmann}}
