Cluster analysis
Cluster analysis is an exploratory procedure for dividing a data set into groups based on similarity. Various criteria and characteristics can be used to determine how similar the individual data points are. Cluster analysis is based on the calculation of a similarity measure and belongs to the unsupervised machine learning methods. There are numerous algorithms for dividing data into clusters. Which method is most suitable generally depends on the question being asked. Often, the results of different methods are compared at the end to determine which one fits the data best.
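To illustrate how a similarity measure drives the grouping, here is a minimal sketch in Python. It uses k-means with Euclidean distance, which is only one of the many possible algorithms and is chosen here purely as an example. The function names, the sample points and the hand-picked starting centroids are all assumptions made for this illustration.

```python
import math


def euclidean(a, b):
    """Distance measure: a smaller distance means greater similarity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, centroids, max_iter=100):
    """Illustrative k-means: assign points to the nearest centroid, then update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical sample data: two visibly separated groups of 2D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)
```

With these well-separated sample points, the first three points end up in one cluster and the last three in the other. On real data the result depends on the starting centroids and on the chosen number of clusters, which is one reason why the results of several methods are usually compared.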
Prerequisites for cluster analysis
A cluster should be as homogeneous as possible internally and clearly distinguishable from other clusters. To ensure this clear demarcation, the following conditions should be met:
- Size of the data set: Under certain circumstances, a meaningful result can only be achieved with a sufficiently large data set. Depending on the task, it is therefore necessary to weigh up whether the amount of data is sufficient.
- Normalization of the data: If the value ranges of the variables differ greatly, the data should be normalized beforehand so that no single variable dominates the similarity measure.
- Elimination of outliers: Outliers can strongly distort the results. The data should therefore first be examined for extreme values, and any outliers should then be eliminated.
- Bias: If there are strong correlations between the variables, the information they share is effectively counted several times, and the results can end up heavily biased. This must be avoided.
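The checks in the list above can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not a fixed recipe. Min-max scaling, the z-score rule with a cutoff of 2.5, and the Pearson coefficient are all choices assumed for this example, and the variable names and sample values are made up.

```python
import math
import statistics


def min_max_normalize(values):
    """Rescale values to the range [0, 1] so that value ranges are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def outlier_indices(values, threshold):
    """Return indices whose absolute z-score exceeds the given threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


def pearson(xs, ys):
    """Pearson correlation coefficient between two variables."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one extreme value at the end of the list.
incomes = [30, 32, 31, 29, 33, 30, 31, 32, 30, 500]
suspects = outlier_indices(incomes, threshold=2.5)  # assumed cutoff

scaled = min_max_normalize([10, 20, 30])

# Two perfectly correlated variables carry the same information twice.
r = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(suspects, scaled, r)
```

Whether a flagged value is really an error or a legitimate extreme case, and how high a correlation is still acceptable, remain judgment calls that depend on the task. Note also that in small samples a z-score can never become very large, so the cutoff has to be chosen with the sample size in mind.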