
What is the role of clustering algorithms in consolidating the underlying construction of AI applications?

Author: QINSUN | Released: 2024-01

Cluster analysis, also known as clustering, is a statistical method for studying the classification of samples or indicators, and an important algorithm in data mining.

A cluster is composed of several patterns; a pattern is typically a vector of measurements, or a point in a multidimensional space.

Clustering is based on similarity: patterns within the same cluster are more similar to one another than to patterns in different clusters.
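This similarity principle can be illustrated with a toy computation, assuming Euclidean distance as the (dis)similarity measure; the point coordinates below are made up for illustration:

```python
import math

def euclidean(a, b):
    # straight-line (Euclidean) distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# two tight groups of 2-D points (illustrative values)
cluster_a = [(0.0, 0.0), (0.2, 0.1)]
cluster_b = [(5.0, 5.0), (5.1, 4.9)]

within = euclidean(*cluster_a)                    # distance inside a cluster
between = euclidean(cluster_a[0], cluster_b[0])   # distance across clusters

print(within < between)  # within-cluster distance is smaller
```

A good clustering makes the first number small relative to the second for as many point pairs as possible.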

In business, clustering can help market analysts distinguish different consumer groups in a customer database and summarize the consumption patterns or habits of each group. As a data mining module, it can serve as a standalone tool to discover information hidden deep in a database, to summarize the characteristics of each class, or to focus attention on a particular class for further analysis. Cluster analysis can also serve as a preprocessing step for other data mining algorithms.

Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

Requirements for clustering

1. Scalability

Many clustering algorithms work well on small datasets with fewer than about 200 data objects; however, a large database may contain millions of objects, and clustering on a sample of such a dataset may lead to biased results.

2. Handling different attribute types

Many algorithms are designed to cluster numerical data. However, applications may require clustering other data types, such as binary, categorical (nominal), ordinal, or mixtures of these types.

3. Clusters of arbitrary shape

Many clustering algorithms determine clusters based on Euclidean or Manhattan distance metrics. Algorithms based on such metrics tend to discover spherical clusters of similar size and density. However, a cluster can have any shape, so it is important to develop algorithms that can discover clusters of arbitrary shape.

4. Minimal domain knowledge for input parameters

Many clustering algorithms require users to supply certain parameters, such as the number of clusters to generate. The clustering results are often highly sensitive to these parameters, which are hard to determine, especially for datasets of high-dimensional objects. This burdens users and makes the quality of clustering difficult to control.

5. Dealing with "noise"

The vast majority of real-world databases contain outliers or missing or erroneous data. Some clustering algorithms are sensitive to such data and may produce low-quality results.

6. Insensitivity to record order

Some clustering algorithms are sensitive to the order of the input data: the same dataset presented to the same algorithm in different orders may produce significantly different clustering results. Developing algorithms that are insensitive to input order is therefore important.

7. High dimensionality

A database or data warehouse may contain many dimensions or attributes. Many clustering algorithms are only good at handling low-dimensional data, involving perhaps two or three dimensions, where the human eye can judge clustering quality effectively. Clustering data objects in high-dimensional space is very challenging, especially since such data can be sparsely distributed and highly skewed.

8. Constraint-based clustering

Real-world applications may require clustering under various constraints. Suppose your job is to choose locations for a given number of ATMs in a city: to decide, you might cluster residential areas while taking into account the city's rivers and road networks, the customer requirements of each region, and so on. Finding groups of data that satisfy specific constraints and exhibit good clustering quality is a challenging task.

9. Interpretability and usability

Users hope that the clustering results are interpretable, understandable, and usable. That is to say, clustering may need to be associated with specific semantic interpretations and applications. How application objectives affect the selection of clustering methods is also an important research topic.

With these requirements in mind, we study cluster analysis in the following steps. First, we learn about the different data types and how they affect clustering methods. Next, a general classification of clustering methods is presented. We then discuss each family in detail, including partitioning, hierarchical, density-based, grid-based, and model-based methods. Finally, we explore clustering in high-dimensional space and outlier analysis.

Algorithm classification

It is difficult to propose a concise classification of clustering methods because the categories may overlap, so one method may have features of several categories. Nevertheless, a relatively organized description of the different clustering methods is still useful. The main computational approaches to cluster analysis are as follows:

1. Partitioning methods

Given a dataset of N tuples or records, a partitioning method constructs K groups (K ≤ N), each representing a cluster, such that: (1) each group contains at least one data record; and (2) each record belongs to exactly one group (a requirement that can be relaxed in some fuzzy clustering algorithms).

For a given K, the algorithm first produces an initial grouping and then improves it through repeated iterations, so that each new grouping scheme is better than the previous one. The criterion of "good" is that records in the same group should be as close as possible, while records in different groups should be as far apart as possible.

Most partitioning methods are distance-based. Given the number of partitions k to construct, a partitioning method first creates an initial partition and then uses an iterative relocation technique, moving objects from one group to another. The general criterion of a good partition is that objects in the same cluster should be as close or as related as possible, while objects in different clusters should be as far apart or as different as possible. Many other criteria for judging partition quality exist. Traditional partitioning methods can also be extended to subspace clustering, which searches subspaces rather than the entire data space; this is useful when there are many attributes and the data is sparse.

Achieving global optimality in partition-based clustering would require exhaustively enumerating all possible partitions, which is computationally prohibitive. In practice, most applications adopt popular heuristics such as the k-means and k-medoids algorithms, which iteratively improve clustering quality and approach a local optimum. These heuristics work well for discovering spherical clusters in small and medium-sized databases. Discovering clusters of complex shape, or clustering very large datasets, requires further extensions of the partitioning approach.

Algorithms based on this idea include the k-means, k-medoids, and CLARANS algorithms.
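The assign/update iteration described above can be sketched in plain Python as a minimal k-means (Lloyd's algorithm); the dataset, k, and iteration count below are illustrative, not part of any standard library API:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # pick k initial centers at random, then alternate assignment and update steps
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            groups[i].append(p)
        # update step: each center moves to the mean of its group
        for i, g in enumerate(groups):
            if g:
                centers[i] = tuple(sum(c) / len(g) for c in zip(*g))
    return centers, groups

# two well-separated blobs of three points each
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, groups = kmeans(pts, k=2)
```

Note the sensitivity to initialization mentioned earlier: a different `seed` can change which local optimum the iteration settles into, which is why k-means is often run several times with different starts.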

2. Hierarchical methods

Hierarchical methods perform a hierarchical decomposition of a given dataset until some condition is met. They can be further divided into "bottom-up" (agglomerative) and "top-down" (divisive) schemes.

For example, in the bottom-up scheme each data record initially forms its own group; in subsequent iterations, adjacent groups are merged until all records form a single group or some condition is satisfied.

Hierarchical clustering can be distance-based, density-based, or connectivity-based, and some extensions also consider subspace clustering. The drawback of hierarchical methods is that once a step (a merge or a split) is completed, it cannot be undone. This rigidity is useful because it avoids considering a combinatorial number of choices and keeps computational cost low, but it also means that erroneous decisions cannot be corrected. Several methods have been proposed to improve the quality of hierarchical clustering.

Representative algorithms include BIRCH, CURE, and CHAMELEON.
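A minimal bottom-up sketch, assuming single linkage (the distance between two clusters is the distance between their closest members) and a target cluster count `k`; the data values are made up, and note how each merge is final, exactly the irrevocability discussed above:

```python
def single_linkage(points, k):
    # start with each point as its own cluster ("bottom-up")
    clusters = [[p] for p in points]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(clusters) > k:
        # find the two clusters whose closest members are nearest
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # merge the pair; merges are never undone
        clusters[i].extend(clusters.pop(j))
    return clusters

merged = single_linkage([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
```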

3. Density-based methods

Density-based methods differ fundamentally from the others: they are based not on distances but on density. This overcomes the limitation of distance-based algorithms, which tend to discover only roughly spherical clusters.

The guiding principle is to keep growing a cluster as long as the density of points in a region (the number of objects in a neighborhood) exceeds a certain threshold.

Representative algorithms include DBSCAN, OPTICS, and DENCLUE.
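The density principle can be sketched as a compact DBSCAN-style routine: a point with at least `min_pts` neighbors within radius `eps` is a core point, and clusters grow by density-reachability. This is an illustrative simplification of DBSCAN, not a production implementation, and the parameter values below are made up:

```python
def dbscan(points, eps, min_pts):
    # label each point with a cluster id, or -1 for noise
    labels = {p: None for p in points}

    def neighbors(p):
        return [q for q in points
                if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]

    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbors(p)
        if len(seeds) < min_pts:      # not dense enough: tentatively noise
            labels[p] = -1
            continue
        labels[p] = cid               # p is a core point: start a new cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:       # border point reclaimed from noise
                labels[q] = cid
            if labels[q] is not None:
                continue
            labels[q] = cid
            nq = neighbors(q)
            if len(nq) >= min_pts:    # q is also a core point: keep expanding
                queue.extend(nq)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Because growth follows density rather than distance to a center, chains of dense points form a single cluster regardless of its overall shape, and the isolated point ends up labeled as noise.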

4. Graph-theoretic methods

The first step of a graph-theoretic clustering method is to build a graph suited to the problem: the nodes of the graph correspond to the smallest units of the analyzed data, and the edges (or arcs) correspond to similarity measurements between those units. Every pair of units thus has a metric expression, which preserves the local characteristics of the data. Graph-theoretic clustering uses the local connectivity of the sample data as its main source of information, and its main advantage is that local data characteristics are easy to handle.
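One simple instance of this idea, sketched here under the assumption that similarity is thresholded distance: connect every pair of points closer than `threshold`, then take the connected components of the resulting graph as clusters (the data and threshold are illustrative):

```python
from collections import deque

def graph_clusters(points, threshold):
    # build the graph: an edge joins points closer than `threshold`
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 <= threshold ** 2:
                adj[i].append(j)
                adj[j].append(i)
    # clusters are the connected components, found by BFS
    seen, comps = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, queue = [], deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            comp.append(points[u])
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

comps = graph_clusters([(0, 0), (0, 1), (4, 4), (4, 5)], threshold=1.5)
```

Only local edges drive the grouping, which is exactly the "local connectivity as information source" property described above.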

5. Grid-based methods

Grid-based methods first divide the data space into a finite number of grid cells, and all processing then operates on cells rather than individual records. A prominent advantage of this approach is its fast processing speed, which is usually independent of the number of records in the target database and depends only on how many cells the data space is divided into.

Representative algorithms include STING, CLIQUE, and WaveCluster.
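A toy sketch of the grid idea for 2-D data: bin points into square cells, keep cells holding at least `density` points, and merge adjacent dense cells into clusters. The cell size and density threshold below are illustrative assumptions, and the main loops run over cells, not records:

```python
from collections import defaultdict, deque

def grid_clusters(points, cell, density):
    # 1. bin each point into a grid cell of side length `cell`
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell) for c in p)
        cells[key].append(p)
    # 2. keep only dense cells, then merge neighbouring dense cells
    dense = {k for k, members in cells.items() if len(members) >= density}
    seen, clusters = set(), []
    for start in dense:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            k = queue.popleft()
            group.extend(cells[k])
            # visit the 8 surrounding cells (2-D case only)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (k[0] + dx, k[1] + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters

pts = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
       (5.1, 5.1), (5.2, 5.3), (5.3, 5.2)]
clusters = grid_clusters(pts, cell=1.0, density=2)
```

After binning, the work is proportional to the number of cells, which is why grid methods scale with the grid resolution rather than the record count.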

6. Model-based methods

Model-based methods assume a model for each cluster and then search for data that fits the model well. Such a model might be a density distribution function of the data points in space. An underlying assumption is that the target dataset is generated by a mixture of probability distributions.

Two directions are usually tried: statistical approaches and neural-network approaches.