What is Clustering: Grouping Data in Machine Learning

Clustering is a popular technique for grouping similar data points together. It is an unsupervised learning method.

Therefore, we don’t need a labeled dataset, because the algorithm separates the data into groups on its own.

We can find these algorithms in many applications, including image recognition, image segmentation, natural language processing and more.

Clustering can also be very useful in combination with supervised learning methods such as classification and regression.

For example, we can first use clustering to divide a dataset into groups. After that, we build a classifier that makes use of those groups, for instance by treating the cluster of each data point as an extra feature, which can improve overall classification performance.

Types of Clustering

K-means

K-means is one of the most popular clustering techniques. It groups data points into k clusters, where k is a number that the user chooses in advance.

The algorithm works by iteratively updating cluster centroids, which are points that represent the center of each cluster. In each iteration, it assigns every data point to its nearest centroid and then moves each centroid to the mean of all data points assigned to it.

This process repeats until there are no more changes in cluster assignments. In other words, the algorithm tries to minimize the sum of squared distances between the data points and their respective cluster centroids, although it is only guaranteed to reach a local minimum.
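To make the assign-and-update loop described above concrete, here is a minimal sketch in plain Python. The function names (`k_means`, `within_cluster_ss`), the sample points, and the choice to start from k randomly picked data points are all assumptions made for illustration, not part of any particular library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Illustrative k-means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: initialise the centroids with k randomly chosen data points.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the cluster assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignments


def within_cluster_ss(points, centroids, assignments):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))


# Made-up sample data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments, within_cluster_ss(points, centroids, assignments))
```

Because the result depends on the starting centroids, practical implementations usually run the algorithm several times with different initialisations and keep the run with the lowest sum of squares.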

Hierarchical

Another popular type of clustering algorithm is hierarchical clustering. There are two different approaches to this method, and both build a hierarchy of clusters by either merging or splitting them.

The first approach is the agglomerative type, which is essentially a bottom-up approach. Here, we begin with each data point as its own cluster and repeatedly merge the closest clusters as we move up the hierarchy.

On the other hand, the second approach, the divisive type, works in the opposite direction. We begin at the top, where all data points belong to one cluster, and split it into more clusters the further down the hierarchy we go. In other words, this is a top-down approach.
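The bottom-up idea can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: it measures the distance between two clusters by their closest pair of points (known as single linkage, one of several common choices), it stops once a requested number of clusters remains, and the function names and sample points are made up for the example.

```python
def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters: their closest pair of points."""
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []  # records (left cluster, right cluster, distance) per merge
    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Made-up sample data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, merges = agglomerative(points, num_clusters=2)
print(clusters)
```

The list of merges is what a dendrogram would visualise. This naive version recomputes every pairwise distance on each pass, so it is only suitable for small datasets.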
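A top-down split is easiest to illustrate on one-dimensional data. The sketch below is a deliberate simplification, not a standard library routine: it assumes the values are plain numbers, always splits the cluster with the widest range, and cuts it at the largest gap between neighbouring values. The function names and sample values are hypothetical.

```python
def split_at_largest_gap(values):
    """Split a list of at least two numbers at its largest gap."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


def spread(values):
    """Range covered by a cluster of numbers."""
    return max(values) - min(values)


def divisive(values, num_clusters):
    """Start with one cluster and keep splitting until num_clusters exist."""
    clusters = [sorted(values)]  # all values begin in a single cluster
    while len(clusters) < num_clusters:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break  # nothing left that can be split further
        # Assumption: always split the cluster that covers the widest range.
        target = max(splittable, key=spread)
        clusters.remove(target)
        clusters.extend(split_at_largest_gap(target))
    return sorted(clusters)


# Made-up sample data with three visible groups.
values = [1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1]
groups = divisive(values, num_clusters=3)
print(groups)
```

For data with more than one dimension, the splitting step is usually replaced by something else, for example running k-means with k = 2 on the cluster being divided.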

Conclusion

In conclusion, clustering is a powerful technique that belongs to the unsupervised branch of machine learning. It’s also a versatile tool, which we can combine with other machine learning methods such as classification and regression.

I hope this article helped you gain a better understanding of clustering and perhaps even inspired you to learn more about it.