Data Science Workshop 3 (Part 2): Choosing the number of clusters
Today’s video discusses a way to find the optimal number of clusters when no benchmark (ground-truth) data is available. The question is: given a dataset and no supervision, how can we tell which number of clusters, k, gives the best results?
The YouTube Video
The YouTube video is here.
Dataset
We applied the k-means clustering algorithm to the Pecan Yield Data.
Clustering evaluation technique used in the video
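Using a score called the silhouette coefficient, we evaluated the clustering results for different values of k to find the optimal number of clusters. The silhouette coefficient ranges from -1 to 1, and higher values indicate tighter, better-separated clusters.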
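As a rough illustration of the idea (not the notebook's actual code), the sketch below runs k-means for several values of k and keeps the k with the highest silhouette coefficient. It assumes scikit-learn is installed and uses synthetic data from make_blobs as a stand-in for the Pecan Yield Data; the variable names scores and best_k and the range of k values are choices made for this example only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data (an assumption for this sketch):
# four tight, well-separated groups of 2-D points.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 9):  # the silhouette coefficient needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all points for this value of k.
    scores[k] = silhouette_score(X, labels)

# The candidate k with the highest mean silhouette coefficient.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

On this synthetic data the highest score should fall at k = 4, the number of groups that were generated. On real data the peak is often less sharp, so it helps to look at the whole list of scores rather than only the maximum.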
There are many other ways to evaluate clusters. Whatever evaluation metric you use, it is a good idea to also inspect data points from different clusters to understand why they ended up in different clusters. Clustering gives us an initial idea about a dataset, and it is often used for exploratory data analysis.
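One possible way to look at data points from different clusters is to attach the cluster labels to a DataFrame and then summarize and sample each cluster. The sketch below assumes pandas and scikit-learn are available; the data is synthetic, and the column names feature_a and feature_b are hypothetical placeholders, not columns from the Pecan Yield Data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with hypothetical column names.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.6,
    random_state=1,
)
df = pd.DataFrame(X, columns=["feature_a", "feature_b"])

# Attach the k-means cluster label to every row.
model = KMeans(n_clusters=3, n_init=10, random_state=1)
df["cluster"] = model.fit_predict(df[["feature_a", "feature_b"]])

# Per-cluster size and feature means: a quick summary of what separates the clusters.
summary = df.groupby("cluster").agg(
    size=("feature_a", "size"),
    mean_a=("feature_a", "mean"),
    mean_b=("feature_b", "mean"),
)
print(summary)

# The first few raw rows of each cluster, for eyeballing individual data points.
examples = df.groupby("cluster").head(3).sort_values("cluster")
print(examples)
```

Comparing the per-cluster means with a handful of raw rows is a simple check on whether the clusters differ in a way that makes sense for the dataset.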
The following link describes the silhouette score (or coefficient) along with many other clustering evaluation techniques: https://computing4all.com/courses/introductory-data-science/lessons/evaluation-of-clustering-results/
The Notebook Code
You can download the notebook file from this zip file. After extracting the file, open it with Jupyter Notebook or Jupyter Lab. Keep the Pecan.csv file and the notebook file in the same directory because the read_csv function in the notebook assumes that the Pecan.csv file is in the current directory.
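If read_csv fails with a file-not-found error, a small check like the one below can make the cause easier to see. This is a sketch and not part of the notebook: the helper name load_csv_from_cwd is hypothetical, and it assumes pandas is installed and that the notebook's working directory is the folder the notebook was opened from, which is the usual Jupyter behavior.

```python
from pathlib import Path

import pandas as pd


def load_csv_from_cwd(filename):
    """Read a CSV file that is expected to sit in the current working directory."""
    path = Path.cwd() / filename
    if not path.is_file():
        # Fail early with a message that says where the file was expected.
        raise FileNotFoundError(
            f"{filename} was not found in {Path.cwd()}; "
            "place it in the same directory as the notebook."
        )
    return pd.read_csv(path)


# Example call, left commented out because it requires Pecan.csv to be present:
# df = load_csv_from_cwd("Pecan.csv")
# print(df.head())
```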
The notebook code is as follows.