Data Science Workshop 3 (Part 2): Choosing the number of clusters

Today’s video discusses a way to find the optimal number of clusters when no benchmark (ground-truth) data is available. The question is: given a dataset and no supervision, how can we determine which number of clusters, k, gives the best results?

Contents

1 The YouTube Video
2 Dataset
3 Clustering evaluation technique used in the video
4 The Notebook Code

The YouTube Video

The YouTube video is here.

Dataset

We applied the k-means clustering algorithm to the Pecan Yield Data.

Clustering evaluation technique used in the video

We evaluated the clustering results with a score called the silhouette coefficient and used it to find the optimal number of clusters.
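For context, the silhouette coefficient of a single point is (b - a) / max(a, b), where a is the mean distance from the point to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The score ranges from -1 to 1, and higher is better. The following is a minimal sketch of how this score can be used to pick k, assuming scikit-learn is installed. It is not the notebook from the video: the data here is synthetic (four artificial blobs whose centers are chosen purely for illustration), and the range of k values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (NOT the Pecan data): four well-separated blobs.
# The centers below are hypothetical and chosen only for illustration.
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Run k-means for each candidate k and record the mean silhouette score.
# The silhouette score needs at least 2 clusters, so k starts at 2.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette score is taken as the best k.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

On real data the scores for neighboring values of k can be close, so it is worth reading the whole list of scores rather than only the maximum.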

There are many other ways to evaluate clusters. Whichever evaluation metric you use, it is always a good idea to inspect data points from different clusters and check why they were placed in different clusters. Clustering helps us get an initial idea about a dataset, and it is often used for exploratory data analysis.
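One simple way to look at data points from different clusters is to attach the cluster labels to the table and compare the groups. The sketch below assumes pandas and scikit-learn are available. The table and its column names (feature_a, feature_b) are made up for illustration and are not taken from the Pecan data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy table; the column names are invented for this example.
df = pd.DataFrame({
    "feature_a": [1.0, 1.2, 0.9, 8.0, 8.3, 7.9],
    "feature_b": [2.0, 1.8, 2.1, 9.0, 9.2, 8.8],
})
feature_columns = ["feature_a", "feature_b"]

# Cluster on the feature columns and store the label of each row.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(df[feature_columns])

# Per-cluster size and per-cluster mean of each feature.
cluster_sizes = df["cluster"].value_counts().sort_index()
cluster_means = df.groupby("cluster")[feature_columns].mean()
print(cluster_sizes)
print(cluster_means)

# Show a few raw rows from each cluster to see what the members look like.
for cluster_id, rows in df.groupby("cluster"):
    print(f"cluster {cluster_id}")
    print(rows.head(3))
```

Comparing the per-cluster means against a handful of raw rows usually makes it clear which features separate the groups.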

The following link describes the silhouette score (or coefficient) along with many other clustering evaluation techniques: https://computing4all.com/courses/introductory-data-science/lessons/evaluation-of-clustering-results/

The Notebook Code

You can download the notebook file from this zip file. After extracting the file, open it with Jupyter Notebook or Jupyter Lab. Keep the Pecan.csv file and the notebook file in the same directory because the read_csv function in the notebook assumes that the Pecan.csv file is in the current directory.
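To illustrate the point about the current directory, here is a small sketch of loading the file with a relative path, assuming pandas is installed. The helper name load_numeric_features is hypothetical, and keeping only the numeric columns is an assumption made so that the result could be passed to k-means. The actual columns of Pecan.csv are not listed here, so the notebook may select features differently.

```python
from pathlib import Path

import pandas as pd


def load_numeric_features(csv_path="Pecan.csv"):
    """Read a CSV and keep only complete numeric rows (hypothetical helper)."""
    path = Path(csv_path)
    # A relative path such as "Pecan.csv" is resolved against the current
    # working directory, which is why the CSV must sit next to the notebook.
    if not path.exists():
        raise FileNotFoundError(f"{path} not found in {Path.cwd()}")
    df = pd.read_csv(path)
    # Assumption: only numeric columns are used, and rows with gaps are dropped.
    return df.select_dtypes(include="number").dropna()


# Runs only when Pecan.csv is actually present in the current directory.
if Path("Pecan.csv").exists():
    features = load_numeric_features()
    print(features.shape)
```

If the file lives elsewhere, passing its full path to the helper avoids the dependency on the current directory.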

The notebook code is as follows.
