Hierarchical Clustering Algorithm
Hierarchical clustering is an unsupervised machine learning algorithm for grouping similar data points into clusters. It is called “hierarchical” because the algorithm builds a hierarchy of clusters in which smaller clusters are merged into bigger ones.
Types of Hierarchical Clustering
There are two types of hierarchical clustering algorithms.
- Agglomerative Hierarchical Clustering: It starts with individual data points as clusters and merges them successively to form bigger clusters.
- Divisive Hierarchical Clustering: It starts with all the data points in a single cluster and divides the cluster into smaller ones.
Here, we will be focusing on the agglomerative clustering algorithm.
Advantages of Hierarchical Clustering
Some of the advantages of hierarchical clustering are listed below.
- It does not require the number of clusters to be specified in advance.
- It can handle non-linearly separable data. That is, it can handle nonconvex (non-spherical) data.
- It provides a dendrogram, which helps visualize the hierarchy of clusters.
Distance Measures Used in Hierarchical Clustering
Hierarchical clustering does not require the distance measure to satisfy metric properties such as the triangle inequality, so a wide range of distance and dissimilarity measures can be used. Some common choices are explained below, followed by a short code sketch.
- Euclidean Distance: It is the most commonly used distance measure in hierarchical clustering. The Euclidean distance between two data points is the straight-line distance between them.
- Manhattan Distance: It is also known as the “taxi cab” distance, as it measures the distance between two points as if you had to follow the gridlines of a city.
- Cosine dissimilarity: Cosine similarity is a measure of similarity between two vectors based on the angle they form with the origin. The cosine dissimilarity is generally computed as (1.0 - cosine similarity).
- Jaccard dissimilarity: The Jaccard coefficient is a measure of similarity between two sets or two binary vectors. Jaccard dissimilarity is computed as (1.0 - Jaccard coefficient).
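As a quick illustration, here is a minimal sketch that computes each of these measures with SciPy. The example vectors are made up for demonstration; cityblock is SciPy's name for the Manhattan distance.

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print("Euclidean:", distance.euclidean(a, b))            # straight-line distance
print("Manhattan:", distance.cityblock(a, b))            # sum of absolute differences
print("Cosine dissimilarity:", distance.cosine(a, b))    # 1.0 - cosine similarity

# Jaccard dissimilarity is defined on sets or boolean vectors.
u = np.array([1, 1, 0, 0], dtype=bool)
v = np.array([1, 0, 1, 0], dtype=bool)
print("Jaccard dissimilarity:", distance.jaccard(u, v))  # 1.0 - Jaccard coefficient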
Pseudocode of Agglomerative Hierarchical Clustering
Here is the pseudocode for the Agglomerative Hierarchical Clustering algorithm, followed by a short Python sketch of the same steps.
Step 1. Initialize a distance matrix to store the distances between each pair of data points.
Step 2. Initialize a cluster list where each data point is a separate cluster.
Step 3. Initialize a cluster distance matrix where each row and column is a cluster.
Step 4. Repeat the following steps until there is only one cluster left:
    a. Find the two closest clusters in the cluster list and merge them into a single cluster.
    b. Update the cluster distance matrix to reflect the new cluster's distances to the other clusters.
    c. Remove the two clusters that were just merged and add the new single cluster to the cluster list.
Step 5. Return the final single cluster, which is the hierarchy of clusters.
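The sketch below mirrors these steps in plain Python/NumPy, assuming Euclidean distance and single linkage. The function name naive_agglomerative is illustrative, and the code favors readability over efficiency.

# A minimal sketch of the pseudocode above, assuming Euclidean distance
# and single linkage. Not an optimized implementation.
import numpy as np

def naive_agglomerative(points):
    # Steps 1-3: pairwise distance matrix; every point starts as its own cluster.
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    merges = []  # records (cluster_a, cluster_b, distance) at each merge

    # Step 4: repeat until only one cluster remains.
    while len(clusters) > 1:
        # a. Find the two closest clusters (single linkage: minimum
        #    distance between any pair of points in the two clusters).
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist[np.ix_(clusters[i], clusters[j])].min()
                if d < best[2]:
                    best = (i, j, d)
        i, j, d = best

        # b./c. Merge the two clusters and update the cluster list.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    # Step 5: the recorded merges describe the hierarchy of clusters.
    return merges

# Example usage with a tiny made-up dataset.
data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
for a, b, d in naive_agglomerative(data):
    print(f"merge {a} and {b} at distance {d:.2f}")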
Merging Two Clusters
In hierarchical clustering, the two closest clusters are merged at each iteration. Several linkage criteria can be used to compute the distance between two clusters; a short SciPy sketch follows the list below.
- Single Linkage: It defines the distance between two clusters as the minimum distance between any pair of data points belonging to the two clusters.
- Complete Linkage: It defines the distance between two clusters as the maximum distance between any pair of data points belonging to the two clusters.
- Average Linkage: It defines the distance between two clusters as the average distance between all pairs of data points belonging to the two clusters.
- Centroid Linkage: It defines the distance between two clusters as the distance between the centroids of the two clusters.
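For reference, SciPy's scipy.cluster.hierarchy.linkage function implements these criteria via its method argument. The sketch below uses a small made-up array X purely for illustration.

# Sketch: the linkage criteria above as implemented in SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

Z_single   = linkage(X, method="single")    # minimum pairwise distance
Z_complete = linkage(X, method="complete")  # maximum pairwise distance
Z_average  = linkage(X, method="average")   # mean pairwise distance
Z_centroid = linkage(X, method="centroid")  # distance between centroids (Euclidean only)

print(Z_single)  # each row: cluster 1, cluster 2, merge distance, new cluster size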
Dendrogram for Hierarchical Clustering Algorithms
A dendrogram is a tree-like diagram commonly used in hierarchical clustering to visualize the hierarchical relationships between data points. It shows how the data points are grouped into clusters at each level of the hierarchy and the distances at which the clusters are merged.
Each leaf node in the dendrogram represents a single data point, and the branches represent the merged clusters at each level. The height of each branch represents the distance between the two clusters being merged. A horizontal line can be drawn to cut the dendrogram at a particular height, which corresponds to a particular number of clusters.
The dendrogram can be helpful in determining the optimal number of clusters, as well as interpreting the relationships between the data points. For example, if the dendrogram shows clear cuts between clusters at different heights, it is a good indication that the data contains well-defined clusters. On the other hand, if the dendrogram shows a gradual merging of clusters, it may indicate that the data contains more complex relationships between the data points.
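As a concrete sketch, SciPy's dendrogram and fcluster functions can draw the diagram and extract the clusters implied by a cut. The dataset and the cut height of 2.0 below are made up for illustration.

# Sketch: drawing a dendrogram and "cutting" it at a chosen height with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])
Z = linkage(X, method="average")

dendrogram(Z)                        # leaves are data points, branch heights are merge distances
plt.axhline(y=2.0, linestyle="--")   # horizontal cut line at height 2.0
plt.show()

labels = fcluster(Z, t=2.0, criterion="distance")  # cluster labels implied by the cut
print(labels)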
In addition to its use in hierarchical clustering, dendrograms are also used in other areas of data analysis, such as gene expression analysis, where they can be used to visualize the relationships between genes based on their expression patterns.
Limitations
Like any other clustering algorithm, hierarchical clustering has its limitations, such as its sensitivity to noise and outliers. Using a different linkage method and pre-processing the data to remove outliers are possible solutions.
Its runtime complexity is also relatively high, as discussed in the next section.
Runtime Complexity
The time complexity of the agglomerative hierarchical clustering algorithm can be expressed as O(n^3), where n is the number of data points. The reason for this high computational cost is that at each merge step, the pairwise distances between the remaining clusters need to be computed and compared to determine the closest two clusters to merge. The time complexity of divisive hierarchical clustering is O(kn^2) if the hierarchy is constructed from one cluster down to k clusters.
This means that hierarchical clustering can become very slow for large datasets, especially when using a linkage method such as complete linkage or average linkage, which requires computing the distance between all pairs of data points in each cluster. Various approximations and optimizations have been proposed to reduce the computational cost, such as using a k-d tree to speed up the distance calculations or using a greedy algorithm to perform the merging.
Despite its high computational cost, hierarchical clustering remains a popular algorithm due to its ability to capture hierarchical relationships between data points and its flexibility regarding the number of clusters. In practice, hierarchical clustering is often used in combination with other techniques, such as dimensionality reduction or subsampling, to make it more computationally feasible for large datasets.
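One common way to keep the computation feasible is to cluster a random subsample, as mentioned above. The sketch below shows this with scikit-learn's AgglomerativeClustering; the dataset X_large, the sample size, and the number of clusters are made up for illustration.

# Sketch: subsampling before hierarchical clustering to reduce runtime.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X_large = rng.normal(size=(100_000, 10))        # stand-in for a large dataset
sample_idx = rng.choice(len(X_large), size=2_000, replace=False)
X_sample = X_large[sample_idx]

model = AgglomerativeClustering(n_clusters=5, linkage="average")
labels = model.fit_predict(X_sample)            # cluster only the subsample
print(np.bincount(labels))                      # cluster sizes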