Over the last couple of articles, We learned different classification and regression algorithms. Now in this article, We are going to learn entirely another type of algorithm. Which falls into the unsupervised learning algorithms.
If you were not aware of unsupervised learning algorithms, all the machine learning algorithms mainly classified into two main categories.
 Supervised learning algorithms
 Unsupervised learning algorithms
All the classification and regression algorithms belong to supervised learning algorithms. The other set of algorithms which fall under unsupervised learning algorithms are clustering algorithms.
In fact, the foremost algorithms to study in unsupervised learning algorithms is clustering analysis algorithms. Today we are going to learn an algorithm to perform the cluster analysis.
We have a decent number of algorithms to perform cluster analysis; In this article, we will be learning how to perform the clustering with the Hierarchical clustering algorithm.
Before we drive further. Let’s have a look at the table of contents.
Table of contents:
 What is clustering analysis?
 Clustering analysis example
 Hierarchical clustering
 Dendrogram
 Agglomerative clustering
 Divisive clustering
 Clustering linkage comparison
 Implementing hierarchical clustering in R programming language
 Data preparation
 Packages need to perform hierarchical clustering
 Visualizing clustering in 3d view
 Complete code
 Summary
 Related courses
 Exploratory data analysis in r
 Clustering analysis in Python
 Machine learning clustering and information retrieval
What is clustering analysis?
Clustering the name itself has a deep meaning about the ongoing process which happens in the cluster analysis. The fundamental problem clustering address is to divide the data into meaningful groups (clusters).
When we say, meaningful groups, the meaningfulness purely depends on the intention behind the purpose of the group’s formations.
Suppose if we intend to cluster the search results for a particular keyword. We Intended to find all the search results which were meaningfully similar to the search keyword.
If we intend to cluster the search results for a particular location, then we need to group the search results belongs to one specific place.
The identified cluster elements within the same cluster should be similar to each other when compared to the other cluster elements.
Suppose we have 100 articles and we want to group them into different categories. Let’s consider the below categories.
 Sports articles
 Business articles
 Entertainment articles
When we group all the 100 articles into the above 3 categories. All the articles belong to the sports category will be same, In the sense, the content in the sports articles belongs to sports category.
When you pick an article from sports category and the other article from business articles. Contentwise they will be completely different. This summarises the rule of thumb condition to form clusters.
All the elements in the same cluster should be similar and elements of the different cluster should not be similar.
Now let’s have a look at one clustering analysis example to understand it better.
Clustering analysis example
Suppose from the above example of clustering gender based on hair length. You can notice all the female cluster elements are so close to each other, Whereas the male cluster elements are far from the female cluster elements.
Here it’s easy to say the clusters in the same cluster (female or male clusters ) elements are so close together by visualizing the plot. We use different clustering algorithms to create clusters by following the thumb rule all the elements in the same cluster has to be close together.
To calculate this closeness we use different similarity measures. These similarity measures determine whether the given point is closer by giving the similarity score.
Hierarchical clustering
Hierarchical clustering is an alternative approach to kmeans clustering for identifying groups in the dataset and does not require to prespecify the number of clusters to generate.
It refers to a set of clustering algorithms that build treelike clusters by successively splitting or merging them. This hierarchical structure is represented using a tree.
Hierarchical clustering methods use a distance similarity measure to combine or split clusters. The recursive process continues until there is only one cluster left or we cannot split more clusters. We can use a dendrogram to represent the hierarchy of clusters.
Dendrogram
A dendrogram is a treelike structure frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.
Hierarchical classifications produced by either
 Agglomerative
 Divisive
The agglomerative or divisive route may be represented by a twodimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each stage of the analysis. Agglomerative clustering usually yields a higher number of clusters, with fewer leaf nodes in the cluster.
In a hierarchical classification, the data are not partitioned into a particular number of classes or clusters at a single step. Instead, the classification consists of a series of partitions, which may run from a single cluster containing all individuals to n clusters each containing a single individual.
Hierarchical clustering algorithms can be either bottomup or topdown.
Agglomerative clustering
Agglomerative clustering is Bottomup technique start by considering each data point as its own cluster and merging them together into larger groups from the bottom up into a single giant cluster.
Divisive clustering
Divisive clustering is the opposite, it starts with one cluster, which is then divided in two as a function of the similarities or distances in the data. These new clusters are then divided, and so on until each case is a cluster.
Clustering linkage comparison
In this article, we describe the bottomup approach in the detailed manner i.e. agglomerative algorithm
The necessary steps of an agglomerative algorithm are (diagrammed visually in the figure below):
 Start with each point in its own cluster.
 Compare each pair of data points using a distance metric. This could be any of the methods discussed above.
 Use a linkage criterion to merge data points (at the first stage) or clusters (in subsequent phases), where the linkage is represented by a function such as:
 Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
 Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
 Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
 Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
 Ward’s minimum variance method: It minimizes the total withincluster variance. At each step, the pair of clusters with minimum betweencluster distance are merged.
Implementing hierarchical clustering in R programming language
Data Preparation
To perform a cluster analysis in R, generally, the data should be prepared as follows:
 Rows are observations (individuals) and columns are variables
 Any missing value in the data must be removed or estimated.
 The data must be standardized (i.e., scaled) to make variables comparable. Recall that, standardization consists of transforming the variables such that they have mean zero and standard deviation one.
Here, we’ll use the builtin R dataset iris, which contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
1

df <– iris

To remove any missing value that might be present in the data, type this:
1

df <– na.omit(df)

As we don’t want the clustering algorithm to depend to an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale:
1

df <– scale(df)

Packages need to perform hierarchical clustering
 hclust [in stats package]
 agnes [in cluster package]
We can perform agglomerative HC with hclust. First, we compute the dissimilarity values with dist and then feed these values into hclust and specify the agglomeration method to be used (i.e. “complete”, “average”, “single”, “ward.D”). We can plot the dendrogram after this.
1
2
3
4
5
6
7
8

# Dissimilarity matrix
d <– dist(df, method = “euclidean”)
# Hierarchical clustering using Complete Linkage
hc1 <– hclust(d, method = “complete” )
# Plot the obtained dendrogram
plot(hc1)

Alternatively, we can use the agnes function. These functions behave very similarly; however, with the agnes function, we can also get the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure).
1
2

# Compute with agnes
hc2 <– agnes(df, method = “complete”)

This allows us to find certain hierarchical clustering methods that can identify stronger clustering structures. Here we see that Ward’s method identifies the strongest clustering structure of the four methods assessed.
Visualizing clustering in 3d view
Let’s examine, this time visually. How this algorithm proceeds using a simple dataset.
As visual representations are limited to three dimensions, we will only use three attributes, but the computation is similar if we use more attributes. We will display these using the scatterplot3d() function of the plot3D package.
which we will install and load after creating the attributes. We then examine the clustering solution provided by hclust() in order to assess whether it confirms the impressions we get from visual inspection.
1
2
3
4
5
6
7
8
9
10

A1 = c(2,3,5,7,8,10,20,21,23)
A2 = A1
A3 = A1
install.packages(“scatterplot3d”)
library(scatterplot3d)
scatterplot3d(A1,A2,A3, angle = 25, type = “h”)
demo = hclust(dist(cbind(A1,A2,A3)))
plot(demo)

In the left panel of the above plot, there are three groups of two points that are very close to each other. Another point is quite close to each of these groups of two.
Consider that the groups of two constitute a group of three with the points that lie closest to them. Finally, the two groups on the left are closer to each other than they are to the group of three on the right.
If we have a look at the dendrogram, we can see that the very same pattern is visible.
Complete Code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

df <– iris
df <– na.omit(df)
df <– scale(df)
d <– dist(df, method = “euclidean”)
# Hierarchical clustering using Complete Linkage
hc1 <– hclust(d, method = “complete” )
# Plot the obtained dendrogram
plot(hc1)
# Compute with agnes
hc2 <– agnes(df, method = “complete”)
A1 = c(2,3,5,7,8,10,20,21,23)
A2 = A1
A3 = A1
install.packages(“scatterplot3d”)
library(scatterplot3d)
scatterplot3d(A1,A2,A3, angle = 25, type = “h”)
demo = hclust(dist(cbind(A1,A2,A3)))
plot(demo)

You can clone complete codes of dataaspirant from our GitHub account
Summary
Hierarchical methods form the backbone of cluster analysis. The need for hierarchical clustering naturally emerges in domains where it is not only required to discover similaritybased groups but also need to organize them.
This is a valuable capability wherever the complexity of similarity patterns goes beyond the limited representation power of flat clustering models.
Hierarchical clustering tends to be presented as a form of descriptive rather than predictive modeling.