Cluster analysis is a method of classification, aimed at grouping objects based on the similarity of their attributes. It is commonly used to group a series of samples based on multiple variables that have been measured from each sample. The procedure produces a tree-like diagram (a dendrogram) that illustrates the relationships between all the samples based on a defined measure of similarity.
There are many methods available for clustering (agglomerative, divisive, non hierarchical etc.). Here are some instructions for one of the more commonly used methods, agglomerative hierarchical clustering. This procedure involves a series of steps:
In this example, we will use cluster analysis to visualise differences in the composition of metal contaminants in the seaweeds of Sydney Harbour (data from (Roberts et al. 2008).). Download the data set, Harbour_metals.csv, and load into R.
Harbour_metals <- read.csv(file = "Harbour_metals.csv", header = TRUE)
These data have the concentrations of seven metals measured from 60 samples, half from the seaweed Padina crassa and half from the Sargassum linearifolium
The first two columns are categorical variables that group samples by site and seaweed species. The third column has unique labels for each replicate sample and the remaining columns are the metal concentrations.
The cluster analysis runs on the response variables only so we need to make a data frame with just the metal concentrations (columns 4-8).
Harbour_metals2 <- Harbour_metals[, 4:8]
To help interpret the graph, we can add the sample labels as row names in this data frame
rownames(Harbour_metals2) <- Harbour_metals$Rep
To perform the cluster analysis, we need to make a matrix that quantifies the similarity between each pair of samples. Here we will use the Euclidean distance as our similarity coefficient, but there are others to choose from (see below).
H_metals.sim <- dist(Harbour_metals2, method = "euclidean")
We then use hclust
function with an argument that specifies the linkage method (here we will use the single linkage method).
H_metals.cluster <- hclust(H_metals.sim, method = "single")
Finally, plot the object that was created by the hclust function.
plot(H_metals.cluster)
We can make this a bit neater with all the samples lined up along the bottom by converting it to a dendrogram object (the as.dendrogram
function) and plotting that (although this default plot is still a bit ugly and would need work with labels and axes before being ready for publication).
plot(as.dendrogram(H_metals.cluster), ylab = "Euclidean distance")
A dendrogram has a branch for each sample joined at nodes that relate to the value of the similarity coefficient that joined the two objects. Interpretation of all the relationships comes from examination of the branching structure (which objects join most closely with each other) and from the similarities at which they join. Objects that join together close the tips of the branches are more similar to each other than those that join further toward the base of the tree (note that the default dendogram in R is like an upside down tree with the branches at the bottom and trunk at the top).
Strong evidence for distinct groups would be evident if there were clusters where the samples within a cluster are much more similar to each other than samples in other clusters.
Before running a cluster analysis to create a dendrogram, you need to consider:
What measure of similarity to use. Dendograms can be created from any similarity matrix. There are many distance measures that can be used to describe the similarity between samples. The dist
fuction in R has "euclidean"
, "maximum"
, "manhattan"
, "canberra"
, "binary"
or "minkowski"
(specified in the method argument of dist
). Other measures are available in other packages (e.g., the Bray-Curtis measure which is recommended for analyses of species composition data is available in the package vegan).
Whether the data need transforming or standardising. If the variables are measured on very different scales, or if there are outliers, then the structure of the dendrogram will be strongly influenced by the values of the largest values in the data set. Variables can be transformed or standardised to lessen the influence of large values (i.e., treat all variables on a more equal footing).
What method will be used to create the dendrogram. The structure of dendrograms can also be sensitive to algorithm used to construct the tree (the linkage method). In the previous example, you used the single linkage method. The hclust
function in R has several available: "ward"
, "single"
, "complete"
, "average"
, "mcquitty"
, "median"
and "centroid"
.
Written. The interpretation of the dendrogram would be described in the text of a Results section (i.e, were there any obvious clusters of samples? were there any samples that were very different from the rest?). There are no numerical results to report.
Visual. Results from cluster analyses are communicated visually with the dendrogram. It is important to label the axis to show what measure of similarity was used in the analysis.
If there are pre-determined groups of samples (e.g., samples grouped by species in the example above) samples are usually labelled or provided with colour-coded symbols to allow patterns in the dendrogram to be more easily seen. With large data sets individual labels for each sample (as we did here) make for very plots.
With many samples in your dendrogram it is often necessary to label them in some way (e.g., by colour) to help see patterns. The package dendextend allows you to do this. Here is some code to produce the same dendrogram with the samples colour coded by location.
library(dendextend)
dend <- as.dendrogram(H_metals.cluster)
sample_colours <- as.numeric(Harbour_metals$Location)
sample_colours <- sample_colours[order.dendrogram(dend)]
labels_colors(dend) <- sample_colours
plot(dend, ylab = "Euclidean distance")
You can access R help for the main functions used here by typing ?hclust
, ?dist
or ?as.dendrogram
. There are many R packages associated with the many different types of cluster analysis. See a long list of possibly useful packages here.
Quinn, GP and MJ Keough (2002) Experimental design and data analysis for biologists. Cambridge University Press. Ch. 18. Multidimensional scaling and cluster analysis.
McKillup, S (2012) Statistics explained. An introductory guide for life scientists. Cambridge University Press. Ch. 22. Introductory concepts of multivariate analysis.
Author: Alistair Poore
Year: 2016
Last updated: Feb 2022