Heatmaps are a useful method to explore large multivariate data sets. Response variables (e.g., abundances) are visualised using colour gradients or colour schemes. With the right transformation, and row and column clustering, interesting patterns within the data can be seen. They can also be used to show the results after statistical analysis, for example, to show those variables that differ between treatment groups.
In this tutorial, we will use heatmaps to visualise patterns in the bacterial communities found within marine habitats in the presence of two macrophytes (seagrass and Caulerpa) at two densities (sparse and dense). There are also samples from unvegetated sediment (Other). There are three replicate samples in each group.
We will use the package pheatmap (pretty heatmaps) to draw our heatmaps. The base package of R can draw heatmaps as well, but is somewhat limited. First, install the package and load into R. We will also need the package dplyr for selecting rows and columns.
Reading in the data
With multivariate data, we often have two data frames with 1) the counts per sample and 2) the factors that group samples. Download these two data files, Bac.counts.csv and Bac.factors.csv, and import into R.
Bac.counts <- read.csv(file = "Bac.counts.csv", header = TRUE, row.names=1) Bac.factors <- read.csv(file = "Bac.factors.csv", header = TRUE, row.names=1)
row.names=1 argument assigns the first column of the spreadsheet as row names in the data frame. We should check the data structure of the counts using the
head command. Given there are many columns, we’ll only check the first 10 using indexing (use of
[,] after the object, with row numbers before the comma and column numbers after).
DC1 DC2 DC3 DS1 DS2 DS3 SC1 SC2 SC3 SS1 Otu00002 2906 2619 2200 2959 3205 2455 2815 2761 2275 3519 Otu00003 1631 1323 1258 1055 1552 1509 1345 1255 1270 1180 Otu00005 1493 1416 1592 984 1131 879 1430 1448 1296 1431 Otu00004 1171 1164 1489 936 1514 1174 1271 1310 1207 1278 Otu00006 1160 1226 1245 764 1134 1271 906 983 1110 1251 Otu00007 1112 1042 1211 1060 1155 1103 1283 1198 1175 1485
We can see that the row names of the data have the code numbers for each operational taxonomic unit (OTU) of bacteria and there are integer counts of these in each sample. Now, we will check the dimensions of the data (number of rows and columns).
 4299 15
There 4299 bacterial operational taxonomic units (OTUs) as rows among 15 samples as columns.
Next we can check the structure of the factor information using
str. In this experiment, samples are categorised by a treatment ID (each combination of density and species), density levels and species levels. The other columns are for plotting purposes elsewhere.
'data.frame': 15 obs. of 6 variables: $ Treatment_ID: Factor w/ 5 levels "DC","DS","SC",..: 1 1 1 2 2 2 3 3 3 4 ... $ Density : Factor w/ 3 levels "Dense","Other",..: 1 1 1 1 1 1 3 3 3 3 ... $ Species : Factor w/ 3 levels "Caulerpa","Other",..: 1 1 1 3 3 3 1 1 1 3 ... $ pch1 : int 16 16 16 15 15 15 21 21 21 22 ... $ pch2 : int 4 22 21 15 16 NA NA NA NA NA ... $ legend : Factor w/ 6 levels "","DC - Dense Caulerpa",..: 6 5 4 3 2 1 1 1 1 1 ...
Drawing a heatmap
The basic function is
pheatmap. Let’s try it without special arguments, except that we will only look at the first 500 OTUs (they are arranged from highest to lowest total abundance already). The function
slice in dplyr can take any subset of numbered rows (see Subsetting data).
Bac.counts500 <- slice(Bac.counts, 1:500) pheatmap(Bac.counts500)
In the figure, samples are columns and bacterial OTUs are rows, with the colour representing the range of counts of each OTU in each sample. Red means most abundant (~3500 counts), blue the least abundant (0 counts) and light yellow somewhere in the middle. Note that both the rows and columns have been rearranged based on measures of similarity among rows and columns (see Cluster analysis).
Data transformation. Now you might notice that we have a scale issue. The data are full of rare-ish bacteria (blue) and that is all we can see on the heatmap. To visualise this more effectively, we can try a log10 transformation with + 1 constant to deal with zeros.
Bac.Log10.counts500 <- log10(Bac.counts500 + 1) pheatmap(Bac.Log10.counts500)
The view of the data has really changed with transformation, so has the row and column clustering. There appears to be some very abundant OTUs (red/yellow), some mid abundant (white/low) and low in low abundance (blue).
This would not have been seen so clearly if we did not cluster the rows and columns, and if we had just plotted them “as is” from the data table (although some row ordering had already been done here). You can see this if we draw the heatmap again without clustering the rows and columns.
pheatmap(Bac.Log10.counts500, cluster_rows = FALSE, cluster_cols = FALSE)
Colouring sample groups
Before looking further at how clustering effects the patterns observed, we should add some colours associated with the treatment groups next. The simplest method is to use the Bac.factors dataframe as input, taking care that you 1) specify categorical covariates (factor groups) and numerical covariates (e.g. concentration) properly, 2) that row names in Bac.factors match those in Bac.counts, and 3) then remove the factors you do not want to show.
To extract just the density and species columns from our factors data frame, we can use
select from the package dplyr and then use these to colour code our samples.
Bac.factorsDS <- select(Bac.factors, Density, Species) pheatmap(Bac.Log10.counts500, annotation_col = Bac.factorsDS)
The colours are pretty ugly. To make your own is tricky, but involves making named colour vectors and then adding them to a list. This represents the “colour annotation” information. We have to define colours for categorical covariate (factor groups) and colour ranges for numerical covariates (e.g. concentration).
# Reorder Density levels to Sparse, Dense, Other Bac.factorsDS$Density = factor(Bac.factorsDS$Density, levels = c("Sparse", "Dense", "Other")) DensityCol <- c("darkorchid", "darkorange", "grey80") names(DensityCol) <- levels(Bac.factorsDS$Density) # Reorder Species to Seagrass, Caulerpa, Other Bac.factorsDS$Species <- factor(Bac.factorsDS$Species, levels = c("Seagrass", "Caulerpa", "Other")) SpeciesCol <- c("forestgreen", "blue3", "grey80") names(SpeciesCol) <- levels(Bac.factorsDS$Species) # Add to a list, where names match those in factors dataframe AnnColour <- list( Density = DensityCol, Species = SpeciesCol) # Check the output AnnColour
$Density Sparse Dense Other "darkorchid" "darkorange" "grey80" $Species Seagrass Caulerpa Other "forestgreen" "blue3" "grey80"
We can now redraw the heatmap with our chosen colours.
pheatmap(Bac.Log10.counts500, annotation_col = Bac.factorsDS, annotation_colors = AnnColour)