Cluster Analysis

This TP is an introduction to cluster analysis using R. Cluster analysis is commonly used to search for groups in data; it is most useful when the groups are not known in advance. We will cluster tumor samples from the publicly available data set analyzed by Bittner et al. (2000). You may read more about the study at the Bittner web supplement. As usual, read the help documentation for each function you do not already know.


Bittner Data

Load the cluster package into R:

library(cluster)

Download the Bittner et al. data. Read this dataset into R and check that it has the right dimensions: there should be 8067 rows (spots) and 38 columns (arrays).

mel <- matrix(scan("melanoma.dat"),ncol=38,byrow=TRUE)
dim(mel)
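
The reading idiom above can be tried on a small file first. The sketch below is only an illustration: the toy matrix and the temporary file are made up and stand in for melanoma.dat, which is assumed to hold whitespace-separated numbers written row by row.

```r
# Toy stand-in for melanoma.dat (hypothetical values):
# 4 spots (rows) by 3 arrays (columns), written row by row.
toy.file <- tempfile(fileext = ".dat")
toy <- matrix(c(1.0, 2.0, 0.5,
                0.8, 1.2, 3.0,
                1.1, 0.9, 1.0,
                4.0, 0.3, 0.7), ncol = 3, byrow = TRUE)
write(t(toy), file = toy.file, ncolumns = 3)

# scan() returns one long vector in file order;
# byrow = TRUE puts the values back into rows.
toy.read <- matrix(scan(toy.file, quiet = TRUE), ncol = 3, byrow = TRUE)

# Stop with an error if the dimensions are not the expected ones.
stopifnot(identical(dim(toy.read), c(4L, 3L)))
```

For the real data, the same kind of check would compare dim(mel) with c(8067L, 38L).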

Data preprocessing

To replicate their results, you will need to apply the same preprocessing steps described in the paper: use only the melanoma samples (not the controls), truncate ratios at 0.02 and 50, take log2, and apply a global (across an entire slide) median normalization. Note that the code below also keeps only the first 3613 rows (spots):

mel.bittner <- mel[1:3613,1:31]
mel.bittner[mel.bittner < 0.02] <- 0.02
mel.bittner[mel.bittner > 50] <- 50
mel.bittner <- log2(mel.bittner)
mel.bittner.med <- apply(mel.bittner, 2, median)
mel.data <- sweep(data.matrix(mel.bittner), 2, mel.bittner.med)

You can check that the median for each of the 31 experiments (arrays) is 0 as follows:

apply(mel.data, 2, median)
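
The preprocessing steps can also be checked on simulated ratios, where the expected result is known. This is a sketch under assumptions: the data are random, and pmin/pmax are used as a one-step alternative to the two truncation assignments above. Because of floating-point arithmetic, the medians are compared with 0 using a small tolerance rather than exact equality.

```r
set.seed(1)
# Simulated ratios (not the Bittner data): 200 spots by 6 arrays,
# with two extreme values planted on purpose.
ratios <- matrix(rlnorm(200 * 6, meanlog = 0, sdlog = 1.5), ncol = 6)
ratios[1, 1] <- 0.001   # below the lower bound
ratios[2, 2] <- 400     # above the upper bound

# Truncate to [0.02, 50] in one step, then move to the log2 scale.
trunc.ratios <- pmin(pmax(ratios, 0.02), 50)
log.ratios <- log2(trunc.ratios)

# Subtract each array's (column's) median; "-" is the default FUN of sweep.
col.med <- apply(log.ratios, 2, median)
centered <- sweep(log.ratios, 2, col.med, FUN = "-")

# Column medians should be zero up to floating-point error.
new.med <- apply(centered, 2, median)
all(abs(new.med) < 1e-12)
```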


Hierarchical Clustering Algorithms

Now you are ready to perform some clustering. Cluster the melanoma samples (arrays) using hierarchical clustering. Try varying the between-cluster dissimilarity measure (the linkage) by modifying the “method” argument in hclust. Also try different dissimilarities between samples by modifying the “method” argument in the function dist. Some examples follow; read the help information ?hclust and ?dist so that you can try others.

Correlation distance with average linkage (what Bittner et al. did)

clust.cor.average <- hclust(as.dist(1-cor(mel.data)), method = "average")
plot(clust.cor.average)

Correlation distance with complete linkage

clust.cor.complete <- hclust(as.dist(1-cor(mel.data)), method = "complete")
plot(clust.cor.complete)

Euclidean distance with single linkage

clust.euclid.single <- hclust(dist(t(mel.data)), method = "single")
plot(clust.euclid.single)
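
Dendrograms are hard to compare by eye. One possible way to compare two hierarchical clusterings is to cut each tree into the same number of groups with cutree and cross-tabulate the assignments. The sketch below uses simulated data with two planted groups of samples; the data, the group sizes and the choice k = 2 are all assumptions made for illustration.

```r
set.seed(2)
# Simulated log ratios (not the Bittner data): 100 genes by 10 samples.
# Samples 1-5 are shifted in genes 1-20, samples 6-10 in genes 21-40.
sim <- matrix(rnorm(100 * 10), ncol = 10)
sim[1:20, 1:5] <- sim[1:20, 1:5] + 3
sim[21:40, 6:10] <- sim[21:40, 6:10] + 3
colnames(sim) <- paste0("S", 1:10)

# cor() works on columns, so this is a distance between samples.
d.cor <- as.dist(1 - cor(sim))
hc.average <- hclust(d.cor, method = "average")
hc.complete <- hclust(d.cor, method = "complete")

# Cut each tree into 2 groups and cross-tabulate the assignments.
grp.average <- cutree(hc.average, k = 2)
grp.complete <- cutree(hc.complete, k = 2)
table(average = grp.average, complete = grp.complete)
```

If the two linkages agree, each row of the table has a single nonzero cell. Cluster labels are arbitrary, so agreement does not require the nonzero cells to lie on the diagonal.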


Partition Clustering Algorithms

Now repeat the above analysis with two partitioning clustering algorithms: K-means (the kmeans function) and Partitioning Around Medoids (PAM, the pam function in the cluster package). For these algorithms you will need to specify the number of clusters; keep it small here, from 2 to 5 say. Find the cluster assignments of the 31 melanoma samples from each clustering. Here is one example of each:

K-means clustering

set.seed(1)  # kmeans starts from random centers; fixing the seed makes the result reproducible
clust.km.2 <- kmeans(t(mel.data),2)
clust.km.2$cluster
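
Because K-means starts from randomly chosen centers, results can change from run to run. A minimal sketch of one way to handle this, on simulated data: fix the seed, use the nstart argument of kmeans to keep the best of several random starts, and loop over the suggested range of cluster numbers. The data and the value nstart = 25 are assumptions, not part of the original analysis.

```r
set.seed(3)
# Simulated log ratios (not the Bittner data): 100 genes by 12 samples,
# with samples 7-12 shifted in the first 20 genes.
sim <- matrix(rnorm(100 * 12), ncol = 12)
sim[1:20, 7:12] <- sim[1:20, 7:12] + 3

# kmeans() clusters rows, so the samples must be in rows: hence t().
# nstart = 25 keeps the best of 25 random initialisations.
ks <- 2:5
fits <- lapply(ks, function(k) kmeans(t(sim), centers = k, nstart = 25))
names(fits) <- paste0("k", ks)

# Cluster labels for k = 2.
fits$k2$cluster

# Total within-cluster sum of squares for each k.
wss <- sapply(fits, function(f) f$tot.withinss)
wss
```

The within-cluster sum of squares generally decreases as k grows, so it is more useful for spotting where the decrease levels off than for picking the smallest value.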

PAM clustering

clust.pam.2 <- pam(as.dist(1-cor(mel.data)),2,diss=TRUE)
plot(clust.pam.2)
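
For PAM, the average silhouette width stored in the fitted object gives one possible summary for comparing different numbers of clusters. The sketch below runs on simulated data with two planted groups; the data are an assumption, and the silhouette is only one of several criteria that could be used here.

```r
library(cluster)

set.seed(4)
# Simulated log ratios (not the Bittner data): 100 genes by 12 samples.
# Samples 1-6 are shifted in genes 1-20, samples 7-12 in genes 21-40.
sim <- matrix(rnorm(100 * 12), ncol = 12)
sim[1:20, 1:6] <- sim[1:20, 1:6] + 3
sim[21:40, 7:12] <- sim[21:40, 7:12] + 3
colnames(sim) <- paste0("S", 1:12)
d.cor <- as.dist(1 - cor(sim))

# Average silhouette width for k = 2, ..., 5 (larger is better).
ks <- 2:5
avg.width <- sapply(ks, function(k) pam(d.cor, k, diss = TRUE)$silinfo$avg.width)
names(avg.width) <- ks
avg.width

# Cluster assignments and medoids (representative samples) for k = 2.
fit2 <- pam(d.cor, 2, diss = TRUE)
fit2$clustering
fit2$medoids
```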


Clustering Using Melanoma and Control Samples

Now repeat your clusterings with the controls included as well. The easiest way to set up the data is to change the column selection in the preprocessing step from 1:31 to 1:38, then rerun all the other steps. Are the controls clustering together, and separately from the melanoma samples?
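
One way to answer this question without reading the dendrogram leaf by leaf is to cross-tabulate the cluster assignments against the sample type. The sketch below uses simulated data; it assumes that the first 31 columns are the melanoma arrays and the last 7 the controls, which should be checked against the real data set.

```r
set.seed(5)
# Simulated log ratios (not the Bittner data): 200 genes by 38 samples.
# Columns 1-31 play the role of melanoma arrays, 32-38 of controls
# (an assumed layout).
sim <- matrix(rnorm(200 * 38), ncol = 38)
sim[1:40, 1:31] <- sim[1:40, 1:31] + 3
sim[41:80, 32:38] <- sim[41:80, 32:38] + 3
sample.type <- factor(rep(c("melanoma", "control"), times = c(31, 7)))

hc <- hclust(as.dist(1 - cor(sim)), method = "average")
grp <- cutree(hc, k = 2)

# Rows are clusters, columns are sample types. If the controls cluster
# together and apart from the tumors, each row has a single nonzero cell.
type.table <- table(cluster = grp, type = sample.type)
type.table
```

The same table can be built from the output of kmeans or pam by replacing grp with the corresponding vector of cluster assignments.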

What conclusions do you draw from these analyses?