Cluster Analysis
This practical (TP) introduces cluster analysis in R.
Cluster analysis is commonly used to search for groups in data;
it is most useful when the groups are not known in advance.
We will be clustering tumor samples from the publicly available
data set analyzed by Bittner et al. (2000).
You may read more about the study at the
Bittner web
supplement.
As usual, read the
help
documentation for each function you do not already know.
Bittner Data
Load the cluster package into R:
library(cluster)
Download the
Bittner et al. data.
Read this dataset into R and check that it has the right number of
rows and columns; there should be 8067 rows (spots) and 38 columns (arrays).
mel <- matrix(scan("melanoma.dat"),ncol=38,byrow=TRUE)
dim(mel)
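To see why byrow=TRUE is needed, here is a minimal sketch on a tiny stand-in file. The file and its values are invented for illustration and are not part of the Bittner data. scan reads the numbers in file order, one file row after another, so the matrix must be filled row by row to keep each file row together.

```r
# Hypothetical two-line data file written to a temporary location.
tmp <- tempfile(fileext = ".dat")
writeLines(c("1 2 3", "4 5 6"), tmp)

# scan() returns one long vector: 1 2 3 4 5 6.
# byrow = TRUE fills the matrix row by row, matching the file layout.
m <- matrix(scan(tmp, quiet = TRUE), ncol = 3, byrow = TRUE)

dim(m)    # 2 3
m[2, 1]   # 4, the first value on the second line of the file

unlink(tmp)
```

Without byrow=TRUE the matrix would be filled column by column and the values from different file rows would be mixed.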
Data preprocessing
To replicate their results,
you will need to apply the same preprocessing steps described in the
paper: keep only the melanoma samples (not the controls), truncate ratios to the range
0.02 to 50, take log2, and apply global median normalization (centering each entire slide at its median).
The code below also keeps only the first 3613 rows (spots):
mel.bittner <- mel[1:3613,1:31]
mel.bittner[mel.bittner < 0.02] <- 0.02
mel.bittner[mel.bittner > 50] <- 50
mel.bittner <- log2(mel.bittner)
mel.bittner.med <- apply(mel.bittner, 2, median)
mel.data <- sweep(data.matrix(mel.bittner), 2, mel.bittner.med)
You can check that the median of each of the 31 arrays is 0 as follows:
apply(mel.data, 2, median)
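The same preprocessing steps can be traced on a small matrix where every value is visible. This is only a sketch: the toy matrix below is invented, and pmin/pmax is used here as an equivalent alternative to the two indexed assignments above.

```r
# Toy stand-in for the ratio matrix: 6 spots (rows) x 3 arrays (columns).
# The values are made up for illustration only.
toy <- matrix(c(0.001, 1, 2, 4, 80, 0.5,
                0.5,   2, 4, 8, 16, 1,
                0.25,  1, 1, 2, 60, 3),
              ncol = 3)

# Truncate ratios to the range [0.02, 50], then take log2.
toy.trunc <- pmin(pmax(toy, 0.02), 50)
toy.log <- log2(toy.trunc)

# Median-center each array; sweep() subtracts by default (FUN = "-").
toy.med <- apply(toy.log, 2, median)
toy.data <- sweep(toy.log, 2, toy.med)

range(toy.trunc)             # 0.02 50
apply(toy.data, 2, median)   # zero for every array, up to rounding error
```

Truncating before taking logs keeps extreme ratios from dominating the distances computed later.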
Hierarchical Clustering Algorithms
Now you are ready to perform some clustering.
Cluster the melanoma samples (arrays) using hierarchical clustering.
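Try different between-cluster dissimilarity measures (linkage methods)
by changing the “method” argument of
hclust.
Also try different dissimilarities between samples by
changing the “method” argument of the function
dist.
Note that dist computes distances between rows, so the matrix is transposed to cluster the arrays, whereas cor already works on columns.
Some examples follow; read the help pages
?hclust and
?dist
so that you can try others.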
Correlation distance with average linkage (what Bittner et al. did)
clust.cor.average <- hclust(as.dist(1-cor(mel.data)), method = "average")
plot(clust.cor.average)
Correlation distance with complete linkage
clust.cor.complete <- hclust(as.dist(1-cor(mel.data)), method = "complete")
plot(clust.cor.complete)
Euclidean distance with single linkage
clust.euclid.single <- hclust(dist(t(mel.data)), method = "single")
plot(clust.euclid.single)
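A dendrogram can be turned into explicit group labels with cutree. The sketch below compares three linkage methods on simulated data, since the real data file is not assumed to be available here. The matrix sim and its two built-in groups are invented for illustration.

```r
set.seed(1)

# Simulated stand-in for mel.data: 50 genes (rows) x 6 samples (columns).
# Samples 1-3 are shifted in genes 1-25, samples 4-6 in genes 26-50.
sim <- matrix(rnorm(50 * 6), nrow = 50)
sim[1:25, 1:3] <- sim[1:25, 1:3] + 3
sim[26:50, 4:6] <- sim[26:50, 4:6] + 3
colnames(sim) <- paste0("S", 1:6)

# Correlation distance between samples (columns).
d <- as.dist(1 - cor(sim))

# Cut each tree into two groups and keep the labels for comparison.
methods <- c("average", "complete", "single")
cuts <- lapply(methods, function(m) cutree(hclust(d, method = m), k = 2))
names(cuts) <- methods
cuts
```

When the groups are as well separated as in this simulation, the linkage methods agree. On real data they often do not, which is why it is worth comparing them.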
Partition Clustering Algorithms
Now repeat the above analysis with two partitioning
clustering algorithms:
K-means (kmeans) and
Partitioning Around Medoids (PAM) using the
pam function
in the cluster package.
For these algorithms you will need to specify the number of clusters;
keep it small here, say from 2 to 5.
Find the cluster assignments of the 31 melanoma samples from each clustering.
Here is one example of each:
K-means clustering
clust.km.2 <- kmeans(t(mel.data),2)
clust.km.2$cluster
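Because kmeans starts from randomly chosen centers, repeated runs can give different assignments. A common precaution, sketched here on simulated data, is to fix the random seed and use several random starts through the nstart argument. The matrix sim and the labels "A" and "B" are invented for illustration.

```r
set.seed(42)

# Simulated data with observations in rows (like t(mel.data)):
# 20 observations centered at 0 and 20 centered at 4, in 5 dimensions.
sim <- rbind(matrix(rnorm(20 * 5, mean = 0), ncol = 5),
             matrix(rnorm(20 * 5, mean = 4), ncol = 5))
truth <- rep(c("A", "B"), each = 20)

# nstart = 25 runs k-means from 25 random starts and keeps the best fit.
km <- kmeans(sim, centers = 2, nstart = 25)

# Cross-tabulate the cluster labels against the simulated groups.
table(cluster = km$cluster, truth)
km$tot.withinss   # total within-cluster sum of squares of the kept fit
```

The cluster numbers themselves are arbitrary, so compare clusterings with a cross-tabulation rather than by matching labels directly.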
PAM clustering
clust.pam.2 <- pam(as.dist(1-cor(mel.data)),2,diss=TRUE)
plot(clust.pam.2)
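One way to compare different numbers of clusters is the average silhouette width, which pam stores in silinfo$avg.width. The following is a sketch on simulated data with three built-in groups; it is not a claim about how many groups the melanoma data contain.

```r
library(cluster)
set.seed(7)

# Simulated stand-in for mel.data: 60 genes (rows) x 9 samples (columns),
# with three groups of three samples, each shifted in its own block of genes.
sim <- matrix(rnorm(60 * 9), nrow = 60)
sim[1:20, 1:3] <- sim[1:20, 1:3] + 3
sim[21:40, 4:6] <- sim[21:40, 4:6] + 3
sim[41:60, 7:9] <- sim[41:60, 7:9] + 3

d <- as.dist(1 - cor(sim))

# Average silhouette width for k = 2, ..., 5; larger is better.
ks <- 2:5
avg.sil <- sapply(ks, function(k) pam(d, k, diss = TRUE)$silinfo$avg.width)
names(avg.sil) <- ks
avg.sil
ks[which.max(avg.sil)]   # the k with the widest average silhouette
```

The silhouette plot produced by plot on a pam object shows the same quantity per sample, which helps to spot samples that sit between clusters.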
Clustering Using Melanoma and Control Samples
Now repeat your clusterings with the controls as well.
The easiest way to set up the data is to change the column range in the preprocessing step from 1:31 to 1:38,
then repeat all the other steps.
Are the controls clustering together, and separately from the melanoma samples?
What conclusions do you draw from these analyses?
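To answer the first question, it helps to cross-tabulate the cluster labels against the sample type. The sketch below uses simulated data with 10 "melanoma" and 4 "control" columns. For the real data you would need a type vector matching the actual column layout. The assumption that the controls are the last seven columns (32 to 38) is only inferred from the column change described above and should be checked against the data description.

```r
set.seed(3)

# Simulated data: 100 genes (rows); 10 melanoma-like and 4 control-like samples.
n.mel <- 10
n.ctl <- 4
mel.cols <- 1:n.mel
ctl.cols <- (n.mel + 1):(n.mel + n.ctl)

sim <- matrix(rnorm(100 * (n.mel + n.ctl)), nrow = 100)
sim[1:50, mel.cols] <- sim[1:50, mel.cols] + 2       # melanoma-like pattern
sim[51:100, ctl.cols] <- sim[51:100, ctl.cols] + 2   # control-like pattern

# Sample type labels; hypothetical, built to match the simulated layout.
type <- rep(c("melanoma", "control"), c(n.mel, n.ctl))

hc <- hclust(as.dist(1 - cor(sim)), method = "average")
tab <- table(cluster = cutree(hc, k = 2), type)
tab
```

If the controls cluster together and apart from the melanoma samples, each row of the table has a single nonzero entry. Calling plot(hc, labels = type) shows the same information on the dendrogram.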