Cluster analysis is commonly used to search for groups in data;
it is most effective when the groups are not already known.
Here you will get a light introduction to cluster analysis in
Load the
Download the Bittner et al. data
here.
The original images are not publicly available,
but image-processed signals are.
Read this dataset into
To replicate their results, you will need to apply the same preprocessing steps described in the paper: they used only the melanoma samples (not the controls), truncated ratios at 0.02 and 50, took log2, and applied global (across an entire slide) median normalization:
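The preprocessing steps listed above could be sketched roughly as follows. This is only an illustration on simulated ratios: the matrix `ratios`, its dimensions, and the assumption that the first 31 columns are the melanoma samples are all hypothetical stand-ins, so adapt them to the object you actually read in.

```r
# Simulated stand-in for the raw ratio matrix (genes in rows, arrays in columns).
# Hypothetical: replace with the data you read from the downloaded file.
set.seed(1)
n_genes <- 200
ratios <- matrix(exp(rnorm(n_genes * 38, sd = 2)), nrow = n_genes, ncol = 38)

# Assumption: the first 31 columns are the melanoma samples.
mel <- ratios[, 1:31]

# Truncate the ratios to the interval [0.02, 50].
mel <- pmin(pmax(mel, 0.02), 50)

# Take log2 of the truncated ratios.
log_mel <- log2(mel)

# Global median normalization: subtract each array's median from that array.
col_medians <- apply(log_mel, 2, median)
norm_mel <- sweep(log_mel, 2, col_medians)
```

Here `sweep()` subtracts the per-column medians, so each array (slide) ends up centered at 0 on the log2 scale.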
You could check that the median for each of the 31 experiments is 0 as follows:
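One possible way to do this check is sketched below on simulated data. The matrix `x` is a hypothetical placeholder for your normalized log2 ratios. A small tolerance is used rather than an exact comparison with 0, since floating-point arithmetic can leave tiny residuals.

```r
# Simulated, median-normalized matrix standing in for the real data (hypothetical).
set.seed(2)
x <- matrix(rnorm(50 * 31), nrow = 50, ncol = 31)
x <- sweep(x, 2, apply(x, 2, median))

# Median of each of the 31 columns (experiments).
meds <- apply(x, 2, median)

# TRUE if every median is 0 up to rounding error.
all_zero <- max(abs(meds)) < 1e-12
print(all_zero)
```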
Now you are ready to perform some clustering.
Cluster the melanoma samples (arrays) using the (hierarchical)
You should try varying the between-cluster dissimilarity measure
by modifying the “method” argument in
Correlation distance with average linkage (what Bittner et al. did):
Correlation distance with complete linkage:
Euclidean distance with single linkage:
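The three combinations above might be set up along these lines. This is a minimal sketch on simulated data, using base R's `dist()`, `hclust()`, and `cutree()`. It assumes that "correlation distance" means 1 minus the Pearson correlation between arrays, which is one common definition but may not match the paper exactly. The matrix `x` and the sample names are made up.

```r
# Simulated normalized data: 100 genes (rows) by 31 samples (columns). Hypothetical.
set.seed(3)
x <- matrix(rnorm(100 * 31), nrow = 100, ncol = 31,
            dimnames = list(NULL, paste0("S", 1:31)))

# Correlation distance between samples: 1 - Pearson correlation of the columns.
d_cor <- as.dist(1 - cor(x))
hc_avg <- hclust(d_cor, method = "average")
hc_complete <- hclust(d_cor, method = "complete")

# Euclidean distance between samples; dist() compares rows, so transpose first.
d_euc <- dist(t(x))
hc_single <- hclust(d_euc, method = "single")

# Draw one dendrogram and cut it into two groups.
plot(hc_avg, main = "Correlation distance, average linkage")
groups <- cutree(hc_avg, k = 2)
print(table(groups))
```

Comparing the dendrograms from `hc_avg`, `hc_complete`, and `hc_single` side by side gives a feel for how much the apparent grouping depends on the choice of distance and linkage.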
Now repeat the above analysis with a partitioning clustering algorithm
You can also try out the partitioning clustering algorithm PAM
(Partitioning Around Medoids).
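As a rough illustration, PAM can be run on a dissimilarity object with `pam()` from the `cluster` package. The sketch below uses simulated data and assumes a correlation-based distance and k = 2 clusters; both choices are placeholders for whatever you decide to use.

```r
library(cluster)

# Simulated normalized data: 100 genes by 31 samples (hypothetical).
set.seed(4)
x <- matrix(rnorm(100 * 31), nrow = 100, ncol = 31)

# Correlation distance between samples (assumed: 1 - Pearson correlation).
d_cor <- as.dist(1 - cor(x))

# Partition the 31 samples into k = 2 clusters around medoids.
fit <- pam(d_cor, k = 2)
print(fit$medoids)            # the samples chosen as medoids
print(table(fit$clustering))  # cluster sizes

# Silhouette widths give one view of how well separated the clusters are.
sil <- silhouette(fit)
avg_width <- mean(sil[, "sil_width"])
print(avg_width)
```

Calling `plot(sil)` draws the silhouette plot, which can be a useful graphical display for a partitioning method.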
See
Now do the clustering with the controls as well. The easiest way to set up the data is to change the number of columns above from 31 to 38, then repeat all the other steps. Do the controls cluster together, and separately from the melanoma samples?
Filter the genes according to some criterion of your choice, so that you end up with about 50 – 100 genes. This is so that you can practice making a heat diagram. You might choose to select those with the most variability across the samples, or those with the highest standardized difference between the 2 groups that Bittner et al. distinguished or between 2 clusters that you determine yourself. There are many possibilities here.
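For instance, filtering on variability could look something like the sketch below. It keeps the 100 genes with the largest variance across samples. The data are simulated, and both the variance criterion and the cutoff of 100 are just one possible choice among those mentioned above.

```r
# Simulated normalized data: 1000 genes by 31 samples (hypothetical).
set.seed(5)
x <- matrix(rnorm(1000 * 31), nrow = 1000, ncol = 31,
            dimnames = list(paste0("gene", 1:1000), NULL))

# Variance of each gene (row) across the samples.
gene_var <- apply(x, 1, var)

# Indices of the 100 most variable genes.
top <- order(gene_var, decreasing = TRUE)[1:100]

# Reduced matrix containing only the selected genes.
x_top <- x[top, ]
print(dim(x_top))
```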
Choose one of the clusterings that you carried out above and redo it using only the selected (50 – 100) genes. As an illustration, here's a simple example using just the first 100 genes:
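A heat diagram of this kind can be drawn with base R's `heatmap()`. The following is a sketch on simulated data rather than the original example: the matrix `x` is made up, and the correlation distance with average linkage is an assumed choice passed in through the `distfun` and `hclustfun` arguments.

```r
# Simulated normalized data: 500 genes by 31 samples (hypothetical).
set.seed(6)
x <- matrix(rnorm(500 * 31), nrow = 500, ncol = 31,
            dimnames = list(paste0("gene", 1:500), paste0("S", 1:31)))

# Keep just the first 100 genes.
x100 <- x[1:100, ]

# heatmap() applies distfun to the rows of the matrix it is given
# (and to the transposed matrix for the columns), so correlate rows here.
cor_dist <- function(m) as.dist(1 - cor(t(m)))
avg_link <- function(d) hclust(d, method = "average")

# Samples are clustered along the columns, genes along the rows.
hm <- heatmap(x100, distfun = cor_dist, hclustfun = avg_link, scale = "none")
```

The returned list (here `hm`) contains `rowInd` and `colInd`, the row and column orderings used in the display.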
The 31 samples are clustered along the columns, and the 100 genes are in the rows. You will notice that the default color scheme is not very nice. You can change this, though, by using a different palette or making your own. If you are lucky, you might find an interesting pattern!
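One way to make your own palette is with `colorRampPalette()`, as in the sketch below. The green-black-red scheme and the 64 color levels are arbitrary choices for illustration, and the data are simulated.

```r
# Simulated 100 genes by 31 samples (hypothetical stand-in for the filtered data).
set.seed(7)
x100 <- matrix(rnorm(100 * 31), nrow = 100, ncol = 31)

# Build a 64-level palette running from green through black to red.
gr_pal <- colorRampPalette(c("green", "black", "red"))(64)

# Pass the palette to heatmap() through the col argument.
heatmap(x100, col = gr_pal, scale = "none")
```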
You can experiment with making heat diagrams for different clusterings,
and explore some of the
(Optional) It might also be interesting to explore different ways
of estimating the number of
clusters using the
Based on all of the above, what do you think about the reported discovery of a new subclass of melanoma?
For practice (remember that exam??), you can send me a short report including the purpose of the investigation, an explanation of the clustering method you finally decide on along with graphical displays (heatmap, dendrogram, silhouette, ... depending on what is appropriate for the choices you made). This report should not be more than about 4 pages.
As usual, your report can be in English or French.
Please send your report as a pdf file,
and follow the naming convention: lastname.pdf
(e.g. my report would be goldstein.pdf).
If you email your report to me