You will carry out a comprehensive data analysis of an Affymetrix chip data set that is assigned to you individually, and write a report summarizing your findings. Your report is due by 12.00 (noon) on Friday 3 February 2017 at the latest.
As in the practice exam (TP 7), the aims are to identify genes that are differentially expressed between two conditions and to carry out a cluster analysis of one of the groups to see whether it contains any subgroups of patients (samples). The group to be clustered is specified in the email about your data.
You will receive an email with a brief description of your problem, the clinical data, and a link to the .cel files.
Please let me know right away if you have any trouble downloading your data files, or if you have any problem understanding the data or loading the relevant files into R.
For the differential expression analysis, you will use the chips from both groups to determine which genes (if any) are differentially expressed (DE) between the two groups. You should make a table of the top 50 most differentially expressed genes, and state the total number of DE genes according to a reasonable criterion (for example, an adjusted p-value smaller than some specific threshold).
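As a minimal sketch of the kind of criterion mentioned above, the following base R code adjusts a vector of per-gene p-values with the Benjamini-Hochberg method, counts the genes below a cutoff, and extracts the 50 smallest adjusted p-values. The p-values are simulated placeholders, and the names `pvals`, `alpha`, `n_de` and `top50` as well as the 0.05 cutoff are assumptions made for illustration; in your analysis the p-values would come from the model you fit.

```r
# Illustrative only: simulated p-values stand in for real per-gene p-values
set.seed(1)
n_genes <- 1000
pvals <- c(runif(900), rbeta(100, 0.5, 50))  # hypothetical mix of large and small p-values
names(pvals) <- paste0("gene", seq_len(n_genes))

alpha <- 0.05                                # assumed threshold, choose and justify your own
adj_p <- p.adjust(pvals, method = "BH")      # Benjamini-Hochberg adjusted p-values
n_de  <- sum(adj_p < alpha)                  # total number of genes called DE

# The 50 genes with the smallest adjusted p-values, as a small table
top50 <- head(sort(adj_p), 50)
top_table <- data.frame(gene = names(top50), adj_p = unname(top50))
```

Whatever criterion you use, state it explicitly in your report together with the resulting number of DE genes.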
Since this data set is larger than the practice exam data set, the design matrix X for your linear model for identifying DE genes will be quite large. You do not need to write out the entire design matrix, but you should indicate what the values of the entries are. For example, if you are parameterizing the model with an intercept and a coefficient to measure differential expression, then the first column of the design matrix consists of all 1's, while the second column contains 0 for one of the conditions and 1 for the other condition. Make sure that you give the correct interpretation of the second coefficient.
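The intercept-plus-coefficient parameterization described above can be sketched on a toy example. The six chips and the group labels `A` and `B` below are hypothetical; `model.matrix` is the standard base R function for building a design matrix from a formula.

```r
# Hypothetical labels: three chips from condition A, three from condition B
group <- factor(c("A", "A", "A", "B", "B", "B"), levels = c("A", "B"))

# Design matrix with an intercept and one indicator column
X <- model.matrix(~ group)
print(X)
# Column "(Intercept)" is 1 for every chip;
# column "groupB" is 0 for chips in A and 1 for chips in B.
```

With this coding, the first factor level (here `A`) is the reference group, so the second coefficient measures the difference in mean expression of B relative to A. Reversing the order of the levels changes the sign of that coefficient, which is why the interpretation needs to be stated explicitly.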
For the cluster analysis, you will be examining whether there are subgroups within one of the groups, so use only the chips from those patients. You should not re-compute RMA values or redo the quality assessment. (If you excluded any chips before, you should still leave them out.) The genes used in the cluster analysis should be selected independently of the list of DE genes and without regard to sample type (e.g. the most variable genes, the genes with the highest coefficient of variation, etc.).
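One possible way to carry out this kind of label-free gene selection followed by clustering is sketched below in base R. The expression matrix is simulated, and the choices of 50 genes, Euclidean distance, average linkage and two clusters are assumptions made only for illustration; they are not prescribed choices.

```r
# Illustrative only: simulated values stand in for RMA expression measures
set.seed(2)
expr <- matrix(rnorm(200 * 8), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("chip", 1:8)))

# Rank genes by their variance across chips, without using any sample labels
gene_var  <- apply(expr, 1, var)
top_genes <- names(sort(gene_var, decreasing = TRUE))[1:50]
sub <- expr[top_genes, ]

# Cluster the chips (columns): Euclidean distance, average linkage
d  <- dist(t(sub))
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)   # assumed number of clusters

# plot(hc) draws the dendrogram; heatmap(sub) gives a basic heatmap
```

In your report, say which selection rule, distance and linkage you used, since the resulting clusters can depend on all three.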
I will mark your report according to these criteria (see also additional guidelines), taking into account: overall presentation, statement of background and objectives, summary of quality assessment (including supporting graphs), description of statistical analyses carried out (including description of any models fitted, design matrix, contrasts if necessary, etc.), (apparent) correctness of results (including a table of the top 50 differentially expressed genes), cluster analysis and conclusions. As an appendix, you should also include a separate plain text file with clean R code so that I will be able to replicate the results you present.
You must work alone on this exam and turn in your own work. You may not speak to anyone about this exam except me. If you have any questions, ask me; do not ask anyone else for help, and do not give help to anyone. Giving or receiving help (from anyone but me) will be considered cheating on the exam, and in this case I will contact the Vice President for Academic Affairs and apply the most severe consequence possible.
The length of your final report is limited to 9 pages MAXIMUM, plus a 1 page list of the top 50 differentially expressed genes. (This limit does not include your R code or any references that you cite.) If the gene labels on your heatmap are too small to read, you can include an additional small table listing the genes in the order in which they appear in the heatmap; this will also NOT count towards the page limit.
Your final report should consist of 2 files: a pdf report file and a separate, PLAIN text file (not .doc, not .rtf) of your (cleaned) R code. Please name your files using the convention name.pdf, name.R (for example, my files would be called goldstein.pdf, goldstein.R). (If you are using Sweave or knitr, you can send the .Rnw or .Snw file instead.)
You should email your final report to me (darlene.goldstein at epfl.ch). If your files are too big to email, please contact me to make an alternative transfer arrangement. I should receive your final reports by 12.00 on Friday 3 February 2017 at the latest.
Please do not wait until the last minute to send your report, as I will not accept any report that is late.
Have fun and Good Luck!!