The purpose of this exercise is to introduce you to using
An excellent source of
To proceed, you will need to start
To get some practice using statistical functions
and performing small calculations in
The
These data vectors are too small to really require summaries;
we can quickly generate some artificial data to get practice using
Now, make a number of different histograms for this same data set, varying the number of bars (also called ‘bins’ or ‘cells’). It will be easiest to see several at once if you first set up the plotting region for multiple plots, as in the first line below:
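A minimal sketch of this step, assuming the simulated data sit in a vector called x (a hypothetical name; 100 standard normal values are generated here only for illustration). The first line sets up a 2 x 2 plotting region:

```r
old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of plots; keep old settings

# Assumption: 'x' stands in for your own simulated data vector.
set.seed(1)
x <- rnorm(100)

# 'breaks' given as a single number is only a suggestion to hist(),
# so the actual number of bars may differ slightly.
hist(x, breaks = 5,  main = "breaks = 5")
hist(x, breaks = 10, main = "breaks = 10")
hist(x, breaks = 20, main = "breaks = 20")
hist(x, breaks = 40, main = "breaks = 40")

par(old_par)                      # restore the previous plotting layout
```

Comparing the four panels shows how much the apparent shape of the distribution depends on the choice of bins.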
You should also make a boxplot of this simulated data set for a
different type of visualization:
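One possible way to do this, again assuming a hypothetical data vector x of 100 simulated normal values:

```r
# Assumption: 'x' stands in for your own simulated data vector.
set.seed(1)
x <- rnorm(100)

boxplot(x, main = "Simulated data", ylab = "value")

# The five numbers the plot is built from:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
b <- boxplot.stats(x)$stats
print(b)
```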
Obtain other summary statistics for your simulated data: mean,
standard deviation (SD), median, interquartile range (IQR), and median absolute
deviation from the median (MAD).
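A short sketch of these summaries using base R functions, assuming a hypothetical data vector x:

```r
# Assumption: 'x' stands in for your own simulated data vector.
set.seed(1)
x <- rnorm(100)

stats <- c(mean   = mean(x),
           SD     = sd(x),
           median = median(x),
           IQR    = IQR(x),
           MAD    = mad(x))   # scaled by 1.4826 by default
print(round(stats, 3))

# Unscaled MAD: the plain median of absolute deviations from the median.
raw_mad <- mad(x, constant = 1)
```

Note that mad() multiplies by the constant 1.4826 by default, which makes it comparable to the SD for normally distributed data.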
If you don’t know which
It is a little more interesting to look at real data.
Load the package
You can find the variable names with
Does the distribution (pattern of variability) of either variable appear to depend on the presence or absence of glucose? Do we have enough information to decide whether glucose is causing any difference?
Now, we will get a little closer to the scale of microarray data by simulating a much larger data set. We will use the normal (Gaussian) distribution for convenience; you should not take this to mean that it is realistic for microarray data. It is not!
Usually, preprocessed data values are stored in a data matrix or data frame with rows representing probes or probe sets (genes) and columns representing mRNA samples (chips). Note that this is the transpose of the usual data matrix in statistics, where rows represent individuals and columns represent different variables. With a microarray, we have measured thousands of variables on each sample, but usually on only a relatively small number of samples.
In order to be able to reproduce results of a simulation experiment,
it is a good idea to set (or save) a seed value for the random number generator,
see
After you have set the seed, create a (simulated) data matrix of
gene expression values with 30,000 rows (genes) and 10 columns (arrays/chips)
with (independent) normal random variables.
Gene expression data are not really independent across different genes,
or normally distributed either, but we will just pretend here.
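As an illustration, the simulation could look like the following sketch. The seed value and the names expr, n_genes and n_arrays are arbitrary choices made for this example:

```r
set.seed(2024)      # arbitrary seed, so the simulation can be reproduced

n_genes  <- 30000   # rows: genes
n_arrays <- 10      # columns: arrays/chips

# Fill the matrix with independent standard normal values.
expr <- matrix(rnorm(n_genes * n_arrays),
               nrow = n_genes, ncol = n_arrays)
colnames(expr) <- paste0("array", 1:n_arrays)

dim(expr)           # number of rows and columns
```

Running the code again with the same seed gives exactly the same matrix.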
Useful functions here are
One extremely useful feature of
For each single array (column), you can visually assess how closely the data appear to follow the normal distribution with a QQ plot. You can set up multiple plots per page (4 here), and change the default plotting character as follows:
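A sketch of one way to do this for the first four arrays, assuming the simulated matrix is called expr (a hypothetical name; it is regenerated here so that the example runs on its own):

```r
old_par <- par(mfrow = c(2, 2), pch = ".")  # 4 plots per page, dot symbol

# Assumption: 'expr' stands in for your own simulated data matrix.
set.seed(2024)
expr <- matrix(rnorm(30000 * 10), nrow = 30000, ncol = 10)

for (j in 1:4) {
  qqnorm(expr[, j], main = paste("Array", j))  # normal QQ plot of column j
  qqline(expr[, j])                            # reference line through the quartiles
}

par(old_par)                                   # restore previous settings
```

With 30,000 points per plot, the small "." plotting character keeps the plots readable and fast to draw.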
(You remembered to read the help for
If the data are approximately normally distributed, then the points should mostly fall on the line. Since we have simulated normally distributed data here, deviations from normal are essentially random variation (assuming that we trust the random number generator). You can look at all 10 QQ plots.
Say that we are now interested in comparing the distributions of
expression values across the different chips.
Then we might look at side by side boxplots, for example.
Make the 10 side by side boxplots;
if you have called your data matrix
(Remember to read the help for the function
Now that you're happy with subsetting (!),
redo the boxplots, this time ordering them by their medians
(so that the array with the smallest median is first, largest is last).
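One possible approach, sketched under the assumption that the data matrix is called expr (a hypothetical name; the matrix is regenerated here so that the example runs on its own):

```r
# Assumption: 'expr' stands in for your own simulated data matrix.
set.seed(2024)
expr <- matrix(rnorm(30000 * 10), nrow = 30000, ncol = 10)
colnames(expr) <- paste0("array", 1:10)

meds <- apply(expr, 2, median)  # median of each column (array)
ord  <- order(meds)             # column indices, smallest median first

# Subsetting the columns in this order rearranges the boxplots.
boxplot(expr[, ord], las = 2, main = "Arrays ordered by median")
```

Because the columns are named, the axis labels still show which array is which after reordering.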
Useful functions here are
As practice for the exam, you could write a short lab report on a small exploratory data analysis on data simulated as follows:
The first 3 arrays are 'Control' and the last 5 are 'Treated', while the values in the matrix represent log ratios of a sample compared to a common reference (we will learn what this means later). The actual values that you will see here are extremely unrealistic, but will give you some practice exploring before we start with some real data.
The purpose of this (simulated) study is to look for differences between treatment and control conditions. In addition to looking at overall distributional characteristics within and between slides, you should also consider looking at differences at the individual gene level (apply should be useful here). We will learn more about how to do this later, but you might already have a feel for some simple ways to approach this. Try out any statistical ideas you have that might be applied here, and see if you are able to attach a p-value to any finding (but don't worry if you can't; it's not worth spending too much time on it at this point).
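As a starting point only, here is a sketch of one simple gene-level comparison using apply. The matrix logratio below is hypothetical (1,000 rows of pure noise, not the data simulated for this exercise), and a two-sample t-test per gene is just one of several ideas you might try:

```r
# Hypothetical stand-in data: 1000 genes, 3 control + 5 treated arrays.
set.seed(1)
n_genes  <- 1000
logratio <- matrix(rnorm(n_genes * 8), nrow = n_genes, ncol = 8)
group    <- factor(rep(c("Control", "Treated"), times = c(3, 5)))

# Difference in mean log ratio (Treated minus Control) for each gene.
diff_means <- apply(logratio, 1, function(g)
  mean(g[group == "Treated"]) - mean(g[group == "Control"]))

# Two-sample (Welch) t-test p-value for each gene.
p_values <- apply(logratio, 1, function(g)
  t.test(g[group == "Treated"], g[group == "Control"])$p.value)

# Rows with the smallest p-values.
head(order(p_values))
```

Keep in mind that with thousands of genes, some small p-values are expected by chance alone, even when no gene truly differs.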
Your report should give a short background/purpose,
the number of genes and numbers and types of arrays,
a description of your analyses, along with any supporting plots/tables,
and a summary with conclusions (including which genes (rows) are different
between treatments and controls, if you are able to find any).
This should not be more than 2 pages (1 page is probably enough).
Your report should be in PDF format, and can be in either English or French.
Please name your report file lastname.pdf
(for example, mine would be goldstein.pdf).
You can email your report to me