Statistics for Genomic Data Analysis, Week 1 Exercises

The purpose of this exercise is to introduce you to using R (or remind you!) with a simple session, and for you to start gaining some facility with R commands; the code here is to get you started. You will get the most out of the session if you also do some exploring on your own: check the help files for each function to learn what default values and optional arguments are there, and try out your own variations.

An excellent source of R documentation is the Comprehensive R Archive Network, or CRAN. There is a Swiss mirror site at http://cran.ch.r-project.org/. If you go to that site, you will find several links under ‘Documentation’ (the fourth major entry on the left side). ‘Official’ documentation is available under ‘Manuals’; other helpful documentation is under ‘Contributed’. For additional practice, you can also download R and add-on packages onto your own computer at home if you have one. I strongly encourage this.

To proceed, you will need to start R; you can do this in the terminal window by typing the command R at the prompt.

You should become acquainted with the help facility within R, it can be your friend! The basic help command is help() – within the parentheses () you would type (inside of double quotes) the name of a function whose help file you want to see, e.g. help(“mean”). If you don’t know the exact command name, use help.search(), with the name of the concept inside double quotes within the parentheses.

R has a number of functions to create data vectors, including: c(),seq(),rep(). Find out what each of these do, and make some data vectors of your choice using each.

To get some practice using statistical functions and performing small calculations in R, create a weight and corresponding height vector for computing body mass index (bmi) (this example is from the book Introductory Statistics with R by Peter Dalgaard):

weight <- c(60,72,57,90,95,72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
bmi <- weight / height^2
bmi # Type this in R to see the computed values
plot(height,weight) # A basic scatterplot

The # sign indicates a comment: anything occurring after this sign on a line is ignored by R (but can be very useful in programming at it provides a means for documenting your code).

These data vectors are too small to really require summaries; we can quickly generate some artificial data to get practice using R for simple graphical and numerical summaries. R can generate (pseudo) random numbers from different probability distributions; here is an example:

simdata <- rchisq(100,8)

Now, make a number of different histograms for this same data set, varying the number of bars (also called ‘bins’ or ‘cells’). It will be easiest to see several at once if you first set up the plotting region for multiple plots, as in the first line below:

par(mfrow=c(2,2))
hist(simdata) # this is the default version
hist(simdata,freq=FALSE) # what’s the difference here?
hist(simdata,freq=FALSE,breaks=2) # experiment with breaks
bps <- c(0,2,4,6,8,10,15,25) # make your own breakpoints
hist(simdata,freq=FALSE,breaks=bps)

You should also make a boxplot of this simulated data set for a different type of visualization: boxplot(simdata). Approximate the 5 number summary (min, Q1, median, Q3, max) by looking at the boxplot, and then obtain the 5 number summary in R with quantile(simdata).

Obtain other summary statistics for your simulated data: mean, standard deviation (SD), median, interquartile range (IQR), median absolute deviation from the median (MAD). If you don’t know which R functions there are to compute these, use help.search().

It is a little more interesting to look at real data. Load the package ISwR, and examine the variables in hellung (don't forget to read the help file).

library(ISwR)
?hellung
data(hellung)
attach(hellung)

You can find the variable names with names(hellung), and can summarize the data set with summary(hellung). Which of the variables does it not make sense to summarize like this? Make a boxplot of the variable conc. Now, make side by side boxplots of conc, one for each value of glucose; do the same with diameter (we will learn more about this notation soon):

boxplot(conc ~ glucose)
boxplot(diameter ~ glucose)

Does the distribution (pattern of variability) of either variable appear to depend on the presence or absence of glucose? Do we have enough information to decide whether glucose is causing any difference?

Now, we will get a little closer to the scale of microarray data by simulating a much larger data set. We will use the normal (Gaussian) distribution for convenience - you should not take this to mean that it is realistic for microarray data: it is not!

Usually, preprocessed log-ratios (we'll learn what this means next time) are stored in a data matrix with rows representing probes (genes) and columns representing mRNA samples (slides). Note that this is the transpose of the usual data matrix in statistics, where rows represent individuals and columns are for different variables. With a microarray, we have measured thousands of variables on each sample, but usually only with a small number of samples.

In order to be able to reproduce results of a simulation experiment, it is a good idea to set (or save) a seed value for the random number generator, see ?set.seed. You should set a seed here so that I will be able to reproduce your results, you can choose the argument for set.seed to be your lucky number.

After you have set the seed, create a (simulated) data matrix of log ratios with 30,000 rows (genes) and 10 columns (arrays) with (independent) normal random variables. Gene expression data are not really independent across different genes, or normally distributed either, but we will just pretend here. Useful functions here are rnorm and matrix. If you want to be fancy, you could attach names to the arrays (see ?colnames).

One extremely useful feature of R is its subsetting (or indexing) capability, via the subset operator [ ]. For example, to access the first column of my.data, you would type my.data[,1]; if instead you wanted to see rows 2, 4, and 5, you would type my.data[c(2,4,5),]. Play around with subsetting your large matrix.

For each single array (column), you can visually assess how closely the data appear to follow the normal distribution with a QQ plot. You can set up multiple plots per page (4 here), and change the default plotting character as follows:

par(mfrow=c(2,2),pch=".")

(You remembered to read the help for par, right??) If you have called your data matrix my.data, you can get a QQ plot for the first array (column) by typing:

qqnorm(my.data[,1])
qqline(my.data[,1])

If the data are approximately normally distributed, then the points should mostly fall on the line. Since we have simulated normally distributed data here, deviations from normal are essentially random variation (assuming that we trust the random number generator). You can look at all 10 QQ plots.

Say that we are now interested in comparing the distributions of log ratios across the different arrays. Then we might look at side by side boxplots, for example. Make the 10 side by side boxplots; if you have called your data matrix my.data, you can easily accomplish this by typing

par(mfrow=c(1,1)) # to get just 1 plot
boxplot(data.frame(my.data))

(Remember to read the help for the function data.frame).

Now that you're happy with subsetting (!), redo the boxplots, this time ordering them by their medians (so that the array with the smallest median is first, largest is last). Useful functions here are apply and order, along with subsetting. It might take you a little while to figure out how to do this, but spend some time trying before you click here to see how I did it.

As practice for the exam, you should write a short lab report on a small exploratory data analysis on data simulated as follows:

set.seed(14343)
sim.arrays <- cbind(matrix(rnorm(60000),ncol=3),matrix(rnorm(100000,.3,1.7),ncol=5))
colnames(sim.arrays) <- c("C1","C2","C3","T1","T2","T3","T4","T5")

The first 3 arrays are 'Control' and the last 5 are 'Treated', while the values in the matrix represent log ratios of a sample compared to a common reference (we will learn what this means later). The actual values that you will see here are extremely unrealistic, but will give you some practice exploring before we start with some real data.

The purpose of this (simulated) study is to look for differences between treatment and control conditions. In addition to looking at overall distributional characteristics within and between slides, you should also consider looking at differences on the individual gene level (apply should be useful here). We will learn more about how to do this later, but you might already have a feel for some simple ways to approach this - you should try out any statistical ideas you have that might be applied here, and see if you are able to attach a p-value to any finding (but don't worry if you can't, it's not worth spending too much time on it at this point).

Your report should give a short background/purpose, the number of genes and numbers and types of arrays, a description of your analyses, along with any supporting plots/tables, and a summary with conclusions (including which genes (rows) are different between treatments and controls, if you are able to find any). This should not be more than 2 pages (again, 1 page is probably enough). Your report can be in English or French. You can email your report to me (darlene.goldstein at epfl.ch) before next Wednesday (26 September 2012).

You will not need to save any R objects that you created today (unless you wish to), so feel free to ‘clean up’ after yourself with rm(). To remove all objects in your workspace (permanently and irreversibly, so be careful!), type rm(list=ls()), or simply answer n when asked if you wish to save your workspace image. This question appears on the screen when you quit R; to quit, type q(). Before quitting, try just typing q without any parentheses. This might help you to remember that you need the parentheses!