Statistics and Probability, TP 1

The purpose of this exercise is to introduce you to R (or to remind you of it!) through a simple session, and to help you start gaining some facility with R commands; the code here is meant to get you started. You will also carry out a few simple exploratory analyses to turn in (as part of the course exam). These are indicated by the numbered parts.

You will get the most out of the session if you also do some exploring on your own: check the help files for each function to learn what default values and optional arguments are there, and try out your own variations.

An excellent source of R documentation is the Comprehensive R Archive Network, or CRAN. There is a Swiss mirror site at http://cran.ch.r-project.org/. If you go to that site, you will find several links under ‘Documentation’ (the fourth major entry on the left side). ‘Official’ documentation is available under ‘Manuals’; other helpful documentation is under ‘Contributed’. For additional practice, you can also download R and add-on packages onto your own computer at home if you have one. I strongly encourage this.

To proceed, you will need to start R; you can do this in the terminal window by typing the command

/unige/R_2.4.1/bin/R

You should become acquainted with the help facility within R; it can be your friend! The basic help command is help(): within the parentheses you type, inside double quotes, the name of the function whose help file you want to see, e.g. help("mean"). If you don’t know the exact command name, use help.search(), with the name of the concept inside double quotes within the parentheses.

R has a number of functions to create data vectors, including c(), seq(), and rep(). Find out what each of these does, and make some data vectors of your choice using each.
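
As a starting point, here is a minimal sketch of what each of the three functions returns. The particular values and the object names (v, s, r) are arbitrary choices made for illustration only.

```r
# c() combines its arguments into a single vector
v <- c(2, 5, 9)

# seq() builds a regular sequence: here from 1 to 10 in steps of 3
s <- seq(1, 10, by = 3)            # 1 4 7 10

# rep() repeats a value or a whole vector
r <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"

print(v)
print(s)
print(r)
```

From here you might try the length.out argument of seq() and the each argument of rep(), and compare the results with the help files.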

We can quickly generate some artificial data to get practice using R for simple graphical and numerical summaries. R can generate (pseudo) random numbers from different probability distributions; here is an example:

simdata <- rchisq(100,8)
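
Because the numbers are random, each run of the line above gives a different data set. If you want to be able to reproduce the same values later (for instance when writing up your report), one option is to fix the seed of the random number generator first. A minimal sketch follows; the seed value 2007 is an arbitrary choice and not part of the exercise.

```r
set.seed(2007)                   # arbitrary seed; any integer works
simdata <- rchisq(100, df = 8)   # 100 draws from a chi-squared distribution with 8 degrees of freedom
length(simdata)                  # 100
head(simdata)                    # the first six simulated values
```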

Now, make a number of different histograms for this same data set, varying the number of bars (also called ‘bins’ or ‘cells’). It will be easiest to see several at once if you first set up the plotting region for multiple plots, as in the first line below:

par(mfrow=c(2,2))
hist(simdata) # this is the default version
hist(simdata,freq=FALSE) # what’s the difference here?
hist(simdata,freq=FALSE,breaks=2) # experiment with breaks
bps <- c(0,2,4,6,8,10,15,25) # make your own breakpoints
hist(simdata,freq=FALSE,breaks=bps)

The # sign indicates a comment: anything occurring after this sign on a line is ignored by R (but can be very useful in programming, as it provides a means for documenting your code).

You should also make a boxplot of this simulated data set for a different type of visualization: boxplot(simdata).
1. Approximate the five-number summary (min, Q1, median, Q3, max) by looking at the boxplot, and then obtain the five-number summary in R with quantile(simdata).

2. Obtain other summary statistics for your simulated data: mean, standard deviation (SD), median, interquartile range (IQR), median absolute deviation from the median (MAD). If you don’t know which R functions there are to compute these, use help.search().

It is a little more interesting to look at real data. Load the package ISwR, and examine the variables in hellung (don't forget to read the help file).

library(ISwR)
?hellung
data(hellung)
attach(hellung)

You can find the variable names with names(hellung), and can summarize the data set with summary(hellung).

3. Which of the variables does it not make sense to summarize like this? Why?

Make a boxplot of the variable conc.

4. Now, make side-by-side boxplots of conc, one for each value of glucose; do the same with diameter (we will learn more about this notation soon):

boxplot(conc ~ glucose)
boxplot(diameter ~ glucose)
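
The formula y ~ g can be read as ‘y split by the groups in g’. As a sketch of the same notation on a different data set, the code below uses ToothGrowth, which ships with R, and passes the data frame through the data argument instead of relying on attach(). The choice of ToothGrowth is only for illustration and is not part of the exercise.

```r
# ToothGrowth is a built-in data frame with columns
# len (numeric), supp (factor with levels OJ and VC) and dose (numeric)
bp <- boxplot(len ~ supp, data = ToothGrowth,
              xlab = "supp", ylab = "len")   # one box per level of supp

bp$names   # the group labels shown on the x axis
bp$n       # the number of observations in each group
```

Using data = keeps the variable lookup explicit, which can help avoid confusion if an attached data set and your workspace happen to contain objects with the same name.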

5. Does the distribution (pattern of variability) of either variable appear to depend on the presence or absence of glucose? Explain why or why not.
6. Do we have enough information to decide whether glucose is causing any difference? Explain your answer.

You will not need to save any R objects that you created today (unless you wish to), so feel free to ‘clean up’ after yourself with rm(). To remove all objects in your workspace (permanently and irreversibly, so be careful!), type rm(list=ls()), or simply answer n when asked if you wish to save your workspace image. This question appears on the screen when you quit R; to quit, type q(). Before quitting, try just typing q without any parentheses. This might help you to remember that you need the parentheses!
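
To see the effect of rm() on a single object before clearing everything, you could try a small sketch like the following; the object name tmp_vec is a throwaway name chosen for illustration.

```r
tmp_vec <- c(1, 2, 3)    # create a throwaway object
"tmp_vec" %in% ls()      # TRUE: it is listed in the workspace
rm(tmp_vec)              # remove just this one object
"tmp_vec" %in% ls()      # FALSE: it is gone
```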

Your responses to the questions should be in complete sentences, and can be in either English or French. You can email your report to me (Darlene.Goldstein@epfl.ch) before Monday 16 April 2007.