Statistics and Probability, TP 2

The purpose of this TP is twofold: first, to introduce you to some built-in distribution functions in R that you can use to obtain densities, CDFs, quantiles, and random numbers; and, second, to help you understand the Central Limit Theorem (CLT). As a bonus, you will also get some experience interpreting QQ plots.

You will simulate data to help you understand the difference between the SD of the observations and the SE of the sample mean, and to see just what it is that is supposed to have a normal distribution. You will also simulate coin flipping to estimate the probability of 'heads' on a single flip. This is conceptually similar to estimating a population proportion from a sample. Finally, you will investigate convergence of the binomial distribution to the normal.

As a reminder, your course grade is based on work that you turn in from the practicals. Responses that you need to turn in are indicated by bold numbered parts.

As usual, you should check the help files for any new functions (or any old ones you wish to review).

To proceed, you will need to start R; you can do this in the terminal window by typing the command

/unige/R_2.4.1/bin/R

I. Probability functions in R

Take a coin from your pocket and toss it 10 times (really!), keeping track of the number of times the coin lands 'Heads'. Now, this is the process you will get the computer to simulate for you. You can simulate 10 coin flips of a fair coin (p = .5) in R as follows:

rbinom(1,10,.5)

You will see a number between 0 and 10 (inclusive) appear on the screen. This number is the number of times your 'coin' landed heads. The proportion of heads is simply that number divided by the number of flips, 10 here. You can redo this command a few times to get a feel for the variability in the number of heads.
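
If you like, you can also draw several replicates in a single call (the 5 below is arbitrary); each entry in the result is the number of heads in a fresh set of 10 flips:

rbinom(5,10,.5)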

If you type ?rbinom in R, you will find three other functions described besides rbinom. Make sure you understand what these all do, and use the appropriate one(s) to answer the following (a sketch of the general pattern appears after this list):

a. What is the chance of getting exactly 3 out of 10 successes in a binomial distribution with success probability .7?
b. What is the chance of getting at least 3 successes?
c. At most 3?
d. Fewer than 3?
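
For orientation, here is the general pattern, using the fair coin from above rather than the exercise values (adapt the arguments yourself for parts a-d):

dbinom(5,10,.5)                      # P(exactly 5 heads in 10 flips)
pbinom(5,10,.5)                      # P(at most 5 heads)
pbinom(4,10,.5,lower.tail=FALSE)     # P(at least 5 heads), i.e. P(X > 4)
qbinom(.9,10,.5)                     # 90th percentile of the number of heads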

1. For a disease known to have a postoperative complication frequency of 20%, a surgeon suggests a new procedure. He tests it on 10 patients and there are no complications.

a. Is this study experimental or observational?
b. What is the probability of operating on 10 patients successfully (i.e. without postoperative complications) with the traditional method?
c. To think about: is the new method better than the traditional method?

Now, use help.search() to find the corresponding functions for the normal and Poisson distributions. If you type help.search("normal"), you will find a very large number of functions; you can narrow the search by typing help.search("normal distribution") (also for Poisson).
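
Once you have found them, the normal and Poisson functions follow the same d/p/q/r pattern as the binomial ones; here is a quick sketch with arbitrary illustrative values (not the ones in the questions below):

pnorm(2,lower.tail=FALSE)            # P(Z > 2) for a standard normal
pnorm(50,mean=40,sd=5)               # P(X <= 50) for a normal with mean 40, SD 5
dpois(2,lambda=4)                    # P(X = 2) for a Poisson with rate 4
ppois(5,lambda=4,lower.tail=FALSE)   # P(X > 5) for the same Poisson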

Calculate the probability for each of the following events:

a. A standard normally distributed variable is larger than 3.
b. A normally distributed variable with mean 35 and SD 6 is larger than 42.

For a Poisson random variable X with rate 9, compute the probability that X = 0, that X = 3, and that X > 5.

2. It is well known that 5% of the normal distribution lies outside an interval approximately +/- 2 SDs about the mean. What are the limits corresponding to 10%, 1%, .01%? Show the R commands that you used in addition to the results.

II. Central Limit Theorem (CLT)

The CLT says that the sum/average/proportion of a 'sufficiently large' sample will be like a random draw from the normal distribution. So, let’s check it out!

First, simulate a sample of size 100 from a skewed distribution (here, we will use the chi-square distribution with 4 degrees of freedom; we will learn more about this distribution later). It is good scientific practice when doing computer simulations to either set or save the value of the random 'seed' so that you can replicate exactly the same simulation in the future. You can use your favorite number; here, I use 321:

set.seed(321)
xx <- rchisq(100,4)   # one sample of size 100 from a chi-square(4) distribution

3. Make a histogram to see what the distribution of sample values looks like. Also, get some summary statistics for the sample (remember how to do that?). What is the mean for your sample?

Now do the same thing again: get another random sample and find its mean. Do this a few times; you should end up with a short list of means.
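
A minimal way to do this by hand (each call draws a fresh sample of size 100; run it several times and write down the means):

mean(rchisq(100,4))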

To see what a histogram of the means would look like, we should get a lot of them. So this time, simulate 1000 different samples of size 100; these will be stored in the columns of a matrix (called xx below). You should also get the mean and SD of each of the 1000 samples and store them:

xx <- matrix(rchisq(100000,4),ncol=1000)   # 1000 samples of size 100, one per column
dim(xx)                      # check the dimensions of the matrix
xxmean <- apply(xx,2,mean)   # mean of each column (i.e. of each sample)
xxsd <- apply(xx,2,sd)       # SD of each column

Make sure that you understand what the variable xxmean contains (try looking at the means of a few of the columns of xx and corresponding elements of xxmean: e.g. mean(xx[,1]), xxmean[1]; mean(xx[,2]), xxmean[2], etc.).
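
For example, assuming the matrix xx and the vector xxmean from above are still in your workspace:

mean(xx[,1])   # mean of the first simulated sample
xxmean[1]      # should be the same value
mean(xx[,2])   # mean of the second sample
xxmean[2]      # should be the same value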

Assuming that our sample size of 100 is 'big enough', the histogram of something (what?) should look like the normal distribution.

4. Make a histogram of that thing. Use a QQ plot (see ?qqnorm) to assess normality (you can also put in qqline), and comment on your findings.

The histogram should be centered at the true mean of the chi-square(4) distribution; it turns out that this true mean is 4. What is the mean of the values in your histogram?

5. The SD of samples from a chi-square(4) distribution should be about 2.8 (the true SD is sqrt(8), roughly 2.83).

a. Are your sample SDs around 2.8 (summary statistics can help here)?
b. Now, find the SD of the thing that should have a normal distribution. Is it around 2.8? If not, how is the value related to 2.8? This is the difference between the SD of the observations and the SE of the mean; make sure you understand this distinction.

As above, simulate 10 coin flips of a fair coin (p = .5):

rbinom(1,10,.5)

Now, get (and save) 1000 observations of the number of heads appearing in 10 flips, and look at a histogram of the number of heads (or, equivalently, the proportion). Also make a QQ plot using the normal distribution.

my.flips <- rbinom(1000,10,.5)   # number of heads in each of 1000 sets of 10 flips
hist(my.flips)                   # histogram of the number of heads
qqnorm(my.flips)                 # normal QQ plot

The 'stair-steps' in the QQ plot are due to the discreteness of the variable you are looking at (number of heads). Continue simulating 1000 sets of coin flips with p = .5, increasing the number of flips n from 10 until the normal distribution provides a reasonably good approximation. How large an n did it take?
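
One way to proceed is sketched below; the value n = 50 is only an illustration, so try several values yourself and judge the QQ plot each time:

my.flips <- rbinom(1000,50,.5)   # 1000 sets of n = 50 flips, p = .5
hist(my.flips)
qqnorm(my.flips)
qqline(my.flips)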

Now, simulate 1000 sets of n = 10 coin flips, but this time with a probability of heads p = .2. Is the distribution well-approximated by the normal? Continue simulating 1000 sets of coin flips with p = .2, again increasing n until the normal distribution provides a reasonably good approximation. How large an n did it take this time?

6. Now, simulate 1000 sets of n = 100 coin flips, but this time with a probability of heads p = .01.

a. Is this distribution well-approximated by the normal? Provide evidence for your answer.
b. Continue simulating 1000 sets of coin flips with p = .01, again increasing n until the normal distribution provides a reasonably good approximation. How large an n did it take this time?

Your responses to the questions should be in complete sentences, and can be in either English or French. You can email your report to me (Darlene.Goldstein@epfl.ch) before Monday 14 May 2007.