Statistics and Probability, TP 4

The purpose of this exercise is to introduce you to using R for regression modeling. We will analyze three data sets, looking at a different aspect of modeling in each.

I. Anscombe data

We will start by exploring the classic Anscombe example data set which comes with R. You will soon see why it is 'interesting' (or at least instructive). This analysis will demonstrate the importance of examining scatterplots as part of data analysis.

data(anscombe)
attach(anscombe)

Get some summary information for the four X,Y pairs:

summary(anscombe)

What do you notice about the summary statistics?
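One way to see the answer more directly (a quick side check, not part of the assigned output) is to line up the means and variances of all eight columns:

```r
data(anscombe)
# Column-by-column means and variances of the four X,Y pairs
sapply(anscombe, mean)
sapply(anscombe, var)
```

The four x variables share the same mean and variance, and the four y variables agree to several decimal places as well.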

This data set consists of four pairs of X,Y data: x1 and y1 go together, etc. Find the correlation coefficient for each of the four sets. You can either do this for each pair separately ‘by hand’, e.g.

cor(x1,y1)

etc., or all at once (in the resulting matrix, only the cor(x1,y1), ..., cor(x4,y4) entries are of interest; the cross-pair correlations are just artifacts of how the data are stored):

cor(anscombe)

Anything interesting here? Guess what the four scatterplots will look like....

Done guessing? :) OK, now for the test - go ahead and make the four scatterplots:

par(mfrow=c(2,2),pty="s")
plot(x1,y1)
plot(x2,y2)
plot(x3,y3)
plot(x4,y4)

Were you right? Are you surprised??

We can add regression lines to the plot as follows:

plot(x1,y1)
abline(lm(y1 ~ x1))
plot(x2,y2)
abline(lm(y2 ~ x2))
plot(x3,y3)
abline(lm(y3 ~ x3))
plot(x4,y4)
abline(lm(y4 ~ x4))

1. Is there any way that you can predict what a scatterplot will look like based on univariate and bivariate summary statistics? If so, explain how; if not, provide an example demonstrating why not.
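To drive the point home, you can also check that the four fitted lines are essentially the same line (a side computation, using the data= argument instead of the attached variables):

```r
data(anscombe)
# Intercept and slope for each of the four regressions
for (i in 1:4) {
  fit <- lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
  print(round(coef(fit), 2))
}
```

All four regressions give (to two decimals) the same intercept and slope, even though the scatterplots are wildly different.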

II. Thuesen data

We will now analyze the Thuesen data from the ISwR package. This data set has 24 type 1 diabetic patients observed on 2 variables (both numeric):

library(ISwR)
data(thuesen)

We can start by looking at the data:

thuesen

(You would not want to do this for very large data sets!)

As a reminder: there are several ways to access the individual variables in thuesen.

a. We could use the $ operator:

thuesen$short.velocity
thuesen$blood.glucose

b. Or we could use the subset operator [ to select individual columns:

thuesen[,"blood.glucose"]
thuesen[,2]

c. Or we could attach the data as we did with the hellung data, and just refer to the variables by their names:

attach(thuesen)
short.velocity
blood.glucose

d. This is still a lot of typing! We could assign the variables to ones with shorter names:

sv <- thuesen$short.velocity
bg <- thuesen$blood.glucose

Now get some univariate and bivariate summaries:

summary(thuesen)
sapply(thuesen, mean)
sapply(thuesen, sd)

You might have noticed that one of the observations is 'NA' (i.e. Not Available). In case you did not notice, functions applied to the data (e.g. mean, sd) will signal it by returning NA. (In current versions of R, mean and sd no longer accept a whole data frame, so we apply them column by column with sapply.) You must specify that you want the observations with NA values removed:

sapply(thuesen, mean, na.rm=TRUE)
sapply(thuesen, sd, na.rm=TRUE)
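The way NA propagates is easiest to see on a small toy vector (not part of the thuesen data):

```r
x <- c(1, 2, NA, 4)
mean(x)              # NA: one missing value poisons the whole result
mean(x, na.rm=TRUE)  # 2.333...: computed from the 3 non-missing values
which(is.na(x))      # 3: the position of the missing value
```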

If you type:

cor(thuesen)

you will see that the correlations involving short.velocity are reported as NA: the missing value propagates through the computation. We can learn which argument to set in cor using the help facility:

?cor
cor(thuesen, use="complete.obs")

What is the correlation between blood glucose and shortening velocity? What do you think the scatter plot will look like? Well, we just learned that the only way to know is to look!

We will be interested in predicting shortening velocity from blood glucose, so our Y variable is sv and X variable is bg:

plot(bg, sv)

We can make the plot fancier in a number of ways, for example changing the plotting character (pch), adding different axis labels (xlab, ylab):

plot(bg, sv, pch=16, xlab="Blood Glucose", ylab= "Shortening Velocity")

We use lm to estimate the regression line from the data:

th.lm <- lm(sv ~ bg)

We can add the regression line to the scatter plot:

abline(lm(sv ~ bg))

We can get enough information to write out the regression line from:

th.lm

However, we will see more comprehensive information using summary:

summary(th.lm)

2 a. Make a scatterplot of the data, and include the regression line in the plot.
b. Using the summary information, write out a formula for the regression line. Are any coefficients significant at the 5% level? If so, which?
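If you would rather pull the numbers out of the summary than read them off the printout, coef(summary(...)) returns the coefficient table as a matrix. Here it is illustrated on the Anscombe data (the idea carries over directly to th.lm):

```r
data(anscombe)
fit <- lm(y1 ~ x1, data = anscombe)
ctab <- coef(summary(fit))   # estimates, std. errors, t values, p-values
ctab[, "Estimate"]           # intercept and slope of the fitted line
ctab[, "Pr(>|t|)"]           # the p-values used for the 5% test
```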

To explore the fit of the model, we should examine the residuals more carefully. First, extract the residuals from th.lm:

th.resid <- th.lm$residuals

Because lm dropped the observation with the missing velocity value, there are only 23 residuals; we drop the corresponding element of bg before plotting the residuals against it:

bg.1 <- bg[!is.na(sv)]
plot(bg.1, th.resid)
abline(h=0)
qqnorm(th.resid)
qqline(th.resid)

3 a. Are any points off the line? What does this mean? Include your plot in your report.

th.fv <- fitted.values(th.lm)
plot(th.fv, th.resid)
abline(h=0)

3 b. Does the variance of the residuals seem roughly constant across the range of x? If not, what pattern do you see? Again, include the plot.

To determine whether any points are influential in the fitting, we compute the 'hat' (leverage) values. We want to compare the hat values to 2p/n or 3p/n, where p is the dimension of the model space (2 here: intercept and slope) and n is the number of observations actually used in the fit (23 here, because one case was dropped for missingness). Values bigger than these thresholds may be considered 'influential'.

th.hat <- lm.influence(th.lm)
sort(th.hat$hat) # look at the sorted values
index <- seq_along(th.hat$hat)
plot(index,th.hat$hat,xlab="Index",ylab="Hat value")
abline(h=c(2*2/23,3*2/23),lty=c(2,3))

3 c. Do there appear to be any influential points? Which ones?

We can find measurements corresponding to the highest leverage points as follows:

th.highlev <- identify(index,th.hat$hat)

and then click on the (2) highest points. The observation number should appear next to the points (they should be 4 and 13). When you are finished, press Esc or click the Finish/Stop button (depending on your R interface), then see which points they are:

th.highlev

We can get the glucose and velocity measurements for these points by typing

thuesen[th.highlev,]

Let's see where these points are on the original scatter plot. We'll plot these points as large blue dots:

plot(bg, sv, pch=16, xlab="Blood Glucose", ylab= "Shortening Velocity")
abline(lm(sv ~ bg))
points(bg[th.highlev],sv[th.highlev],pch=16,col="blue",cex=1.5)

3 d. Why do you think these points are the most influential?
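For intuition about why points with extreme x values dominate the fit, recall the fourth Anscombe plot: ten points share one x value and a single point sits far away. That lone point has hat value exactly 1, meaning the fitted line is forced to pass through it. (A side computation: hatvalues() returns the same diagonal as lm.influence(...)$hat.)

```r
data(anscombe)
fit4 <- lm(y4 ~ x4, data = anscombe)
h <- hatvalues(fit4)
sum(h)   # the hat values always sum to p (here 2)
max(h)   # 1: the isolated point at x4 = 19 completely determines the fit there
```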

III. Cystic Fibrosis Data

We examine the cystic fibrosis data from the ISwR package. This data set has 25 rows (patients) measured on 10 variables; it contains lung function data for cystic fibrosis patients aged 7-23. (The help page is available via ?cystfibr.)

Read the data into R:

library(ISwR)
data(cystfibr)
attach(cystfibr)
?cystfibr

and look at univariate summaries:

summary(cystfibr)

We may get a first look at pairwise relationships between variables as follows:

pairs(cystfibr)

Based on the plots, which pairs of variables seem to be the most highly correlated? Check your answer with cor:

cor(cystfibr)

It might be easier to read the correlation matrix after rounding:

round(cor(cystfibr),2)
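If you want R to pick out the largest off-diagonal correlation for you, one approach (illustrated here on R's built-in mtcars data; substitute cystfibr in your own analysis) is to blank out the redundant half of the matrix and then search what remains:

```r
r <- cor(mtcars)
r[lower.tri(r, diag = TRUE)] <- NA              # keep each pair of variables once
big <- which(abs(r) == max(abs(r), na.rm = TRUE), arr.ind = TRUE)
rownames(r)[big[1]]; colnames(r)[big[2]]        # names of the most correlated pair
r[big]                                          # the correlation itself
```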

Explore predicting pemax from sets of variables. For example, you could try (I'm not claiming this model is a good one!):

cf.1 <- lm(pemax ~ age + sex + height + weight)
summary(cf.1)

4. Try out several different models. Basing your answer on the value of R^2, what is your best model? What if you consider only adjusted R^2? How do you explain the difference?
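As background for this question: R^2 can never decrease when a predictor is added to a model, while adjusted R^2 penalizes extra terms, so the two can rank models differently. A quick illustration on the built-in mtcars data (the same comparison is what you are asked to do with cystfibr):

```r
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt + qsec + drat + gear + carb, data = mtcars)
summary(small)$r.squared; summary(big)$r.squared          # bigger model always at least ties
summary(small)$adj.r.squared; summary(big)$adj.r.squared  # not necessarily so here
```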

[NOTE: If we were going to do careful modeling here, we would need to take into consideration other aspects as well; these are beyond the scope of this course.]

Your responses to the questions should be in complete sentences, and can be in either English or French. You can email your report to me (Darlene.Goldstein@epfl.ch) before Monday 2 July 2007.