Classification
The purpose of this practical (TP) is to carry out some classification analyses using
the R/Bioconductor package CMA.
This package also requires the packages Biobase, class and e1071.
As usual, make sure that you read the help
for any new functions that you use.
We will work through parts of the package vignette.
The vignette contains more complete information.
You can access it in R with the function openVignette().
We use the famous leukemia dataset of Golub et al. as a data example.
The sample consists of 38 observations in total, of which 27 belong to class 0 (acute
lymphoblastic leukemia) and 11 to class 1 (acute myeloid leukemia).
Workflow for a single classifier
0. Getting started.
First load in the data, and set up the gene expression matrix (X) and labels (Y).
library(CMA)
data(golub)
?golub
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
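Before moving on, it is worth a quick check that these objects match the description above (38 patients, 27 of class 0 and 11 of class 1):
dim(golubX)    # number of patients (rows) by number of genes (columns)
table(golubY)  # should show the 27/11 split between classes 0 and 1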
1. Generate learning sets.
We generate several sets of learning samples using different splitting rules
(each rule has its own advantages and disadvantages, so it can be instructive to compare them).
Learning samples are generated by the function GenerateLearningsets,
which returns an object of class learningsets.
loodat <- GenerateLearningsets(y=golubY, method ="LOOCV")
class(loodat)
getSlots(class(loodat))
show(loodat)
When carrying out any kind of simulation, it is a good idea to set the seed
so that the results are reproducible. To carry out five-fold cross-validation:
set.seed(321)
fiveCVdat <- GenerateLearningsets(y=golubY, method = "CV",
fold = 5, strat = TRUE)
Stratified learning sets are generated by setting the argument strat = TRUE.
Now proceed analogously for Monte-Carlo cross-validation and bootstrap.
set.seed(456)
MCCVdat <- GenerateLearningsets(y=golubY, method = "MCCV", niter = 3,
ntrain = floor(2/3*length(golubY)), strat = TRUE)
set.seed(651)
bootdat <- GenerateLearningsets(y=golubY, method = "bootstrap",
niter = 3, strat = TRUE)
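Whatever the scheme, the learning sets are stored row-wise in the learnmatrix slot of the learningsets object. As a quick sanity check on the stratification, you can look at the class proportions within each learning set. A minimal sketch (the index filtering is a precaution in case CMA pads rows with zeros or NAs when fold sizes differ):
# proportion of class-1 observations in each learning set of the five-fold scheme
apply(fiveCVdat@learnmatrix, 1, function(ind) {
  ind <- ind[!is.na(ind) & ind > 0]  # drop possible padding
  round(mean(golubY[ind]), 2)
})
With stratification, these proportions should all be close to 11/38.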
2. Variable selection.
In this example, we will carry out classification using the Support Vector Machine with linear kernel.
Variable selection is not strictly necessary, but it has empirically been
shown that the performance of the SVM method can be significantly improved when noise features are removed.
For simplicity, we choose the nonparametric Wilcoxon test to rank the variables, separately for each learning sample.
Other choices are also possible (see the vignette).
varsel_loo <- GeneSelection(X = golubX, y=golubY,
learningsets = loodat, method = "wilcox.test")
varsel_fiveCV <- GeneSelection(X = golubX, y=golubY,
learningsets = fiveCVdat, method = "wilcox.test")
varsel_MCCV <- GeneSelection(X = golubX, y=golubY,
learningsets = MCCVdat, method = "wilcox.test")
varsel_boot <- GeneSelection(X = golubX, y=golubY,
learningsets = bootdat, method = "wilcox.test")
Here we have a look at varsel_fiveCV (you can look at the others as well).
The toplist method provides easy access to the top-ranked variables.
Here, we count, for each gene appearing in a top-10 list, the number of learning samples in which it occurs.
With these seeds, you should find that only one variable is in the top 10 for all five learning sets.
If you use a different seed, you may get different results.
show(varsel_fiveCV)
toplist(varsel_fiveCV, iter=1)
seliter <- numeric()
for(i in 1:5)
seliter <- c(seliter, toplist(varsel_fiveCV, iter = i,
top = 10, show = FALSE)$index)
sort(table(seliter), dec = TRUE)
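The indices in seliter are column indices of golubX, so if the expression matrix carries gene identifiers as column names, the most stably selected variables can be mapped back to gene names. A minimal sketch, assuming colnames(golubX) is non-empty:
selcounts <- sort(table(seliter), decreasing = TRUE)
# names of the five most frequently selected genes
# (depends on how the golub columns are labelled)
colnames(golubX)[as.numeric(names(selcounts))[1:5]]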
3. Hyperparameter tuning.
To select a value for the SVM tuning constant C (cost),
we will use the top 100 ranked genes.
Hyperparameter tuning can be very time-intensive, so for this example we
only use five-fold CV. The resulting best value for C will then
be used for the other CV procedures as well.
In this example, we consider five candidate values: 0.1, 1, 10, 100, 200.
Due to the nested cross-validation procedure, the SVM is
trained a total of 3 x 5 x 5 = 75 times: three is the number of
inner cross-validation folds, the first five is the number of candidate values, and the
second five is the number of outer cross-validation folds.
We write CMA:::tune to make sure that the tune function used is the one from CMA
(and not the one from some other package, such as e1071).
set.seed(351)
tuningstep <- CMA:::tune(X = golubX, y=golubY, learningsets = fiveCVdat,
genesel = varsel_fiveCV, nbgene = 100, classifier = svmCMA,
grids = list(cost = c(0.1, 1, 10, 100, 200)))
show(tuningstep)
unlist(best(tuningstep))
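Note that best returns one selected cost per outer fold, so the five values need not agree. One simple way (our choice here, not something CMA prescribes) to settle on a single value for the later steps is a majority vote:
best.costs <- unlist(best(tuningstep))
# most frequent best cost across the five folds; ties broken by table order
as.numeric(names(sort(table(best.costs), decreasing = TRUE)))[1]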
To visualize the results (not very interesting in this example, as it turns out):
par(mfrow = c(2,2))
for(i in 1:4) plot(tuningstep, iter = i, main = paste("iteration", i))
4. Class prediction.
You can use the tuned parameter value that you found above,
or the value 100 that is used in the vignette.
In order to do some interesting plotting later, we will use the SVM version that gives class probabilities
(probability = TRUE).
class_loo <- classification(X = golubX, y=golubY, learningsets = loodat,
genesel = varsel_loo, nbgene = 100, classifier = svmCMA,
cost = 100, probability=TRUE)
class_fiveCV <- classification(X = golubX, y=golubY, learningsets = fiveCVdat,
genesel = varsel_fiveCV, nbgene = 100, classifier = svmCMA,
cost = 100, probability=TRUE)
class_MCCV <- classification(X = golubX, y=golubY, learningsets = MCCVdat,
genesel = varsel_MCCV, nbgene = 100, classifier = svmCMA,
cost = 100, probability=TRUE)
class_boot <- classification(X = golubX, y=golubY, learningsets = bootdat,
genesel = varsel_boot, nbgene = 100, classifier = svmCMA,
cost = 100, probability=TRUE)
The result of classification is a list in which each element is an object of class
cloutput, one per learning sample.
For visualization, we use the join function to combine the single list elements into 'big' objects.
We first put the results from the four splitting schemes into one list and then use lapply():
resultlist <- list(class_loo, class_fiveCV, class_MCCV, class_boot)
result <- lapply(resultlist, join)
The plot,cloutput-method can be used
to visualize the 'voting'.
schemes <- c("loo CV", "five-fold CV", "MCCV", "bootstrap")
par(mfrow = c(2,2))
for(i in seq(along = result)) plot(result[[i]], main = schemes[i])
ftable applied to objects of class cloutput yields confusion matrices.
invisible(lapply(result, ftable))
For simple ROC curves:
par(mfrow = c(2,2))
for(i in seq(along = result)) roc(result[[i]])
We can now join again to aggregate over the different splitting rules:
totalresult <- join(result)
ftable(totalresult)
Confusion matrices implicitly quantify performance via misclassification.
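To make that link explicit, the overall error rate can be read off directly from the joined object. A minimal sketch, assuming (check getSlots("cloutput")) that the true labels sit in slot y and the predictions in slot yhat:
# overall misclassification rate aggregated over all splitting rules
mean(totalresult@y != totalresult@yhat)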
For more advanced performance evaluation, one can use evaluation.
Note that the input has to be a list (and not an object of class cloutput).
For the Monte-Carlo cross-validation scheme, we have:
av_MCCV <- evaluation(class_MCCV, measure = "average probability")
show(av_MCCV)
boxplot(av_MCCV)
summary(av_MCCV)
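The measure argument accepts other criteria as well (see ?evaluation for the full list); for example, the iterationwise misclassification rates:
mc_MCCV <- evaluation(class_MCCV, measure = "misclassification")
show(mc_MCCV)
summary(mc_MCCV)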
By default, the evaluation scheme is iterationwise, but it can also be done observation-wise:
av_obs_MCCV <- evaluation(class_MCCV, measure = "average probability",
scheme = "obs")
show(av_obs_MCCV)
To find out which observations are misclassified very often:
obsinfo(av_obs_MCCV, threshold = 0.6)
Classifier comparison
If you have time, you can carry out a comparison similar to the one shown during the lecture.
Here, instead of doing 10 iterations of five-fold CV
(niter=10), we just do one.
In order to make some of the comparison plots that we saw above,
we need class probabilities from each classifier.
Not all classifiers produce class probabilities;
in particular, the standard knn and SVM methods do not.
We will therefore use the versions that do:
svmCMA with probability = TRUE,
and pknnCMA in place of knn.
set.seed(27611)
# Generate learning sets:
golub.cv <- GenerateLearningsets(y=golubY, method = "CV",
fold = 5, strat = TRUE)
(golub.genesel <- GeneSelection(X=golubX, y=golubY, learningsets = golub.cv,
method = "t.test", scheme = "one-vs-all"))
# Carry out classification:
golub.lda <- classification(X=golubX, y=golubY, learningsets=golub.cv,
classifier=ldaCMA, genesel=golub.genesel, nbgene = 10)
golub.qda <- classification(X=golubX, y=golubY, learningsets=golub.cv,
classifier=qdaCMA, genesel=golub.genesel, nbgene = 1)
golub.dda <- classification(X=golubX, y=golubY, learningsets=golub.cv,
classifier=dldaCMA, genesel=golub.genesel, nbgene = 10)
golub.pknn <- classification(X = golubX, y=golubY, learningsets=golub.cv,
classifier=pknnCMA, genesel=golub.genesel, nbgene = 10)
golub.svm <- classification(X=golubX, y=golubY, learningsets=golub.cv,
classifier=svmCMA, genesel=golub.genesel, nbgene = 10, probability=TRUE)
golub.rf <- classification(X=golubX, y=golubY, learningsets=golub.cv,
classifier=rfCMA)
# Compare results:
par(mfrow=c(3,1))
golub.all <- list(golub.dda, golub.lda, golub.qda, golub.pknn, golub.svm, golub.rf)
golub.comp1 <- compare(golub.all, plot = TRUE,
measure = c("misclassification","brier score","auc"))
print(golub.comp1)
# Confusion matrix, plots:
# golub.all already holds the six classification results
result <- lapply(golub.all, join)
invisible(lapply(result, ftable))
schemes <- c("DLDA", "LDA", "QDA", "pknn", "svm", "rf")
par(mfrow=c(3,2))
for(i in seq(along = result)) plot(result[[i]], main = schemes[i])
for(i in seq(along = result)) roc(result[[i]],main=schemes[i])