An Introduction to Analysis of Multiple Gene Expression Datasets

Laboratory Practical

Statistical Analysis Applied to Genome and Proteome Analysis

EMBnet Course, Lausanne, 5 February 2008

There are two parts:

Exploring GEO database
Statistical meta-analysis

The mandatory exercises are marked by boldface and numbered. They are located all over the page. The best way to proceed is to write the report simultaneously as you follow the page. Questions without number is not mandatory, but feel free to answer them in the report for bonus points.

Exploring GEO Database

The front page can be found at http://www.ncbi.nlm.nih.gov/geo/ (please open in a new window).

This database is administered by NCBI (just like PubMed, GenBank, Entrez, etc.).

Types of GEO records

These are the building blocks of GEO database:

Platforms (accession code start with GPL followed by numbers)
Description about specific microarray platforms; e.g. Affymetrix chip HG-U133A is GPL96
Try to enter GPL96 in the "GEO accession" field in GEO navigation; scroll down and skim the page just to have a general understanding.
The main data is the "Data table". Each row is a probe, the other columns are various associated information
Exercise 1 What is the gene symbol for the probeset 1007_s_at? Where it is located in the biological cell? (Hint: there are columns corresponding to gene ontology annotation)
Note: this is one of the most fully-annotated GPL files. For comparison, try to look at GPL1390. (You don't need to go to the front page; there is a "GEO accession" field on top of every GEO page.)
Samples (GSM*)
It countains data from individual sample (tumor, cell, condition, etc.). It is a list of expression measures for all probes in the array.
Check GSM36777. The expression values are in the "Data table". Each row of the table corresponds to a probeset. The column "ID_REF" refers to the column "ID" in the GPL96 platform file. This indirect link allows the probe annotation in GPL file to be updated independently of the expression data (which is much more stable).
Series (GSE*)
A series is a dataset from a study (typically from a journal article; link to pubmed can be found on the record, if already published).
Example: GSE4922
This is the most important type of records. Journal article needs to only mention the GSE* accession.
It groups together all GSM* records from the study (scroll down to find their listing).
It associates the series with the relevant GPL record(s). [Yes, there can be more than one platforms in one study! The actual GPL* applicable to a GSM* file is specified inside the GSM* file.]
It may contain clinical data table or other supplementary files (such as raw array data before normalization). This series has a large clinical table shown partially near the bottom of the page. The full table can be downloaded as a text file.
Not all GSE* contain clinical data; even if the data is available in GEO. Try GSE2740. Can you guess where they might be? (Hint: clinical data are information about a sample)

There are additional types of records GEO "datasets" (GDS) and profiles [Optional: if you're interested in the details, look at "Overview" from the right panel of GEO front page]. These are derived information resulting from automatic analysis of the GSE*, GPL* and GSM* records. However, not all series (GSE*) have been analyzed. The analysis types are also limited.

Browsing the GEO table of contents (list of GSE records)

From the GEO front page, click "Browse > GEO accessions > Series"

This is the main tool for surveying what datasets are available.

The tables are sorted according to the release date (the latest listed first).

The ordering can be changed by clicking the column headers (e.g. on the words "Accession", "Title", etc.)

Exercise 2 Which GSE record has the largest sample size? Are these gene expression arrays? If not, which gene expression array series have the largest sample size? Which species is being studied?

There is no simple way to survey the datasets (series) using keyword search in GEO. However, most of time we just enter known GSE* accession number (typically mentioned in the "Material and Methods" section of a journal article). Serious meta-analyzers have automated script to download the whole table of contents (or even mirror the GSE* records) and search them locally.

How to download data automatically from GEO?

We will not cover this in details here. There are various ways to do it depending on your analysis system. Typically, we want to store the GSM, GSE and GPL records as text-only. At the HTML display of any record, you can select a SOFT format which will then gives you an option to save the text-only file, which is more amenable to automatic extraction of desired information by scripting. A specifically customized scripts might be needed for each GSE, because of the flexibility formatting that the submitters can choose.

Additional notes

The European Binformatics Institute in Cambridge, UK organizes a similar public microarray database called ArrayExpress (Google it!). There is some differences in philosophy and data organization. The content is not as extensive (it is easier to upload to GEO; less requirements). The browser allows keyword search. Some datasets overlap (can be found in both GEO and ArrayExpress, but some are unique to each).

Statistical Meta-Analysis

This part requires working with R command line.

Basic meta-analysis: Fisher's method to combine p-values

p-values from multiple studies can be combined by:

X2 = -2*(log (p₁) + log (p₂) + ... )

The resulting quantity is chi-square distributed with 2k degrees of freedom, where k is the number of studies. In simple words, the combined p-value is (in R notation):

pval = pchisq( X2, 2*k, lower.tail=FALSE)

Note: replace italics in R code by an actual value.

(For those interested, look at this Wikipedia article about Fisher's method for details)

Recall that in the lecture notes (PDF here), in example I there are tests of significance in three independent datasets. The p-values from the figures are reproduced here:

dataset    p-value
Basel      0.002
Rotterdam  0.019
Amsterdam  0.015

The individual study result is most significant (at nominal level) in Basel dataset where the relationship is discovered. However, when Bonferroni multiple testing is applied (0.002 × 59 = 0.118), it is not significant at 0.05 level.

Exercise 3

What is the p-value of seeing all three results above? (use the Fisher's formula). How is the combined p-value compared to the individual ones? Is it just the average?

What is the p-value after Bonferroni multiple testing correction (for selecting from 59 genes)? Is it still significant had the authors looked at all three datasets simultaneously (instead of using the external data only for validation as claimed) and try all possible 59 genes on them?

Meta-analysis for genome-wide test statistics

Because downloading from GEO and doing differential expression analysis on each dataset is time-consuming, we will use pre-computed results. You already have some idea about the procedure from the GEO exploration above and from the previous exercises in the course.

The summary data are from two datasets NKI and UPP. Both are breast cancer data (with sample size 337 and 249, respectively).

They can be obtained from here.

Download all files with ending .txt; save them in location accessible from R command line.

Load the data into R:

NKI <- read.delim("NKIsummary.txt")
UPP <- read.delim("UPPsummary.txt")

To see what is inside:

NKI[1:10,]
UPP[1:10,]

(1:10 is a range of row numbers; you can change it to any range you like.)

The meaning of the columns

geneid       Entrez gene identifier
NKI.Z        Z-score for each gene (for association with metastasis)
UPP.Z        as above for UPP dataset

Each Z-score (also known as Z-test) correspond to a significance test for a coefficient in Cox regression model (positive score means high gene expression corresponds to poor survival or faster rate of metastasis). It can be converted to p-values and vice versa. The nice thing about it is that it is signed according to over- (positive) or under-expression (negative). The absolute value of a Z-score corresponds to a two-sided p-value through the cumulative distribution function of standard normal distribution (more about this below).

The rows (identified by geneid) do not match between the two datasets because they are from different platforms. The length of the tables are not even the same. To check their lengths:

length(NKI[,1])
length(UPP[,1])

Here we will use the (Entrez) geneid to join the datasets.

The geneid is unique and therefore can be used as a key. This is because the probes have been filtered such that

Only probe that can be mapped to geneid are used
When many probes can be mapped to the same geneid, only the most variable (the one with the highest standard deviation across arrays) is used.

R has a command called merge() which allows two dataframes (tables) to be joined based on a common column. This is similar to natural join in relational database. By default, rows without match are discarded and columns with the same name are used as the join key.

comb <- merge( NKI, UPP)

Check the combination

comb[1:10,]

Now, the results from the two datasets have been matched. [Note: merge() use columns with the same in both datasets as the join keys by default. It is possible to specify arbitrary columns as join keys.]

Exercise 4 How many genes are found in both datasets?

Exercise 5 It is nice if we know what the genes are. Load the gene annotation file (geneid_symbol.txt) and create a new data set called comb2 with corresponding gene symbols joined.

Exploring the joint distribution of the summary statistics from different datasets

Now that we have the data matched, we can proceed with the statistical analysis.

First, it's convenient to directly refer to the columns of comb2 as free standing variables.

attach(comb2)

For example,

NKI.Z[1:10]

gives the first ten values of NKI.Z. Without attach(), we have to refer to refer to this variable as comb2$NKI.Z.

To compute the two-sided p-values:

NKI.p <- 2*pnorm(abs(NKI.Z),lower.tail=FALSE)
UPP.p <- 2*pnorm(abs(UPP.Z),lower.tail=FALSE)

It is useful to show the ranks in individual dataset according to the p-values

NKI.rank <- rank(NKI.p)
UPP.rank <- rank(UPP.p)

Let us put the variables into a frame for easy display of sorted rows

comb3 <- data.frame(NKI.Z,NKI.p,NKI.rank,UPP.Z,UPP.p,UPP.rank,symbol)

To show them sorted by the p-values:

comb3[order(NKI.p),][1:20,]

and for the other dataset:

comb3[order(UPP.p),][1:20,]

It seems that there is not much agreement in the rank of individual dataset.

To have a feeling of the agreement, we can make a scatter plot of the Z-score.

plot(NKI.Z,UPP.Z)

Do we see any trend? Note that deviation towards upper-right direction corresponds to over-expression (and the opposite for under-expression). The closer the point to the main diagonal, the more consistent the Z-scores of the two datasets. Points in upper-left or lower-right far from the origin are inconsistent (the signs are reversed).

To check the correlation:

cor(NKI.Z,UPP.Z)

To test if this is not just what encountered by chance (i.e., when the scores are randomly paired between the two datasets), just use the linear model:

summary(lm(NKI.Z ~ UPP.Z))

Exercise 5a What is the correlation between NKI.Z and UPP.Z? What is the p-value of the test of linear correlation?

You may ask why don't we look at scatter plot of ranks or p-values. Try them.

Note that it is the order of magnitude of p-value that is more intuitively meaningful.

A much more useful quantity is -log(p). So,

plot(-log(NKI.p),-log(UPP.p))

Here, we see also many points deviating somewhat unusually towards the upper-right direction. The Fisher's method is in fact testing this. [Note that the sum of the -log(p) in this plot correspond to locations in lines going from lower-left to upper-right.]

The main disadvantage of this plot (as opposed to comparison of Z-scores) is that information about over- or under-expression is lost if the p-values are two sided. If they are one-sided, you need separate plots for over- and under-expression. In the Z-scores plot, both are accomodated simultaneously.

Intersection of significant genes

In the Venn diagram approach, decisions about "significant" vs "non-significant" are made separately for each dataset. The intersection is then examined. First, we need some multiple testing adjustments. We'll use Bonferroni and FDR (Benjamini-Hochberg)

library(stats)
NKI.bonf <- p.adjust(NKI.p,method="bonferroni")
UPP.bonf <- p.adjust(UPP.p,method="bonferroni")
NKI.fdr <- p.adjust(NKI.p,method="fdr")
UPP.fdr <- p.adjust(UPP.p,method="fdr")

Create significance lists. For example, say, with cutoff value of 0.05

NKI.sigbonf <- ifelse(NKI.bonf < 0.05, 1, 0)
UPP.sigbonf <- ifelse(UPP.bonf < 0.05, 1, 0)
NKI.sigfdr <- ifelse(NKI.fdr < 0.05, 1, 0)
UPP.sigfdr <- ifelse(UPP.fdr < 0.05, 1, 0)

The above simply assign 1 or 0 according to the significance.

To cross tabulate the combination:

table(NKI.sigbonf,UPP.sigbonf)

How many genes are in the intersection? How many disagree? Bonferroni criteria produces almost no gene in UPP.

Try the same tabulation for FDR criteria. How does it differ? Are the genes in the intersection just by chance?

Fisher's exact test are often used to test this kind of tables (the so called "gene set testing"). The interpretation and appropriateness is still debated (see Goeman and Buehlmann 2007 Bioinformatics 23:980). However, it is a common method, so here it is:

fisher.test(table(NKI.sigfdr,UPP.sigfdr))

This is a sign of "enrichment" in the intersection (we observe genes more frequently than expected).

To see which genes are in the intersection, we construct a new data frame:

sigdf <- data.frame(symbol,NKI.sigfdr,UPP.sigfdr)

and then select those with the desired combination of indicators:

subset(sigdf,NKI.sigfdr == 1 & UPP.sigfdr == 1)

Meta-analytical approach

We will use the inverse normal method (see lecture) to combine Z-scores. This is quite simple for just two datasets.

total.Z <- (NKI.Z + UPP.Z)/sqrt(2)

The nominal two-sided p-value can be easily obtained:

total.p <- 2*pnorm(abs(total.Z),lower.tail=FALSE)

Multiple testing adjustment:

total.fdr <- p.adjust(total.p,method="fdr")
total.bonf <- p.adjust(total.p,method="bonferroni")

Exercise 6 How many genes are significant under Bonferroni criteria of p < 0.05? How many under FDR < 0.05?

Bonferroni correction is known as a very conservative method (as we've seen in analysis of individual studies). However, we see here that when two datasets are combined using the above method, many genes can be found. Recall that only one gene is found in UPP using this criteria. The important point here is that datasets that seems to be "weak" individually, may reinforce one another when combined.

[Optional task: If you're interested and have time, check the EMC dataset (which you have downloaded but not yet enter into R). Think about how you can examine three datasets. ]

To see what the meta-analysis procedure does in ranking the genes, we can examine the table:

final <- data.frame(symbol,NKI.p,UPP.p,total.p)
final[order(total.p),][1:20,]

We see that ranking based on combined p-value also select genes that are reasonably strong in both datasets.

Top-ranking genes based on one dataset (particularly UPP which is the weaker one), may select genes with not so good overall p-value:

final[order(UPP.p),][1:20,]