©2008 Pratyaksha Wirapati, Swiss Institute of Bioinformatics
This database is administered by NCBI (just like PubMed, GenBank, Entrez, etc.).
Description about specific microarray platforms; e.g. Affymetrix chip HG-U133A is GPL96
Try to enter GPL96 in the "GEO accession" field in GEO navigation; scroll down and skim the page just to have a general understanding.
The main data is the "Data table". Each row is a probe, the other columns are various associated information
Exercise 1 What is the gene symbol for the probeset 1007_s_at? Where it is located in the biological cell? (Hint: there are columns corresponding to gene ontology annotation)
Note: this is one of the most fully-annotated GPL files. For comparison, try to look at GPL1390. (You don't need to go to the front page; there is a "GEO accession" field on top of every GEO page.)
It countains data from individual sample (tumor, cell, condition, etc.). It is a list of expression measures for all probes in the array.
Check GSM36777. The expression values are in the "Data table". Each row of the table corresponds to a probeset. The column "ID_REF" refers to the column "ID" in the GPL96 platform file. This indirect link allows the probe annotation in GPL file to be updated independently of the expression data (which is much more stable).
A series is a dataset from a study (typically from a journal article; link to pubmed can be found on the record, if already published).
Example: GSE4922
This is the most important type of records. Journal article needs to only mention the GSE* accession.
It groups together all GSM* records from the study (scroll down to find their listing).
It associates the series with the relevant GPL record(s). [Yes, there can be more than one platforms in one study! The actual GPL* applicable to a GSM* file is specified inside the GSM* file.]
It may contain clinical data table or other supplementary files (such as raw array data before normalization). This series has a large clinical table shown partially near the bottom of the page. The full table can be downloaded as a text file.
Not all GSE* contain clinical data; even if the data is available in GEO. Try GSE2740. Can you guess where they might be? (Hint: clinical data are information about a sample)
This is the main tool for surveying what datasets are available.
The tables are sorted according to the release date (the latest listed first).
The ordering can be changed by clicking the column headers (e.g. on the words "Accession", "Title", etc.)
Exercise 2 Which GSE record has the largest sample size? Are these gene expression arrays? If not, which gene expression array series have the largest sample size? Which species is being studied?
There is no simple way to survey the datasets (series) using keyword search in GEO. However, most of time we just enter known GSE* accession number (typically mentioned in the "Material and Methods" section of a journal article). Serious meta-analyzers have automated script to download the whole table of contents (or even mirror the GSE* records) and search them locally.
p-values from multiple studies can be combined by:
X2 = -2*(log (p1) + log (p2) + ... )
The resulting quantity is chi-square distributed with 2k degrees of freedom, where k is the number of studies. In simple words, the combined p-value is (in R notation):
pval = pchisq( X2, 2*k, lower.tail=FALSE)
Note: replace italics in R code by an actual value.
(For those interested, look at this Wikipedia article about Fisher's method for details)
Recall that in the lecture notes (PDF here), in example I there are tests of significance in three independent datasets. The p-values from the figures are reproduced here:
dataset p-value Basel 0.002 Rotterdam 0.019 Amsterdam 0.015The individual study result is most significant (at nominal level) in Basel dataset where the relationship is discovered. However, when Bonferroni multiple testing is applied (0.002 × 59 = 0.118), it is not significant at 0.05 level.
Exercise 3
What is the p-value of seeing all three results above? (use the Fisher's formula). How is the combined p-value compared to the individual ones? Is it just the average?
What is the p-value after Bonferroni multiple testing correction (for selecting from 59 genes)? Is it still significant had the authors looked at all three datasets simultaneously (instead of using the external data only for validation as claimed) and try all possible 59 genes on them?
The summary data are from two datasets NKI and UPP. Both are breast cancer data (with sample size 337 and 249, respectively).
They can be obtained from here.
Download all files with ending .txt; save them in location accessible from R command line.
Load the data into R:
NKI <- read.delim("NKIsummary.txt") UPP <- read.delim("UPPsummary.txt")
To see what is inside:
NKI[1:10,] UPP[1:10,](1:10 is a range of row numbers; you can change it to any range you like.)
The meaning of the columns
geneid Entrez gene identifier NKI.Z Z-score for each gene (for association with metastasis) UPP.Z as above for UPP datasetEach Z-score (also known as Z-test) correspond to a significance test for a coefficient in Cox regression model (positive score means high gene expression corresponds to poor survival or faster rate of metastasis). It can be converted to p-values and vice versa. The nice thing about it is that it is signed according to over- (positive) or under-expression (negative). The absolute value of a Z-score corresponds to a two-sided p-value through the cumulative distribution function of standard normal distribution (more about this below).
The rows (identified by geneid) do not match between the two datasets because they are from different platforms. The length of the tables are not even the same. To check their lengths:
length(NKI[,1]) length(UPP[,1])
Here we will use the (Entrez) geneid to join the datasets.
The geneid is unique and therefore can be used as a key. This is because the probes have been filtered such that
R has a command called merge() which allows two dataframes (tables) to be joined based on a common column. This is similar to natural join in relational database. By default, rows without match are discarded and columns with the same name are used as the join key.
comb <- merge( NKI, UPP)Check the combination
comb[1:10,]Now, the results from the two datasets have been matched. [Note: merge() use columns with the same in both datasets as the join keys by default. It is possible to specify arbitrary columns as join keys.]
Exercise 4 How many genes are found in both datasets?
Exercise 5 It is nice if we know what the genes are. Load the gene annotation file (geneid_symbol.txt) and create a new data set called comb2 with corresponding gene symbols joined.
Now that we have the data matched, we can proceed with the statistical analysis.
First, it's convenient to directly refer to the columns of comb2 as free standing variables.
attach(comb2)For example,
NKI.Z[1:10]gives the first ten values of NKI.Z. Without attach(), we have to refer to refer to this variable as comb2$NKI.Z.
To compute the two-sided p-values:
NKI.p <- 2*pnorm(abs(NKI.Z),lower.tail=FALSE) UPP.p <- 2*pnorm(abs(UPP.Z),lower.tail=FALSE)It is useful to show the ranks in individual dataset according to the p-values
NKI.rank <- rank(NKI.p) UPP.rank <- rank(UPP.p)Let us put the variables into a frame for easy display of sorted rows
comb3 <- data.frame(NKI.Z,NKI.p,NKI.rank,UPP.Z,UPP.p,UPP.rank,symbol)To show them sorted by the p-values:
comb3[order(NKI.p),][1:20,]and for the other dataset:
comb3[order(UPP.p),][1:20,]It seems that there is not much agreement in the rank of individual dataset.
To have a feeling of the agreement, we can make a scatter plot of the Z-score.
plot(NKI.Z,UPP.Z)Do we see any trend? Note that deviation towards upper-right direction corresponds to over-expression (and the opposite for under-expression). The closer the point to the main diagonal, the more consistent the Z-scores of the two datasets. Points in upper-left or lower-right far from the origin are inconsistent (the signs are reversed).
To check the correlation:
cor(NKI.Z,UPP.Z)To test if this is not just what encountered by chance (i.e., when the scores are randomly paired between the two datasets), just use the linear model:
summary(lm(NKI.Z ~ UPP.Z))Exercise 5a What is the correlation between NKI.Z and UPP.Z? What is the p-value of the test of linear correlation?
You may ask why don't we look at scatter plot of ranks or p-values. Try them.
Note that it is the order of magnitude of p-value that is more intuitively meaningful.
A much more useful quantity is -log(p). So,
plot(-log(NKI.p),-log(UPP.p))Here, we see also many points deviating somewhat unusually towards the upper-right direction. The Fisher's method is in fact testing this. [Note that the sum of the -log(p) in this plot correspond to locations in lines going from lower-left to upper-right.]
The main disadvantage of this plot (as opposed to comparison of Z-scores) is that information about over- or under-expression is lost if the p-values are two sided. If they are one-sided, you need separate plots for over- and under-expression. In the Z-scores plot, both are accomodated simultaneously.
library(stats) NKI.bonf <- p.adjust(NKI.p,method="bonferroni") UPP.bonf <- p.adjust(UPP.p,method="bonferroni") NKI.fdr <- p.adjust(NKI.p,method="fdr") UPP.fdr <- p.adjust(UPP.p,method="fdr")
Create significance lists. For example, say, with cutoff value of 0.05
NKI.sigbonf <- ifelse(NKI.bonf < 0.05, 1, 0) UPP.sigbonf <- ifelse(UPP.bonf < 0.05, 1, 0) NKI.sigfdr <- ifelse(NKI.fdr < 0.05, 1, 0) UPP.sigfdr <- ifelse(UPP.fdr < 0.05, 1, 0)The above simply assign 1 or 0 according to the significance.
To cross tabulate the combination:
table(NKI.sigbonf,UPP.sigbonf)How many genes are in the intersection? How many disagree? Bonferroni criteria produces almost no gene in UPP.
Try the same tabulation for FDR criteria. How does it differ? Are the genes in the intersection just by chance?
Fisher's exact test are often used to test this kind of tables (the so called "gene set testing"). The interpretation and appropriateness is still debated (see Goeman and Buehlmann 2007 Bioinformatics 23:980). However, it is a common method, so here it is:
fisher.test(table(NKI.sigfdr,UPP.sigfdr))This is a sign of "enrichment" in the intersection (we observe genes more frequently than expected).
To see which genes are in the intersection, we construct a new data frame:
sigdf <- data.frame(symbol,NKI.sigfdr,UPP.sigfdr)and then select those with the desired combination of indicators:
subset(sigdf,NKI.sigfdr == 1 & UPP.sigfdr == 1)
total.Z <- (NKI.Z + UPP.Z)/sqrt(2)The nominal two-sided p-value can be easily obtained:
total.p <- 2*pnorm(abs(total.Z),lower.tail=FALSE)Multiple testing adjustment:
total.fdr <- p.adjust(total.p,method="fdr") total.bonf <- p.adjust(total.p,method="bonferroni")Exercise 6 How many genes are significant under Bonferroni criteria of p < 0.05? How many under FDR < 0.05?
Bonferroni correction is known as a very conservative method (as we've seen in analysis of individual studies). However, we see here that when two datasets are combined using the above method, many genes can be found. Recall that only one gene is found in UPP using this criteria. The important point here is that datasets that seems to be "weak" individually, may reinforce one another when combined.
[Optional task: If you're interested and have time, check the EMC dataset (which you have downloaded but not yet enter into R). Think about how you can examine three datasets. ]
To see what the meta-analysis procedure does in ranking the genes, we can examine the table:
final <- data.frame(symbol,NKI.p,UPP.p,total.p) final[order(total.p),][1:20,]We see that ranking based on combined p-value also select genes that are reasonably strong in both datasets.
Top-ranking genes based on one dataset (particularly UPP which is the weaker one), may select genes with not so good overall p-value:
final[order(UPP.p),][1:20,]