Computational and statistical issues in co-analyzing gene expression data from multiple studies and platforms

Expression data from cancer studies have been accumulating rapidly in public databases; in breast cancer alone, data from more than 3000 arrays are available. Co-analyzing these data sets promises higher statistical power and more reproducible conclusions. The most commonly cited obstacle is the lack of comparability between expression measures from different studies and platforms. In classical meta-analysis, by contrast, the main issue in multi-cohort analyses is Simpson's paradox, which precludes pooling and direct comparison of data across cohorts even when the measured variables are comparable. Avoiding the paradox requires stratified analysis, in which comparisons are made within cohorts and only the within-cohort results are combined. Because expression values are then never compared directly across cohorts, the stratification requirement also solves the problem of expression measure comparability. However, many standard microarray analyses, such as clustering, significance analysis, and prediction analysis, need to be redesigned and reimplemented to incorporate stratified analysis.
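
As an illustration of why pooling can mislead, the following minimal Python sketch contrasts a pooled analysis with a stratified one. The two cohorts, their numbers, and the function names (slope, stratified_slope) are invented for illustration and are not taken from any study; the stratified estimate shown is simply one common choice (a least-squares slope after centering within each cohort), not necessarily the method the authors have in mind.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def stratified_slope(cohorts):
    """Common within-cohort slope: center x and y inside each cohort,
    then accumulate the cross-products over cohorts."""
    sxy = 0.0
    sxx = 0.0
    for xs, ys in cohorts.values():
        mx = sum(xs) / len(xs)
        my = sum(ys) / len(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical toy data: (expression of one gene, clinical outcome score).
# The cohorts differ in baseline, as two studies or platforms might.
cohorts = {
    "cohort_A": ([1.0, 2.0, 3.0], [10.0, 11.0, 12.0]),
    "cohort_B": ([6.0, 7.0, 8.0], [2.0, 3.0, 4.0]),
}

# Naive pooling: concatenate all samples and ignore cohort membership.
pooled_x = [x for xs, _ in cohorts.values() for x in xs]
pooled_y = [y for _, ys in cohorts.values() for y in ys]

pooled = slope(pooled_x, pooled_y)
within = {name: slope(xs, ys) for name, (xs, ys) in cohorts.items()}
stratified = stratified_slope(cohorts)

print("pooled slope:     ", pooled)
print("per-cohort slopes:", within)
print("stratified slope: ", stratified)
```

In this toy example the association is positive within each cohort, yet the pooled slope is negative because the cohorts differ in baseline. The stratified estimate uses only within-cohort deviations, so it is unaffected by a constant offset between cohorts, which is why a stratified analysis does not require the expression measures to be on a common scale.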