I am interested in all aspects of genomic data analysis, although most of the data I work with come from either microarray or quantitative PCR experiments. My preferred domain remains statistical pattern recognition and its applications to the life sciences.

Current research projects and themes

  • TSP family of classifiers

    Top Scoring Pairs (TSPs) are simple bivariate classifiers that predict the class label based on the relative ordering of the two variables. Because this ordering is more stable and reproducible than the raw values across data sets and, more importantly, across microarray platforms, TSPs are attractive solutions to various classification problems.
    However, obtaining the list of TSPs requires computing a score for every pair of features, a task that is time-consuming for high-dimensional data. To solve this problem, we provide a C++ implementation that uses OpenMP, which is available on most modern multi-core machines.
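
    A minimal sketch of the exhaustive pair search, not the code of our package: it assumes the unweighted score of the original TSP algorithm, |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|, and the names top_scoring_pair, TopPair and better are illustrative only. The outer loop over features is what OpenMP distributes across cores; compiled without -fopenmp, the pragmas are ignored and the same code runs serially.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Best pair found by the exhaustive search, with its score.
struct TopPair {
    std::size_t i;
    std::size_t j;
    double score;
};

// True if a should replace b: higher score wins, ties go to the smaller
// (i, j), so the result does not depend on thread scheduling.
static bool better(const TopPair& a, const TopPair& b) {
    if (a.score != b.score) return a.score > b.score;
    if (a.i != b.i) return a.i < b.i;
    return a.j < b.j;
}

// x[s][f] is the value of feature f in sample s; y[s] is the 0/1 label.
// Score of pair (i, j): |P(x_i < x_j | y = 0) - P(x_i < x_j | y = 1)|.
TopPair top_scoring_pair(const std::vector<std::vector<double>>& x,
                         const std::vector<int>& y) {
    assert(!x.empty() && x.size() == y.size());
    const std::size_t n = x.size();
    const long p = static_cast<long>(x[0].size());
    assert(p >= 2);

    // Class sizes, needed to turn counts into relative frequencies.
    double n0 = 0.0, n1 = 0.0;
    for (std::size_t s = 0; s < n; ++s) (y[s] == 0 ? n0 : n1) += 1.0;
    assert(n0 > 0.0 && n1 > 0.0);

    TopPair best{0, 0, -1.0};
    #pragma omp parallel
    {
        TopPair local{0, 0, -1.0};  // per-thread best, merged below
        #pragma omp for schedule(dynamic) nowait
        for (long i = 0; i < p; ++i) {
            for (long j = i + 1; j < p; ++j) {
                double c0 = 0.0, c1 = 0.0;
                for (std::size_t s = 0; s < n; ++s) {
                    if (x[s][i] < x[s][j]) (y[s] == 0 ? c0 : c1) += 1.0;
                }
                TopPair cand{static_cast<std::size_t>(i),
                             static_cast<std::size_t>(j),
                             std::fabs(c0 / n0 - c1 / n1)};
                if (better(cand, local)) local = cand;
            }
        }
        #pragma omp critical
        {
            if (better(local, best)) best = local;
        }
    }
    return best;
}
```

    The dynamic schedule is chosen because the inner loop gets shorter as i grows, so equal-sized static chunks would leave some threads idle.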

    We also extend the original TSP algorithm by allowing weighted scoring schemes and different ways of combining the individual predictions of the TSPs, and by providing a solution to multi-class problems. For the latter, we decompose the problem into a series of binary classification problems: let 1,..., C be the original labels, then for any pair of labels i, j, such that 1 <= i < j <= C, we construct the TSP classifier for discriminating i (re-labelled as 0) from j (re-labelled as 1). We collect all the TSPs, corresponding to all the C(C-1)/2 pairs (i,j), and their predictions into a 0/1 matrix with n (number of samples) rows and m (number of TSPs) columns. This matrix is fed into the ctree() classifier, together with the original labels 1,2,...,C, to produce a classification tree. The nodes of this tree are individual TSPs (single pairs of the original variables), with branches corresponding to the label (0 or 1) predicted by the TSP. The leaves are the original 1,2,...,C labels. This process can be seen as a transformation of the original feature space (e.g. gene expression levels) into a binary feature space, given by the predictions of the TSPs.
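
    The decomposition can be sketched as follows. This is an illustration under simplifying assumptions, not our implementation: the TSPs are taken as already fitted, each one is reduced to a hypothetical PairRule that predicts 1 when x[a] < x[b] (the opposite direction is obtained by swapping a and b), and the names label_pairs and prediction_matrix are invented for the example. The resulting n-by-m matrix is what would then be passed, with the original labels, to ctree() in R.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One fitted TSP (hypothetical representation): predicts 1 when
// x[a] < x[b] and 0 otherwise.
struct PairRule {
    std::size_t a;
    std::size_t b;
};

// All label pairs (i, j) with 1 <= i < j <= C; there are C(C-1)/2 of them.
// One binary problem (i re-labelled 0, j re-labelled 1) is solved per pair.
std::vector<std::pair<int, int>> label_pairs(int C) {
    assert(C >= 2);
    std::vector<std::pair<int, int>> pairs;
    for (int i = 1; i < C; ++i) {
        for (int j = i + 1; j <= C; ++j) {
            pairs.emplace_back(i, j);
        }
    }
    return pairs;
}

// Builds the n-by-m 0/1 matrix: row s holds the predictions of the m
// rules on sample s. x[s][f] is the value of feature f in sample s.
std::vector<std::vector<int>> prediction_matrix(
        const std::vector<std::vector<double>>& x,
        const std::vector<PairRule>& rules) {
    std::vector<std::vector<int>> z(x.size(),
                                    std::vector<int>(rules.size(), 0));
    for (std::size_t s = 0; s < x.size(); ++s) {
        for (std::size_t t = 0; t < rules.size(); ++t) {
            const PairRule& r = rules[t];
            assert(r.a < x[s].size() && r.b < x[s].size());
            z[s][t] = (x[s][r.a] < x[s][r.b]) ? 1 : 0;
        }
    }
    return z;
}
```

    Every rule is applied to every sample, including samples whose label is neither i nor j; the tree is left to decide which columns are informative for which classes.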

    Software: The development platform is Linux (Ubuntu 9.10, GCC 4.4.1, R-2.11.1), and it is the only platform on which we can actively support our software. The Windows binaries were built from the source package using GCC/G++ (from the MinGW distribution) with the corresponding GOMP (GNU OpenMP) and pthreads libraries (included in the distribution).
    Please note that the package depends on the party package, available from the CRAN repository.

    Before installing, you might want to check if OpenMP is supported on your system. For this, download the file test_openmp.cpp and compile it using

    g++ -o test_openmp -fopenmp test_openmp.cpp -lgomp

    Note that -lgomp might not be needed anymore for the newer versions of gcc. If OpenMP is supported (both by your compiler and operating system), running the program should print something like (for 8 cores):

    Using OpenMP!
    ---> Maximum number of threads available: 8
    ---> Number of processing units: 8
    ---> Using dynamic thread allocation strategy.
    Running 80 iterations.

    followed by a number of lines of the form Iteration: n.
    If OpenMP is supported, you should have no trouble installing the Rgtsp package.
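
    For readers who want to see what such a check involves, here is a minimal sketch in the same spirit. It is not the test_openmp.cpp file mentioned above, and the function names are illustrative; it relies only on the standard _OPENMP macro and the omp_get_max_threads() and omp_get_num_procs() runtime calls. Calling report(std::cout) from main() gives a quick diagnosis.

```cpp
#include <cassert>
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

// True when the compiler was invoked with OpenMP enabled (e.g. -fopenmp).
bool openmp_enabled() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Number of threads a parallel region may use; 1 without OpenMP.
int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Runs n loop iterations (in parallel when OpenMP is enabled) and
// returns how many were executed; the reduction avoids a data race.
int run_iterations(int n) {
    assert(n >= 0);
    int done = 0;
    #pragma omp parallel for schedule(dynamic) reduction(+:done)
    for (int k = 0; k < n; ++k) {
        done += 1;
    }
    return done;
}

// Prints a short summary of the OpenMP configuration to out.
void report(std::ostream& out) {
    out << "OpenMP enabled: " << (openmp_enabled() ? "yes" : "no") << "\n";
    out << "Maximum number of threads: " << max_threads() << "\n";
#ifdef _OPENMP
    out << "Number of processing units: " << omp_get_num_procs() << "\n";
#endif
    out << "Iterations executed: " << run_iterations(80) << "\n";
}
```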
  • MAQC-II

    MAQC-II is an FDA-led initiative for the standardization of methodology for biomarker identification studies. There is a lack of accepted standards for biomarker validation, for biological interpretation of results and for demonstrating comparability of conclusions. The initiative compares methods for selection and validation of biomarkers from microarray data, paying particular attention to robustness, flexibility and reproducibility of the classification system. Besides contributing to the main effort of the project by designing and implementing a data analysis plan compliant with the FDA's requirements, I also focused on more specific issues, such as how the complexity or difficulty of a classification problem affects the optimal combination of feature selection and classification methods.
  • Selection of control genes

    We propose a meta-analysis approach to selecting candidate control genes. It has the advantage of being platform- and normalization-independent and of being able to integrate predefined lists of genes as well. The first step is to score the genes from a dataset and to rank them accordingly.
    [Figure: color-coded gene scores from one dataset]

    R code for scoring and aggregating the gene ranks from several datasets is available here.
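
    The rank-then-aggregate idea can be sketched as below. This is not a translation of the R code: the scoring function is taken as given, and the sketch assumes that a higher score marks a better candidate, that ties are broken by gene index, that all datasets list the same genes in the same order, and that ranks are combined by a plain mean. The names ranks_from_scores and mean_ranks are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Ranks for one dataset: rank 1 goes to the highest score.
// Ties are broken by gene index (stable sort), an assumption of this sketch.
std::vector<double> ranks_from_scores(const std::vector<double>& scores) {
    std::vector<std::size_t> order(scores.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&scores](std::size_t a, std::size_t b) {
                         return scores[a] > scores[b];
                     });
    std::vector<double> rank(scores.size());
    for (std::size_t pos = 0; pos < order.size(); ++pos) {
        rank[order[pos]] = static_cast<double>(pos + 1);
    }
    return rank;
}

// Mean rank of each gene across datasets; datasets[d][g] is the score of
// gene g in dataset d. Working on ranks rather than raw scores is what
// makes the aggregate insensitive to platform and normalization.
std::vector<double> mean_ranks(
        const std::vector<std::vector<double>>& datasets) {
    assert(!datasets.empty());
    const std::size_t g = datasets[0].size();
    std::vector<double> total(g, 0.0);
    for (const std::vector<double>& scores : datasets) {
        assert(scores.size() == g);
        const std::vector<double> r = ranks_from_scores(scores);
        for (std::size_t k = 0; k < g; ++k) total[k] += r[k];
    }
    for (double& t : total) t /= static_cast<double>(datasets.size());
    return total;
}
```

    Genes with the smallest mean rank are the strongest candidates under these assumptions.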
  • Segmentation of tiling array data

    Segmenting tiling array data is a challenging task due to the high level of noise that affects the measurements. We introduce a wavelet-based denoising step in the segmentation process and demonstrate its effectiveness on simulated and real-world data. This denoising step has the advantage of improving the accuracy of the segmentation while also reducing the execution time and memory requirements.
    [Figure: example segmentation of yeast chromosome 1]
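
    As a separate illustration of the denoising step only, here is a minimal sketch of wavelet shrinkage. The Haar wavelet, soft thresholding, a single threshold for all levels, and the name haar_denoise are assumptions made for the example; they are not necessarily the choices made in our method, and the segmentation itself is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shrinks a detail coefficient towards zero by thr (soft thresholding).
static double soft_threshold(double d, double thr) {
    const double mag = std::fabs(d) - thr;
    if (mag <= 0.0) return 0.0;
    return d > 0.0 ? mag : -mag;
}

// Multi-level orthonormal Haar transform, soft thresholding of the detail
// coefficients, then inverse transform. x.size() must be divisible by
// 2^levels.
std::vector<double> haar_denoise(std::vector<double> x, double thr,
                                 int levels) {
    const std::size_t n = x.size();
    assert(levels >= 0 && levels < 31 && thr >= 0.0);
    assert(n % (static_cast<std::size_t>(1) << levels) == 0);
    const double r = std::sqrt(2.0);
    std::vector<double> tmp(n);

    // Forward transform: at each level the first half of the active block
    // receives the averages, the second half the thresholded details.
    std::size_t len = n;
    for (int l = 0; l < levels; ++l) {
        const std::size_t half = len / 2;
        for (std::size_t k = 0; k < half; ++k) {
            tmp[k] = (x[2 * k] + x[2 * k + 1]) / r;
            tmp[half + k] =
                soft_threshold((x[2 * k] - x[2 * k + 1]) / r, thr);
        }
        std::copy(tmp.begin(),
                  tmp.begin() + static_cast<std::ptrdiff_t>(len), x.begin());
        len = half;
    }

    // Inverse transform, from the coarsest level back to full resolution.
    for (int l = levels - 1; l >= 0; --l) {
        len = n >> l;
        const std::size_t half = len / 2;
        for (std::size_t k = 0; k < half; ++k) {
            tmp[2 * k] = (x[k] + x[half + k]) / r;
            tmp[2 * k + 1] = (x[k] - x[half + k]) / r;
        }
        std::copy(tmp.begin(),
                  tmp.begin() + static_cast<std::ptrdiff_t>(len), x.begin());
    }
    return x;
}
```

    With thr set to 0 the input is reconstructed unchanged; a very large threshold removes all detail and leaves block averages, which is the piecewise-constant shape a segmentation algorithm expects.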
  • Tumor scoring using qPCR/microarray analysis

    This is a long-term project whose goal is to design one or several molecular signatures with prognostic value for survival and treatment response in breast cancer.
  • Breast cancer data analysis

    I am involved in a number of projects concerned with the analysis of breast cancer microarray data. One of these projects is MAQC-II, a US project aimed at validating classifiers built on microarray predictors.
