Query-driven module discovery in microarray data

T. Dhollander, Q. Sheng, K. Lemmens, B. De Moor, K. Marchal, Y. Moreau

Algorithm

Existing (bi)clustering methods for microarray data analysis usually reveal global patterns in the data, which often do not answer the questions that are of specific interest to a biologist. We describe a Bayesian probabilistic method that addresses the biclustering problem when biologists query the data with a set of seed genes that they believe share a common function. The goal is to recruit other genes that might have the same function as the seed genes and, at the same time, to identify the experimental conditions relevant to this function, by finding genes whose expression profiles are similar to those of the seed genes under a subset of experimental conditions.

The core of the probabilistic framework we use in this paper is based on our previous work, the main ingredients being a column-wise statistical model for the bicluster expression, a column-wise statistical model for the background expression, and hidden labels for the genes and conditions to indicate bicluster membership. In (Sheng et al., 2003), we applied a Bayesian hierarchical model for discretized data with a multinomial likelihood and a Dirichlet prior to the problem of biclustering patients (finding clusters of patients that have similar expression patterns over a subset of their genes). The same model can be extended to query-driven biclustering of patients by imposing a tailored Dirichlet prior. In this paper, however, we focus on models with column-wise (condition-wise) Normal likelihoods and conjugate Normal-Inverse-chi-square priors for the dual problem, in which we bicluster genes. Moreover, strong prior distributions (representing prior knowledge from the seed) allow us to use Conditional Maximization instead of Gibbs sampling.
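To make the role of the conjugate prior concrete, the following minimal sketch applies the standard Normal-Inverse-chi-square update to the expression values of one condition (column). It is an illustration of the textbook conjugate update, not the authors' R implementation; the class name `NormalInvChi2`, the function `posterior` and all hyperparameter values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class NormalInvChi2:
    """Hyperparameters of a Normal-Inverse-chi-square distribution (hypothetical container)."""
    mu: float      # prior mean of the column
    kappa: float   # prior strength on the mean (pseudo-observations)
    nu: float      # prior degrees of freedom on the variance
    sigma2: float  # prior scale of the variance


def posterior(prior, values):
    """Standard conjugate update for Normal data with unknown mean and variance."""
    n = len(values)
    if n == 0:
        return prior
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    kappa_n = prior.kappa + n
    nu_n = prior.nu + n
    mu_n = (prior.kappa * prior.mu + n * mean) / kappa_n
    sigma2_n = (
        prior.nu * prior.sigma2
        + ss
        + prior.kappa * n / kappa_n * (mean - prior.mu) ** 2
    ) / nu_n
    return NormalInvChi2(mu_n, kappa_n, nu_n, sigma2_n)


# Assumed example values: a weak prior follows the data,
# a strong prior keeps the column mean close to the prior (seed) mean.
column = [1.0, 3.0]
weak = posterior(NormalInvChi2(mu=0.0, kappa=1.0, nu=1.0, sigma2=1.0), column)
strong = posterior(NormalInvChi2(mu=0.0, kappa=100.0, nu=100.0, sigma2=1.0), column)
```

With a large `kappa`, the posterior mean barely moves away from the prior mean. This is the sense in which a strong prior keeps a bicluster centered on the seed profile.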

A derivation for the formulas of the full conditional distributions, a comment on the scoring measures (module recovery, bicluster relevance) and some extra information on the resolution sweep approach can be found here.

Software

The R software for query-driven biclustering can be downloaded here.

Results on artificial data

We systematically evaluated our algorithm on several artificial data sets in two scenarios, S1 and S2, each containing noisy non-overlapping modules (A) and noiseless overlapping modules (B). The artificial data were taken from the supplementary website of (Prelic et al., 2006). In the first scenario (S1), the data consist of 10 binary modules (expression value 1) embedded in a zero background (expression value 0). This simple setup is complicated by adding Gaussian noise with standard deviation up to 0.25 (A) or by allowing module overlap (B). In (A), the modules have 10 genes and 5 conditions, while the modules in (B) each contain 10 + k genes and 10 + k conditions, where k is the number of genes and conditions in common between overlapping modules. The second scenario (S2) describes a similar case (same module sizes), but with continuous rather than binary data. The background values are samples from a Gaussian distribution, and the bicluster values in each column are equal (B) or equal up to some Gaussian noise (A). For details, we refer to (Prelic et al., 2006). No preprocessing (such as discretization) was performed. We did not apply the output filtering procedure described in (Prelic et al., 2006) to remove heavily overlapping biclusters or to limit the number of modules in the output.
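For readers who want a feel for scenario S1 (A), the sketch below generates a comparable data set: 10 non-overlapping modules of 10 genes by 5 conditions with value 1, a zero background, and additive Gaussian noise. The block-diagonal placement of the modules, the resulting matrix size and the function name `make_s1_noise_data` are assumptions made for illustration; the data actually used in the experiments came from the supplementary website of (Prelic et al., 2006).

```python
import random


def make_s1_noise_data(n_modules=10, genes_per_module=10,
                       conds_per_module=5, noise_sd=0.25, seed=0):
    """Binary modules on a zero background plus Gaussian noise (illustrative layout)."""
    rng = random.Random(seed)
    n_genes = n_modules * genes_per_module
    n_conds = n_modules * conds_per_module
    data = [[0.0] * n_conds for _ in range(n_genes)]
    modules = []
    for m in range(n_modules):
        # Assumption: modules are placed on the diagonal so that they do not overlap.
        genes = range(m * genes_per_module, (m + 1) * genes_per_module)
        conds = range(m * conds_per_module, (m + 1) * conds_per_module)
        for g in genes:
            for c in conds:
                data[g][c] = 1.0
        modules.append((set(genes), set(conds)))
    # Add independent Gaussian noise to every entry.
    for row in data:
        for j in range(n_conds):
            row[j] += rng.gauss(0.0, noise_sd)
    return data, modules


data, modules = make_s1_noise_data(noise_sd=0.25)
clean, _ = make_s1_noise_data(noise_sd=0.0)
```

The returned `modules` list holds the true gene and condition sets, which can serve both as a source of seed genes and as the reference when scoring the output.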

In all experiments, the seed consisted of genes correctly belonging to one of 10 artificial modules. We repeated the biclustering process 10 times, each time with randomly selected seed genes from a different module. The resulting biclusters were then scored with module recovery and bicluster relevance scores as described in (Prelic et al., 2006) and in Supplementary File 1. The module recovery score indicates how well the gene content of the ‘ideal’ modules is on average reflected in the (best matching bicluster in the) bicluster results. The bicluster relevance score is related to the relevance of the set of modules in the output. Both scores are maximal and equal to one if both module sets are equal.
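The two scores can be sketched as follows, using the gene match score of (Prelic et al., 2006): for each bicluster in one set, take the best Jaccard overlap of its gene set with any bicluster in the other set, then average. Here each bicluster is represented only by its set of genes; the function names are illustrative and not taken from the R software.

```python
def jaccard(a, b):
    """Jaccard index of two gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_match_score(m1, m2):
    """Average, over gene sets in m1, of the best Jaccard match found in m2."""
    if not m1 or not m2:
        return 0.0
    return sum(max(jaccard(g1, g2) for g2 in m2) for g1 in m1) / len(m1)


def bicluster_relevance(found, true_modules):
    """How well the found biclusters are represented by true modules."""
    return gene_match_score(found, true_modules)


def module_recovery(found, true_modules):
    """How well the true modules are reflected in the found biclusters."""
    return gene_match_score(true_modules, found)


# Toy example with made-up gene indices.
true_modules = [{1, 2, 3}, {4, 5, 6}]
found = [{1, 2, 3}]
relevance = bicluster_relevance(found, true_modules)
recovery = module_recovery(found, true_modules)
```

In this toy example the single found bicluster matches one true module exactly, so its relevance is 1, while only one of the two true modules is recovered, giving a recovery of 0.5.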

The corresponding module recovery and bicluster relevance scores can be found here. For a discussion of these results, we refer to the main text. It is important to keep in mind some shortcomings of artificial data analysis.

Results on combined Spellman and Gasch dataset

We evaluated the performance of our approach by applying it to a concatenation of two well-known yeast expression compendia: the Gasch (Gasch et al., 2000) and Spellman (Spellman et al., 1998) datasets. The expression data sets are identical to the ones used in (Lemmens et al., 2006). In most cases, we were able to find significantly enriched biclusters associated with functions similar to those described in (Lemmens et al., 2006). In addition, condition selection and the relationships between functions provide extra information, and the suggested approach is robust against noise.

Most cell cycle seeds ultimately evolve into ribosome biogenesis related modules, while most nutrient-deprived seeds evolve via nitrogen compound metabolism into aerobic respiration and more general energy-related functions. For galactose metabolism seeds, we did not observe any function changes over the tested resolution range. In general, the observed function changes are often transitions from specific Gene Ontology Biological Process classes to more general classes (examples include changes from ‘mitotic cell cycle’ to ‘cell division’, from ‘ATP synthesis coupled electron transport’ to ‘generation of precursor metabolites and energy’, from ‘M phase of mitotic cell cycle’ to ‘regulation of cell cycle’ and from ‘external encapsulating structure organization and biogenesis’ to ‘cytokinesis, completion of separation’). When strong priors are used (as we did throughout the paper by requiring the biclusters to be centered on the seed), the seed genes remain part of the biclusters at all resolutions. This implies that they may be involved in a number of overlapping functional modules (specific modules and more general ones).

Download zip file with resolution sweep plots (with non-corrected and Benjamini-Hochberg FDR corrected p values) for 104 Gasch seeds and the corresponding zip file with gene and condition scores of the (automatically) selected modules. More information about the latter (txt) files can be found here.

Download zip file with resolution sweep plots (with non-corrected and Benjamini-Hochberg FDR corrected p values) for 20 Spellman seeds and the corresponding zip file with gene and condition scores of the (automatically) selected modules. More information about the latter (txt) files can be found here.

References

Gasch,A.P. et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241-4257.

Lemmens,K. et al. (2006) Inferring transcriptional modules from ChIP-chip, motif and microarray data. Genome Biol., 7, R37.

Prelic,A. et al. (2006) A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22, 1122-1129.

Sheng,Q. et al. (2003) Biclustering microarray data by Gibbs sampling. Bioinformatics, 19 (Suppl 2), II196-II205.

Spellman,P.T. et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273-3297.