




Motif finding
People: Magali Lescot, Kathleen Marchal, Janick Mathys, Yves Moreau, Gert Thijs
A Gibbs Sampling algorithm to detect regulatory elements in sets of coexpressed genes (Web Interface)
Recent high-throughput techniques for monitoring gene expression levels are an important advance in the identification of coexpressed genes. A major challenge for the computational biologist is to identify novel regulatory elements (motifs) in such sets of coexpressed genes. Transcriptome analysis makes it possible to detect and cluster genes that are coexpressed under various biological conditions. Coregulated genes are known to share some similarities in their regulatory mechanism, possibly at the transcriptional level. This similarity implies that their promoter regions might contain consensus motifs recognized by the same regulatory proteins. Using this information, it is possible to investigate the cis-acting sequences that control the transcription of these genes.
To find over-represented motifs in the set of upstream sequences, we developed an algorithm based on the original Gibbs Sampling algorithm for motif finding by Lawrence et al. (1993). We implemented two extensions to this algorithm. First, we proposed a probabilistic sequence model to estimate the number of times the motif is repeated in each sequence in the data set. The second extension is the use of a higher-order background model to improve the robustness of the algorithm to noisy data. The background model is described by a higher-order Markov process and is represented by a transition matrix. This transition matrix can be estimated either from the input sequences or, preferably, from an independent data set.
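As a minimal sketch of what such a background model could look like, the Python code below estimates the transition matrix of a Markov process of a given order from a set of DNA sequences and uses it to score a sequence segment. It is an illustration under stated assumptions, not the implementation described here: the function names `estimate_background` and `background_log_prob`, the default order of 3, and the pseudocount smoothing are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def estimate_background(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from the given sequences.

    Returns a dict mapping each context string of length `order` to a dict
    of base probabilities (one row of the transition matrix).
    The pseudocount is an assumption of this sketch, used so that contexts
    never seen in the training data still get non-zero probabilities.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip positions containing ambiguous symbols such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1

    transitions = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        row = counts[context]
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        transitions[context] = {
            b: (row[b] + pseudocount) / total for b in ALPHABET
        }
    return transitions


def background_log_prob(segment, transitions, order=3):
    """Log-probability of segment[order:] given its first `order` bases."""
    segment = segment.upper()
    return sum(
        math.log(transitions[segment[i - order:i]][segment[i]])
        for i in range(order, len(segment))
    )


if __name__ == "__main__":
    # Toy training data; an independent set of upstream sequences
    # would be used in practice.
    training = ["ACGTACGTACGT", "TTGACGTCAATG"]
    model = estimate_background(training, order=2)
    print(background_log_prob("ACGTAC", model, order=2))
```

In a Gibbs sampling motif finder, a score of this kind would typically be compared with the probability of the same segment under the motif model, so that segments resembling ordinary background sequence are less likely to be sampled as motif instances.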
The use of a higher-order model considerably enhances the performance of our motif finding algorithm in the presence of noisy data. Data sets in which the regulatory elements are known were used first as a proof of concept and then to test the influence of the different background models on the performance of the motif detection algorithm. To demonstrate the performance on a real-life problem, we analyzed a microarray data set from physical wounding experiments in Arabidopsis thaliana. Several motifs known to be involved in the plant defense system were found.