Overview

  1. Clustering
  2. Gene Indexing
  3. Localization of Intergenic Regions
  4. Blast Report Parsing
  5. Sequence Selection
  6. Motif Sampler

Clustering

Adaptive Quality-Based Clustering was designed by Frank De Smet to find groups of genes that have similar expression profiles across a set of microarray experiments. This form takes expression data in a tab-delimited ASCII text file (format described below). The other parameters to provide are the user's email address, an optional alphanumeric identifier, the quality criterion (the minimal probability that a gene belongs to a cluster), and the minimal number of genes in a cluster.

Data Format:


The results are sent to the user as a URL link to a temporary HTML page. This page shows the expression profiles of each cluster together with the identifiers of the genes in that cluster. It also provides a link to retrieve the upstream sequences of all the genes in the cluster.


Gene Indexing

The Gene Indexing script tries to identify the named gene in the GenBank entry with the corresponding accession number. First, each sequence is retrieved from GenBank by its accession number. The entry is then parsed to find the corresponding gene. To find the genes in the sequence efficiently, an indexed list of genes is created from the annotations in the GenBank file. To identify the gene, the given name is matched against the names found in the annotation. This process depends strongly on the quality of the annotation. The user is therefore given the opportunity to check the retrieved sequences and to indicate whether or not a gene should be included in further analysis. The retrieved sequences are classified into three classes:

Correctly indexed genes
Genes that could be identified without problems from the corresponding gene name.

EST or mRNA
The retrieved sequence is an EST or mRNA sequence; no gene annotation is found in the sequence.

Additional gene information
In this particular case the given gene name is not found in the parsed GenBank entry. The user should manually indicate which gene should be included for further analysis.
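The three classes above can be illustrated with a minimal Python sketch. It is not the actual Gene Indexing script: the `Entry` record, its field names, and the case-insensitive name comparison are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical, simplified view of a parsed GenBank entry."""
    accession: str
    molecule_type: str                      # assumed values: "DNA", "mRNA", "EST"
    annotated_genes: list = field(default_factory=list)


def classify_entry(entry, gene_name):
    """Assign a retrieved entry to one of the three classes described above."""
    # EST and mRNA entries carry no usable gene annotation.
    if entry.molecule_type in ("EST", "mRNA"):
        return "EST or mRNA"
    # Assumption: names are compared case-insensitively.
    wanted = gene_name.lower()
    if any(name.lower() == wanted for name in entry.annotated_genes):
        return "Correctly indexed gene"
    # The name was not found; the user has to pick the gene manually.
    return "Additional gene information"
```

For example, an entry annotated with the genes lexA and recA would be reported as correctly indexed for the query name lexA, and as needing additional gene information for a name that is not in its annotation.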

Possible Problems:


Localization of Intergenic Regions

The intergenic or upstream region of a gene is selected based on the indexing of the genes in the previous step. Three lists are created: (1) the blast list of all genes that should be blasted to find the intergenic region, (2) the intergenic list of true intergenic regions, and (3) the long upstream list of upstream sequences that are longer than the minimal desired length.

EST or mRNA
An EST or mRNA sequence normally does not contain sufficient annotation to delineate the upstream region, so it is added directly to the blast list.

Intergenic region
If the same GenBank file contains an annotated gene upstream of the selected gene, the region between the two genes is selected. The intergenic region is defined as the non-coding region between two consecutive genes. This region is added to the intergenic list.

Long upstream sequence
If the gene of interest is either the first gene on the W+ strand or the last gene on the C- strand, the upstream region is selected. If this region is longer than the minimal desired length, it is added to the long upstream list. The gene of interest is added to the blast list to find another entry that contains the real intergenic region.

Short upstream region
If the upstream region does not meet the user-defined length criterion, the sequence is added to the blast list.
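The four cases above amount to a small decision procedure. The following Python sketch shows one way to express it. The function name, its parameters, and the list labels are hypothetical, and treating a region of exactly the minimal length as short is an assumption.

```python
def route_gene(kind, upstream_length, min_length, has_upstream_gene):
    """Return the set of lists a gene (or its upstream region) is added to.

    kind              -- "EST", "mRNA" or "genomic" (hypothetical labels)
    upstream_length   -- length of the available upstream region
    min_length        -- minimal desired length set by the user
    has_upstream_gene -- True if an annotated gene lies upstream in the same entry
    """
    # EST or mRNA: not enough annotation, go straight to the blast list.
    if kind in ("EST", "mRNA"):
        return {"blast"}
    # True intergenic region between two consecutive annotated genes.
    if has_upstream_gene:
        return {"intergenic"}
    # Long upstream sequence: keep it, and blast the gene to look for
    # another entry that contains the real intergenic region.
    if upstream_length > min_length:
        return {"long_upstream", "blast"}
    # Short upstream region: only the blast list.
    return {"blast"}
```

With a minimal length of 500, a gene with 800 bases of upstream sequence and no annotated upstream neighbour would land on both the long upstream list and the blast list.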


Blast Report Parsing

The sequences in the blast list are blasted at NCBI through the blastcl3 client program. The reports are parsed using the Bioperl module Bio::Tools::BPlite to detect the intergenic regions corresponding to the query genes. A flowchart of the parsing process is shown in the following figure:

The results email also contains a link to an overview page of the blast reports, based on the process ID. Each blast report is summarised and only the important alignment details are given. Here is a description of the different lines in this summary:


Sequence Selection

Here is a description of the different fields in the table:

  1. Checkbox to indicate whether or not a sequence will be selected for further analysis.
  2. Desired length: Set the desired length to truncate the sequence. If set to 0 the sequence will not be truncated.
  3. Accession Number of the retrieved sequence.
  4. Gene Index of the gene of interest in the retrieved sequence.
  5. Gene Name of the gene of interest.
  6. Type of tag (upstream, transcript, mRNA, cds, gene) on which the selection of the intergenic region is based.
  7. Orientation of the selected region in the retrieved sequence. If the strand is -1 (C-), the reverse complement is given.
  8. Start Position of the selected region in the retrieved sequence.
  9. End Position of the selected region in the retrieved sequence.
  10. Descriptor of the original query sequence.
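
How the desired length, orientation, start position and end position combine can be sketched in Python as follows. This is an illustration, not the code behind the form. It assumes 1-based inclusive positions, and it assumes that truncation keeps the end of the region closest to the gene, which the field descriptions do not state. The function names are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def select_region(sequence, start, end, strand, desired_length=0):
    """Extract the selected region from the retrieved sequence.

    start, end     -- assumed to be 1-based, inclusive positions
    strand         -- 1 (W+) or -1 (C-)
    desired_length -- 0 means the region is not truncated
    """
    region = sequence[start - 1:end]
    # On the C- strand the reverse complement is given.
    if strand == -1:
        region = reverse_complement(region)
    # Assumption: truncation keeps the 3' end of the region,
    # i.e. the part nearest to the gene.
    if desired_length > 0 and len(region) > desired_length:
        region = region[-desired_length:]
    return region
```

Calling `select_region("AACCGGTT", 1, 4, -1)` extracts `AACC` and returns its reverse complement, `GGTT`.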


Motif Sampler

The Motif Sampler tries to find over-represented motifs in the upstream regions of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. In this implementation we focus on the use of higher-order background models to improve the robustness of the motif finding. At the moment the Motif Sampler comes with background models for several organisms (see the pop-up list further down the page). The Motif Sampler is also suitable for other organisms, since the background model can be calculated from the input sequences.
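
To make the idea of a background model calculated from the input sequences concrete, here is a minimal Python sketch that estimates a Markov model of a given order by counting which base follows each context of `order` bases. It is not the Motif Sampler's own code. The function name, the default order of 3, and the pseudocount smoothing are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import product


def background_model(sequences, order=3, pseudocount=1.0, alphabet="ACGT"):
    """Estimate P(base | previous `order` bases) from a set of sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip positions that contain ambiguous symbols such as N.
            if base in alphabet and all(c in alphabet for c in context):
                counts[context][base] += 1

    model = {}
    # Enumerate every possible context so unseen ones still get a row.
    for context in map("".join, product(alphabet, repeat=order)):
        total = sum(counts[context].values()) + pseudocount * len(alphabet)
        model[context] = {
            base: (counts[context][base] + pseudocount) / total
            for base in alphabet
        }
    return model
```

The result maps each context to a probability distribution over the next base. The pseudocount keeps contexts that never occur in a small input set from receiving zero probabilities.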



This page is maintained by Gert Thijs. Last update 2002/02/28.
Email: gert.thijs@esat.kuleuven.ac.be
Copyright © 2002, KULeuven.