INCLUSive Help Pages

Overview

Clustering
Gene Indexing
Localization of Intergenic Regions
Blast Report Parsing
Sequence Selection
MotifSampler

Clustering

Adaptive Quality-Based Clustering was designed by Frank De Smet to find groups of genes that have similar expression profiles across a set of microarray experiments. This form takes expression data as a tab-delimited ASCII text file (format described below). The other parameters to provide are the user's e-mail address, an optional alphanumeric identifier, the quality criterion (that is, the minimal probability of a gene belonging to a cluster), and the minimal number of genes in a cluster.

Data Format:

The data file should be a tab-delimited ASCII text file.
All fields should be tab-separated.
The first two columns should contain identifiers and may not contain expression data.
All lines starting with a '#' are discarded. Any line that does not contain measurements and does not start with a '#' will corrupt the input.
For best results, log-transform the data. It is not necessary to normalize the data; this is done within the core of the algorithm.
If you would like to extract the upstream sequences of the genes in your cluster:
1. The first column should contain the accession number of the sequence.
2. The second column should contain the gene name. This field may not contain any tabs.
If you do not have an accession number or a gene name, you should put other information in these two columns to identify the genes. Always make sure that the first two columns do not contain the expression levels.
All the other columns contain the expression levels as numerical values. Missing values can be left blank or replaced by NaN ('not a number').
If you have any questions about the data format take a look at the example or feel free to contact us.
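
The format rules above can be checked locally before a file is submitted. The following is a minimal sketch in Python, not the parser used by the server; the function name parse_expression_file is hypothetical, and skipping completely empty lines is an assumption that the rules above do not state.

```python
import io
import math


def parse_expression_file(handle):
    """Parse a tab-delimited expression file following the rules above.

    Returns a list of (accession, gene_name, values) tuples. Missing
    measurements (blank fields or 'NaN') are stored as float('nan').
    """
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        # Lines starting with '#' are discarded. Skipping completely
        # empty lines as well is an assumption of this sketch.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected two identifiers plus measurements: {line!r}")
        accession, gene_name = fields[0], fields[1]
        values = []
        for field in fields[2:]:
            field = field.strip()
            if field == "" or field.lower() == "nan":
                values.append(math.nan)
            else:
                # float() raises ValueError for non-numeric text, which
                # flags lines that would corrupt the input.
                values.append(float(field))
        rows.append((accession, gene_name, values))
    return rows
```

For example, parse_expression_file(open("data.txt")) returns one tuple per gene, or raises ValueError at the first line that does not follow the format (data.txt is a placeholder file name).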

The results are sent to the user as a URL link to a temporary HTML page. This page shows the expression profiles of each cluster together with the identifiers of the genes in the cluster. It also provides a link to retrieve the upstream sequences of all the genes in the cluster.

Gene Indexing

The Gene Indexing script tries to identify the named gene in the GenBank entry with the corresponding accession number. First, each sequence is retrieved from GenBank based on its accession number. The entry is then parsed to find the corresponding gene. To find the genes in the sequence efficiently, an indexed list of genes is created from the annotations in the GenBank file. To identify the gene, the given name is matched against the names found in the annotation. This process depends strongly on the quality of the annotation. Therefore, the user is given the opportunity to check the retrieved sequences and to indicate whether or not a gene should be included in further analysis. The retrieved sequences are classified into three classes:

Correctly indexed genes: Genes that could be identified without problems from the corresponding gene name.
EST or mRNA: The retrieved sequence is an EST or mRNA sequence; no gene annotation is found in the sequence.
Additional gene information: In this particular case the given gene name is not found in the parsed GenBank entry. The user should manually indicate which gene should be included for further analysis.
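
The three classes can be illustrated with a short Python sketch. It is not the script's actual code: the function classify_entry and its arguments are hypothetical, the entry is assumed to be parsed already into an ordered list of annotated gene names, and case-insensitive name matching and 1-based gene indices are assumptions of this example.

```python
def classify_entry(query_name, annotated_genes, is_transcript):
    """Assign a retrieved entry to one of the three classes above.

    query_name      -- gene name supplied by the user
    annotated_genes -- gene names found in the entry's annotation, in order
    is_transcript   -- True when the entry is an EST or mRNA sequence

    Returns (class_name, gene_index); gene_index is None unless the gene
    was matched.
    """
    if is_transcript or not annotated_genes:
        return ("EST or mRNA", None)

    # Indexed list of genes: lower-cased name -> 1-based position.
    # (Case-insensitive matching is an assumption of this sketch.)
    index = {}
    for position, name in enumerate(annotated_genes, start=1):
        index.setdefault(name.lower(), position)

    position = index.get(query_name.lower())
    if position is not None:
        return ("Correctly indexed gene", position)
    # Name not found: the user has to pick the gene manually.
    return ("Additional gene information", None)
```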

Possible Problems:

Warning: No accession number for the gene. There is no accession number available and therefore the entry cannot be retrieved from GenBank.
Warning: Sequence not found in GenBank. Either the input consisted of a non-existing accession number or the sequence retrieval system might be down.
Warning: Problems parsing sequence. There might be an error in the retrieved sequence and therefore the parsing of the sequence failed.

Localization of Intergenic Regions

The intergenic or upstream region of a gene is selected based on the indexing of the genes in the previous step. Three lists are created: (1) the blast list of all genes that should be blasted to find the intergenic region, (2) the intergenic list of true intergenic regions, and (3) the long upstream list of upstream sequences that are longer than the minimal desired length.

EST or mRNA: An EST or mRNA sequence normally does not contain sufficient annotation to delineate the upstream region, so it is added directly to the blast list.
Intergenic region: If the same GenBank file contains an annotated gene upstream of the selected gene, the region between the two genes is selected. The intergenic region is defined as the non-coding region between two consecutive genes. This region is added to the intergenic list.
Long upstream sequence: If the gene of interest is either the first gene on the W+ strand or the last gene on the C- strand, the upstream region is selected. If this region is longer than the minimal desired length, it is added to the long upstream list. The gene of interest is also added to the blast list to find another entry that contains the real intergenic region.
Short upstream region: If the upstream length does not meet the user-defined length criterion, the sequence is added to the blast list.
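
The four cases above amount to a small decision procedure. The sketch below restates it in Python under stated assumptions: the function route_gene and its arguments are hypothetical, and a region whose length equals the minimal desired length is treated as long enough, which the text above leaves open.

```python
def route_gene(is_transcript, has_upstream_gene, upstream_length, min_length):
    """Return the names of the lists a gene (or its region) is added to."""
    if is_transcript:
        # EST or mRNA: not enough annotation to delineate the upstream region.
        return ["blast"]
    if has_upstream_gene:
        # An annotated gene lies upstream in the same entry.
        return ["intergenic"]
    # First gene on the W+ strand or last gene on the C- strand.
    if upstream_length >= min_length:  # treating equality as "long" is an assumption
        # Keep the long upstream region, and still blast the gene to look
        # for an entry that contains the real intergenic region.
        return ["long upstream", "blast"]
    return ["blast"]
```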

Blast Report Parsing

The sequences in the blast list are blasted at NCBI through the blastcl3 client program. The reports are parsed using the Bioperl module Bio::Tools::BPlite to detect the intergenic regions corresponding to the query genes. A flowchart of the parsing process is shown in the following figure:

The results e-mail also contains a link to an overview page of the blast reports, based on the process ID. Each blast report is summarised and only the important alignment details are given. The lines in this summary are:

The first line contains the query descriptor.
A list of all the subject sequences that align with the query sequence. For each subject sequence, the identification and a table with the most important parameters of all HSPs are given. This table consists of the following parts:
1. The query line contains the position of the match in the query sequence; the percentage of matching nucleotides is indicated in bold.
2. The subject line contains the corresponding positions of the match in the subject sequence.

Sequence Selection

Here is a description of the different fields in the table:

Checkbox to indicate whether or not a sequence will be selected for further analysis.
Desired length: The length to which the sequence is truncated. If set to 0, the sequence is not truncated.
Accession Number of the retrieved sequence.
Gene Index of the gene of interest in the retrieved sequence.
Gene Name of the gene of interest.
Type of tag (upstream, transcript, mRNA, cds, gene) on which the selection of the intergenic region is based.
Orientation of the selected region in the retrieved sequence. If the strand is -1 (C-), the reverse complement is given.
Start Position of the selected region in the retrieved sequence.
End Position of the selected region in the retrieved sequence.
Descriptor of the original query sequence.
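
How the start position, end position, orientation and desired length might combine is sketched below in Python. This is an illustration, not the site's code: the function select_region is hypothetical, positions are assumed to be 1-based and inclusive, and truncation is assumed to keep the end of the region closest to the gene. The page itself does not state either convention.

```python
# Translation table for complementing DNA; other characters are left as-is.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def select_region(sequence, start, end, strand, desired_length=0):
    """Extract a region and orient it as described in the table above.

    start, end     -- 1-based, inclusive positions (assumption)
    strand         -- 1 (W+) or -1 (C-)
    desired_length -- 0 means no truncation
    """
    region = sequence[start - 1:end]
    if strand == -1:
        # Reverse complement for regions on the C- strand.
        region = region.translate(_COMPLEMENT)[::-1]
    if desired_length > 0 and len(region) > desired_length:
        # Assumption: keep the part nearest the gene, i.e. the 3' end
        # of the oriented upstream region.
        region = region[-desired_length:]
    return region
```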

Motif Sampler

The Motif Sampler tries to find over-represented motifs in the upstream regions of a set of co-regulated genes. This motif-finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. This implementation focuses on the use of higher-order background models to improve the robustness of the motif finding. At the moment the Motif Sampler comes with background models for several organisms (see the pop-up list further down the page). The Motif Sampler is also suitable for other organisms, since the background model can be calculated from the input sequences.
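
To make the idea of a higher-order background model concrete, the following Python sketch estimates an order-k Markov model from a set of input sequences. It is an illustration only and not the Motif Sampler's implementation: the function background_model, the default order of 3 and the pseudocount smoothing are all assumptions of this example.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def background_model(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from the input sequences.

    Returns {context: {base: probability}} for every context observed.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous symbols such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1

    model = {}
    for context, base_counts in counts.items():
        # Pseudocounts keep unseen transitions from getting probability 0.
        total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {
            base: (n + pseudocount) / total for base, n in base_counts.items()
        }
    return model
```

With order=0 the model reduces to single-nucleotide frequencies; higher orders capture the oligonucleotide composition of the background, at the cost of needing more sequence data for reliable estimates.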
