|
In this tutorial we guide you through the process of analysing the cis-regulatory logic of a set of genes in human.
The example data set consists of eight genes that were extracted from the ATLAS expression data. These genes are
expressed in lung and not in any other tissue. The identifiers (Affymetrix IDs) of the genes can be found here.
Sequence retrieval

First we are going to download the genes from Ensembl. Click the menu item Get Seq -> From Ensembl ->
new/add.

A popup window will appear in which we can define the genes and the corresponding parameters to download the
sequences. First we select and open the file with identifiers.

Next, we define that we have a file with identifiers. The other option is to enter a list of comma-separated
identifiers.

The identifiers are of the type 'AFFY_HG_U95Av2'. Other possible identifier types are ensembl, embl, genbank, go,
hugo, locuslink, interpro, pdb and many others. The mapping between these identifier types is taken from the
Ensembl database.

Next, we can add the sequences of orthologous genes to our sequence set. In this example we choose only the mouse
orthologous genes, but many other organisms are available. Again, the mapping is extracted directly from the Ensembl
database.
The other parameter we change is the number of nucleotides to select around the gene. Here we set this number to 10,000.
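
As a rough illustration of what the flanking parameter means, the sketch below extends a gene interval by a fixed number of nucleotides. The function name and the coordinates are hypothetical, and it assumes the flank is applied on both sides of the gene; Toucan's own handling may differ.

```python
def flanked_interval(gene_start, gene_end, flank=10_000, chrom_length=None):
    """Extend a gene interval by `flank` nucleotides on each side.

    Coordinates are 1-based and inclusive. The result is clipped to
    position 1 and, if given, to the chromosome length.
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    start = max(1, gene_start - flank)
    end = gene_end + flank
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


# Hypothetical gene located at positions 50,000 to 60,000.
region = flanked_interval(50_000, 60_000)
```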

When the data is submitted the download process starts. In the status window you can monitor this process.

When all sequences are retrieved, they are displayed in the application window together with the annotated gene, CDS,
exons, five prime and three prime UTR. Normally the orthologous genes are displayed directly beneath the selected human
gene. The human genes are displayed with the identifier given in the file, but internally the Ensembl ID is used. You can
see this Ensembl ID in the gene information subwindow when you click on the gene name.

Some genes are located on the minus strand. To ease the analysis and the visual interpretation, we reverse complement
these sequences. When doing so, the resemblance between orthologs becomes more obvious in most cases.
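
For readers unfamiliar with the operation, reverse complementing can be sketched in a few lines. This is a minimal illustration, not Toucan's code; the helper names are hypothetical. The second helper shows how the coordinates of an annotated feature have to be mirrored when its sequence is flipped.

```python
# Translation table for the complement of each base (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_feature(start, end, seq_length):
    """Mirror 1-based inclusive feature coordinates onto the flipped sequence."""
    return seq_length - end + 1, seq_length - start + 1


flipped = reverse_complement("ATGC")
```
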
Pairwise Alignment

When all sequences are correctly imported in Toucan, we start with the first step in the analysis of the cis-regulatory
logic. We would like to define the regions that are conserved between the human-mouse orthologs. To do so, we can use
one of the available pairwise alignment algorithms: AVID, LAGAN or BlastZ. To find these conserved regions in all
human-mouse pairs at once we choose Alignment -> all pairwise from the menu.

In the popup window we select all possible pairs. The suggested pairs are based upon the relation between the
orthologous genes (this can also be seen in the gene information subwindow). For the moment, we choose AVID as the
alignment algorithm and leave the other parameters at their default values. After the data are submitted, they are sent
to our server, where they are processed.

When the alignment results are ready, they are annotated as features on the respective sequences.

To better view the results we zoom in on the sequences and we also toggle the fill state of the conserved regions on
one pair of orthologs. This picture nicely shows how several of the conserved regions correspond to the annotated
exons. We also see four conserved regions upstream of the gene start. Those are the regions that will be of interest in
the next analysis step.

Since AVID did not return any useful hits for a few pairs, we also try another pairwise alignment program. We start
LAGAN from the menu Alignment -> 2 seq -> LAGAN.

As the first sequence we select the human gene and as the second the corresponding mouse ortholog. Remember that Toucan
uses Ensembl IDs internally, while the original ID (AFFY in this example) is shown in the visualization window. The
corresponding Ensembl ID can be found in the gene information window when you click the gene name.
The other parameters are left untouched.

The conserved regions found with LAGAN are annotated on the respective sequences.
Syntenic region selection

In the next step we would like to create a new dataset with all the conserved upstream regions. This can be achieved by
right-clicking on such a region in the visualization window and using the cut option.

With the cut option, we can cut a specific part from the sequence and either add it to the sublist or let this new
part replace the old sequence. First we select the appropriate feature (the conserved region) from the pulldown
menu. If we were interested in both the human and mouse sequences, we would indicate that we want to cut all the
features of the same type on all other sequences. For the subsequent step, we only need the regions on the human
sequence, so we select only this feature and set the number of nucleotides selected on the left and right side to 0.
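
Conceptually, cutting a feature amounts to slicing the sequence at the feature coordinates, optionally extended on either side. The sketch below illustrates this under the assumption of 1-based inclusive coordinates; the function and parameter names are hypothetical and do not reflect Toucan's internals.

```python
def cut_feature(sequence, start, end, left=0, right=0):
    """Return the part of `sequence` covered by a feature.

    `start` and `end` are 1-based inclusive feature coordinates;
    `left` and `right` are extra nucleotides to keep on each side.
    The result is clipped to the sequence boundaries.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    lo = max(1, start - left)
    hi = min(len(sequence), end + right)
    return sequence[lo - 1:hi]


# With both margins set to 0, only the feature itself is kept.
region = cut_feature("AAACCCGGGTTT", 4, 6)
```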

Once all individual regions have been added to the sublist, we can save the sublist via the File -> Save sublist
menu.

Since we only need the sequences for the subsequent steps, we save the syntenic regions in a FASTA file. To do so, we
give our filename the extension .fasta or .tfa.
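
For reference, a FASTA file is simply a header line starting with '>' followed by the sequence, usually wrapped at a fixed width. The sketch below formats a list of named regions this way; the function name, record names and file name are hypothetical, and the exact headers written by Toucan may differ.

```python
def format_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping lines at `width`."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "".join(line + "\n" for line in lines)


# Hypothetical region names; the text could then be written to, e.g., regions.fasta.
text = format_fasta([("region_1", "ACGTACGTAC"), ("region_2", "TTGACA")], width=8)
```
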
Analysis of Syntenic Region
Now, for the further analysis, start a new Toucan session.

First load the saved FASTA file into Toucan.

We start MotifScanner to search for known motifs in this set of sequences.

As motif models we choose the vertebrate matrices from the TRANSFAC public repository.

As background model we choose the model built from a set of human-mouse syntenic regions.

The prior probability of finding one motif instance in a sequence is set to 0.2.

When the results are ready, a popup window appears and the results are temporarily displayed in the status window. There
are two options to save the results. The first is to annotate the retrieved features directly on the sequences, which
is done by clicking 'yes'. When 'no' is selected, the retrieved features are saved to files: the first file contains
the instances in GFF and the other is a specific matrix file. When you select 'yes' the matrix file is not saved.
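
If you save the instances to a GFF file and want to process them outside Toucan, each line can be split into the standard tab-separated GFF columns. The sketch below is a minimal parser under that assumption; the example line, including its source and feature names, is made up for illustration and is not actual MotifScanner output.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one tab-separated GFF line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF line needs at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand, frame, attributes,
    )


# Hypothetical line, for illustration only.
record = parse_gff_line("seq1\tMotifScanner\tmisc_feature\t120\t131\t0.87\t+\t.\tid \"M00001\"")
```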

In this example just click on 'yes' to annotate the computed features directly on the sequences.

To get a first impression of which motifs are overrepresented in this set of conserved regions, we compute their
statistical significance.

The statistical overrepresentation of the number of instances is measured by comparing the observed number with the
expected number of instances based on a reference set. As a reference, we use here the expected frequencies computed by
screening a large set of syntenic regions with the TRANSFAC matrices. The file can be found here.
When you use these statistics to measure overrepresentation, be sure to select the reference file with the right
parameter settings. If the parameters of the reference data do not correspond to the ones chosen to find the instances
here, the results will be strongly biased.
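
One common way to express such a comparison is a binomial tail probability: given an expected per-position frequency from the reference set, how likely is it to see at least the observed number of instances by chance? The sketch below assumes this binomial model purely for illustration; the statistic Toucan actually reports may be computed differently, and all numbers are hypothetical.

```python
def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p).

    Computed as 1 minus the lower tail, using the recurrence
    pmf(i + 1) = pmf(i) * (n - i) / (i + 1) * p / (1 - p).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if k <= 0:
        return 1.0
    if k > n or p == 0.0:
        return 0.0
    if p == 1.0:
        return 1.0
    pmf = (1.0 - p) ** n          # probability of zero instances
    ratio = p / (1.0 - p)
    cdf = 0.0
    for i in range(k):            # accumulate P(X = 0) ... P(X = k - 1)
        cdf += pmf
        pmf *= (n - i) / (i + 1) * ratio
    return max(0.0, 1.0 - cdf)


# Hypothetical numbers: 12 instances observed in 4,000 scanned positions,
# with an expected frequency of 1 instance per 1,000 positions.
p_value = binomial_tail(12, 4_000, 0.001)
```

A small value indicates that the motif occurs more often than the reference frequency would suggest. The expected frequency is only meaningful if it was obtained with the same scanning parameters as the observed count.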

There are six motifs that are statistically overrepresented in this dataset.

Here you can see the selected overrepresented motifs.
|