Toucan Tutorial 5

 
 

In this tutorial we guide you through the process of analysing the cis-regulatory logic of a set of human genes. The example data set consists of eight genes that were extracted from the ATLAS expression data. These genes are expressed in lung and in no other tissue. The identifiers (Affymetrix IDs) of the genes can be found here.



Sequence retrieval

1

First we are going to download the genes from Ensembl. Click the menu item Get Seq -> From Ensembl -> new/add.


2

A popup window will appear in which we can define the genes and the corresponding parameters to download the sequences. First we select and open the file with identifiers.


3

Next, we indicate that the identifiers are supplied in a file. The other option is to enter a list of comma-separated identifiers.
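
The two input options can be illustrated outside Toucan with a minimal Python sketch. It assumes, purely for illustration, that an identifier list may be separated by commas, whitespace or line breaks; the function name parse_identifiers and the placeholder identifiers are hypothetical and not part of Toucan.

```python
import re
from typing import List


def parse_identifiers(text: str) -> List[str]:
    """Split a block of text into identifiers.

    Assumption: identifiers are separated by commas and/or whitespace
    (including line breaks). Duplicates are dropped, first occurrence wins.
    """
    seen = {}
    for token in re.split(r"[,\s]+", text.strip()):
        if token:
            seen.setdefault(token, None)  # dicts keep insertion order
    return list(seen)


if __name__ == "__main__":
    # Placeholder identifiers, not the ones used in this tutorial.
    print(parse_identifiers("id_1, id_2,\nid_3, id_1"))
```

The same function handles the contents of a file read with open(path).read() and a comma-separated string typed by hand.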




4

The identifiers are of the type 'AFFY_HG_U95Av2'. Other possible identifier types include ensembl, embl, genbank, go, hugo, locuslink, interpro, pdb and many others. The mapping between these identifier types is taken from the Ensembl database.




5

Next, we can add the sequences of orthologous genes to our sequence set. In this example we choose only the mouse orthologs, but many other organisms are available. Again, the mapping is extracted directly from the Ensembl database.
The other parameter we change is the number of nucleotides we would like to select around the gene. Here we set this number to 10,000.




6

Once the data is submitted, the download process starts. You can monitor its progress in the status window.




7

When all sequences have been retrieved, they are displayed in the application window together with the annotated gene, CDS, exons, and five prime and three prime UTRs. Normally the orthologous genes are displayed directly beneath the selected human gene. The human genes are displayed with the identifier given in the file, but internally the Ensembl ID is used. You can see this Ensembl ID in the gene information subwindow when you click on the gene name.




8

Some genes are located on the minus strand. To ease the analysis and the visual interpretation, we reverse complement these sequences. In most cases this makes the resemblance between orthologs more obvious.
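
For readers who want to see what reverse complementing does to a sequence and to the coordinates of its annotations, here is a minimal Python sketch. It is an illustration only, not Toucan's own code; the function names and the 1-based inclusive coordinate convention are assumptions.

```python
from typing import Tuple

# Translation table for the four bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_feature(start: int, end: int, seq_length: int) -> Tuple[int, int]:
    """Map a feature interval onto the reverse-complemented sequence.

    Assumption: coordinates are 1-based and inclusive.
    """
    return seq_length - end + 1, seq_length - start + 1


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(flip_feature(1, 3, 10))
```

After the flip, a feature at the start of the original sequence sits at the end of the new one, which is why a minus-strand gene ends up reading left to right like its ortholog.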




Pairwise Alignment

9

When all sequences are correctly imported in Toucan, we start with the first step in the analysis of the cis-regulatory logic. We would like to define the regions that are conserved between the human and mouse orthologs. For this we can use one of the available pairwise alignment algorithms: AVID, LAGAN or BlastZ. To find these conserved regions in all human-mouse pairs at once, we choose Alignment -> all pairwise from the menu.




10

In the popup window we select all possible pairs. The suggested pairs are based upon the relation between the orthologous genes (this can also be seen in the gene information subwindow). For the moment, we choose AVID as the alignment algorithm and leave the other parameters at their default values. After submission, the data are sent to our server, where they are processed.




11

When the alignment results are ready, they are annotated as features on the respective sequences.




12

To better view the results we zoom in on the sequences and we also toggle the fill state of the conserved regions on one pair of orthologs. This picture nicely shows how several of the conserved regions correspond to the annotated exons. We also see four conserved regions upstream of the gene start. Those are the regions that will be of interest in the next analysis step.




13

Since AVID did not return any useful hits for a few pairs, we also try another program for pairwise alignment. We start LAGAN from the menu Alignment -> 2 seq -> LAGAN.




14

As the first sequence we select the human gene, and as the second the corresponding mouse ortholog. Remember that Toucan uses Ensembl IDs internally, while the original ID (AFFY in this example) is shown in the visualization window. The corresponding Ensembl ID can be found in the gene information window when you click the gene name.
The other parameters are left untouched.




15

The conserved regions found with LAGAN are annotated on the respective sequences.




Syntenic region selection

16

In the next step we would like to create a new dataset with all the conserved upstream regions. This can be achieved by right-clicking on such a region in the visualization window and choosing the cut option.




17

With the cut option, we can cut a specific part from the sequence and either add it to the sublist or let it replace the old sequence. First we select the appropriate feature (the conserved region) from the pull-down menu. If we were interested in both the human and mouse sequences, we would indicate that we want to cut all features of the same type on all other sequences. For the subsequent step, however, we only need the regions on the human sequence, so we select only this feature and set the number of nucleotides selected on the left and right sides to 0.
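
The effect of the cut option, including the left and right nucleotide settings, can be sketched in a few lines of Python. This is a hypothetical stand-alone illustration, not Toucan's implementation; the function name cut_region, the 1-based inclusive coordinates and the clipping at the sequence ends are all assumptions.

```python
def cut_region(seq: str, start: int, end: int, left: int = 0, right: int = 0) -> str:
    """Extract a feature from a sequence, optionally with flanking nucleotides.

    Assumptions: start and end are 1-based and inclusive; flanks that would
    run past either end of the sequence are clipped.
    """
    if start < 1 or end > len(seq) or start > end:
        raise ValueError("feature coordinates fall outside the sequence")
    if left < 0 or right < 0:
        raise ValueError("flank sizes must be non-negative")
    first = max(1, start - left)
    last = min(len(seq), end + right)
    return seq[first - 1:last]


if __name__ == "__main__":
    sequence = "AACCGGTT"
    print(cut_region(sequence, 3, 4))              # feature only, flanks of 0
    print(cut_region(sequence, 3, 4, left=1, right=5))  # right flank is clipped
```

With both flanks set to 0, as in this step, only the conserved region itself is returned.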




18

When done adding all individual regions to the sublist, we can save the sublist via the File -> Save sublist menu.




19

Since we only need the sequences for the subsequent steps, we save the syntenic regions in a FASTA file. To do so, we give our filename the extension .fasta or .tfa.
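
For reference, a FASTA file is plain text in which each sequence is introduced by a header line starting with '>'. The following Python sketch writes such a file; it is independent of Toucan, and the function name, the record names and the line width of 60 are illustrative choices.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_fasta(records: Iterable[Tuple[str, str]], handle: TextIO, width: int = 60) -> None:
    """Write (name, sequence) pairs in FASTA format, wrapping sequence lines."""
    if width < 1:
        raise ValueError("width must be at least 1")
    for name, seq in records:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    buffer = io.StringIO()
    # Hypothetical record name and sequence, for illustration only.
    write_fasta([("region_1", "ACGTAC")], buffer, width=4)
    print(buffer.getvalue(), end="")
```

Passing an open file instead of the in-memory buffer, for example open("regions.fasta", "w"), writes the same content to disk.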




Analysis of Syntenic Region

For the further analysis, start a new Toucan session.

20

First, load the saved FASTA file into Toucan.




22

We start MotifScanner to search for known motifs in this set of sequences.




23

As motif models we choose the vertebrate matrices from the public TRANSFAC repository.




24

As background model we choose the model built from a set of human-mouse syntenic regions.




25

The prior of finding one motif in a sequence is set to 0.2.




26

When the results are ready, a popup window appears and the results are temporarily displayed in the status window. There are two options for saving the results. The first is to annotate the retrieved features directly on the sequences, which is done by clicking 'yes'. When you select 'no', the retrieved features are saved to files instead: the first file contains the instances in GFF format and the other is a specific matrix file. When you select 'yes', the matrix file is not saved.
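
If you save the instances to a GFF file, they can be tallied outside Toucan. The Python sketch below counts instances per motif from tab-separated GFF lines. It assumes the standard GFF column layout and, as a guess that you should check against your own file, that the motif name sits in the third (feature) column; the column index is therefore a parameter, and the motif names in the example are placeholders.

```python
from collections import Counter
from typing import Iterable


def count_features(gff_lines: Iterable[str], column: int = 2) -> Counter:
    """Count GFF records per value of one column (0-based index).

    Assumption: lines are tab-separated with at least 8 fields, and the
    motif name is stored in the chosen column. Comment lines (starting
    with '#') and blank or malformed lines are skipped.
    """
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue
        counts[fields[column]] += 1
    return counts


if __name__ == "__main__":
    # Placeholder records, not real MotifScanner output.
    example = [
        "# comment line",
        "seq_1\tscanner\tmotif_A\t10\t20\t1.5\t+\t.\t",
        "seq_1\tscanner\tmotif_B\t40\t52\t2.0\t-\t.\t",
        "seq_2\tscanner\tmotif_A\t5\t15\t1.1\t+\t.\t",
    ]
    for name, n in count_features(example).most_common():
        print(name, n)
```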




27

In this example just click on 'yes' to annotate the computed features directly on the sequences.




28

To get a first impression of which motifs are overrepresented in this set of conserved regions, we will compute their statistical significance.




29

The statistical overrepresentation of a motif is measured by comparing its number of instances with the number expected on the basis of a reference set. As a reference, we use here the expected frequencies computed by screening a large set of syntenic regions with the TRANSFAC matrices. The file can be found here.
When you use these statistics to measure overrepresentation, make sure that you select the reference file with the right parameter settings. If the parameters of the reference data do not correspond to the ones chosen to find the instances here, the results will be strongly biased.
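
The idea of comparing an observed count with an expected frequency can be made concrete with a small Python sketch. The tutorial does not state which test is applied, so the binomial model used here is an assumption chosen for illustration: each of n scanned positions is treated as an independent trial with success probability p, the per-position frequency from the reference set. The function name and the example numbers are hypothetical.

```python
from math import exp, lgamma, log


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p).

    Assumed model (not confirmed by the tutorial): k is the observed number
    of motif instances, n the number of scanned positions, p the expected
    per-position frequency taken from the reference set.
    """
    if n < 0 or not 0.0 < p < 1.0:
        raise ValueError("need n >= 0 and 0 < p < 1")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for i in range(k, n + 1):
        # Log of the binomial probability mass, to avoid overflow for large n.
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += exp(log_pmf)
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical numbers: 8 instances in 5000 positions, reference frequency 0.0005.
    print(binomial_tail(8, 5000, 0.0005))
```

A small tail probability means the observed count is unlikely under the reference frequency. This also shows why mismatched parameter settings bias the result: p would then describe a different scan than the one that produced k.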




30

There are six motifs that are statistically overrepresented in this dataset.




31

Here you can see the selected overrepresented motifs.




 

© 2004. Katholieke Universiteit Leuven.
Last updated: 2004-11-05.