Toucan Tutorial
This tutorial describes the analysis of putative regulatory regions in a set of liver-specific genes.
1. Retrieve sequences from Ensembl
- In the upper text field, enter gene names or any other gene identifiers. Choose a comma-separated list or an id-file (a text file with one column of ids), then the organism and the type of identifier.
- Choose the type of sequence you want to retrieve: the complete gene with flanking regions, the sequence upstream of the coding sequence (CDS), or the sequence upstream of Exon1, which in most cases is the transcription start site (TSS).
- Choose how many base pairs you want to retrieve, upstream (before) and downstream (within) of the CDS/TSS. For a complete gene, the "bp before" are the flanking bases.
- To also retrieve the corresponding sequence of an orthologous gene, pick the organism in the drop-down list.
- Choose whether to add the sequences to the current list or to create a new gene list.
- Press Enter. If you have chosen more than two genes, a progress bar appears.

- All genes that couldn't be found appear in a dialog window.

If none of your genes or none of the orthologs could be found, check whether you have selected the right organism and identifier type. If that doesn't help, either update your preferences (according to the current Ensembl databases), or check the Toucan web site to see if there is a new properties file (see supporting files).
- All genes are visualized with their features (by parsing the EMBL-formatted sequence files). In the left window all features can be seen, with their visualization color.
- Genes on the positive chromosomal strand have their exon and CDS features above the thin black horizontal line. Genes on the negative chromosomal strand have these features below this line
- Right click on a gene to remove it or reverse complement it
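
To illustrate what reverse complementing a gene does to its sequence and to the positions of its features, here is a minimal Python sketch. It is not Toucan's own code; the function names are hypothetical, and feature coordinates are assumed to be 1-based and inclusive.

```python
from typing import Tuple

# Complement table for DNA; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def revcomp_coords(start: int, end: int, seq_len: int) -> Tuple[int, int]:
    """Map a 1-based inclusive interval onto the reverse-complemented sequence.

    Assumption: coordinates are 1-based and inclusive.
    """
    return seq_len - end + 1, seq_len - start + 1


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(revcomp_coords(1, 3, 10))
```

After this operation, a CDS that was on the negative strand reads left to right like one on the positive strand, which makes sequences easier to compare.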
2. Align orthologs
- Choose Tools->Services->VISTA

- Select first and second sequence, and enter parameters
- Tip: it is helpful if you first reverse complement all sequences with the CDS on the negative strand
- After hitting OK, the sequences are sent to the server, AVID and VISTA programs are run, and the result is sent back in GFF format

- If you choose YES, the results are annotated directly on the active sequence set
- If you choose NO, you can save the results as a GFF file. This file can be annotated later using Annotation->GFF
- Do this for every pair and annotate the results
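
If you save the results as a GFF file, you can also inspect them outside Toucan. The sketch below reads the common nine-column, tab-separated GFF layout (seqname, source, feature, start, end, score, strand, frame, attributes). The sample line is invented for illustration; its source and feature names are assumptions, not actual VISTA output.

```python
from typing import Iterable, Iterator, NamedTuple


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    score: str      # kept as text because "." means "no score"
    strand: str
    frame: str
    attributes: str


def parse_gff(lines: Iterable[str]) -> Iterator[GffRecord]:
    """Yield one record per data line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a valid GFF data line
        cols = (cols + [""])[:9]  # the attributes column is optional
        yield GffRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                        cols[5], cols[6], cols[7], cols[8])


if __name__ == "__main__":
    # Hypothetical example line, not real service output.
    sample = ["geneA\tVISTA\tconserved_region\t120\t310\t82.5\t+\t."]
    for rec in parse_gff(sample):
        print(rec.seqname, rec.feature, rec.start, rec.end)
```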

3. Select regions
- Right click on a feature (e.g. a homologous region) and choose "cut", or hold CTRL and left click on the feature

- Choose the exact feature, and the number of base pairs to select to the left and right of the feature
- Alternatively, you can specify the exact bases to be cut
- Cut regions can be saved in a sublist, which can be seen in the top-left window
- This sublist can be saved with File->save sublist and later reloaded as a new gene list
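
The arithmetic behind cutting a feature with extra bases on either side can be sketched as follows. This is an illustration only, not Toucan's implementation: the function name is hypothetical, coordinates are assumed to be 1-based and inclusive, and flanks that run past the sequence ends are assumed to be clipped.

```python
from typing import Tuple


def cut_region(seq: str, start: int, end: int,
               left: int = 0, right: int = 0) -> Tuple[str, int, int]:
    """Cut a feature plus flanking bases out of a sequence.

    start/end are the 1-based inclusive feature coordinates; left/right are
    the numbers of extra bases to include. Returns the cut sequence together
    with the 1-based coordinates that were actually used.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    new_start = max(1, start - left)          # clip at the sequence start
    new_end = min(len(seq), end + right)      # clip at the sequence end
    return seq[new_start - 1:new_end], new_start, new_end


if __name__ == "__main__":
    print(cut_region("AAACCCGGGTTT", 4, 6, left=2, right=2))
```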

4. Score Position Weight Matrices
- Choose Tools->Services->MotifScanner

- Select a database of motifs and a background model, either from a local file or by choosing "GET", which returns a list of the PWM databases and background models that are available on our servers.
- The results are returned as GFF (like VISTA results). Annotate these on your sequences simply by choosing "YES"

- To view only some of the features, select them in the list on the left and press Enter; only the selected features remain visible
- If you want to unhide a feature, right click on it in the list, and choose "Show"
5. Statistics
- Choose Tools->Statistics

- Select a file with frequencies by pressing the second Browse button
- Frequency files can be downloaded from our web site
- Hit the Start button
- The binomial formula is used to calculate p-values and significance scores

- The motifs are sorted by descending sig-value, so the most significant motifs appear first
- Now choose some significant motifs, and view only these on your annotated gene set (see above). Try to detect modules
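
The binomial calculation mentioned above can be sketched as an upper-tail probability: the chance of seeing at least the observed number of motif hits if hits occurred at the expected frequency. The sketch makes several assumptions that this tutorial does not spell out: n is taken as the number of scanned positions, p as the per-position frequency from the frequency file, and motifs are ranked by ascending p-value because the exact definition of the sig-value is not given here. Motif names and numbers in the example are invented.

```python
import math
from typing import Dict, List, Tuple


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1-p)^(n-i); lgamma avoids huge integers
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def rank_motifs(observed: Dict[str, int], expected_freq: Dict[str, float],
                n: int) -> List[Tuple[str, float]]:
    """Rank motifs by ascending p-value (most over-represented first)."""
    scored = [(name, binomial_tail(count, n, expected_freq[name]))
              for name, count in observed.items()]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical counts and frequencies, for illustration only.
    counts = {"motifA": 5, "motifB": 1}
    freqs = {"motifA": 0.001, "motifB": 0.001}
    for name, pval in rank_motifs(counts, freqs, n=1000):
        print(name, pval)
```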