   
Promoter Sequence Retrieval from Ensembl

One of the features of TOUCAN is the automated batch extraction of promoter sequences from the Ensembl database (Birney et al., 2004). As an example, we retrieve the promoter sequences of the 53 genes that make up K-means cluster 1, as seen in the section "cluster analysis" of the page "Start at bench". From the TOUCAN menu, choose "Get_Seq", "From Ensembl", "New/Add". In the next window you can either paste a comma-separated list of IDs or browse to a local file containing a column of IDs. Many different ID types are accepted, including LocusLink, RefSeq, HUGO, InterPro, Affymetrix, Ensembl, and many more. In our example, we use a list of human LocusLink IDs, separated by commas. The process is supported not only for human IDs but also for many other species, such as mouse, rat, chicken, zebrafish, or fruit fly.
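If your IDs are stored as a column (one per line) and you prefer to paste them, the conversion to a comma-separated list can be sketched in a few lines of Python. This is only an illustration and not part of TOUCAN; the function name is hypothetical, and apart from 24147 the IDs below are placeholders, not the actual members of cluster 1.

```python
def ids_to_comma_list(lines):
    """Turn a column of IDs (one per line) into a comma-separated string.

    Blank lines are skipped, surrounding whitespace is stripped, and
    duplicate IDs are dropped while keeping the original order.
    """
    seen = set()
    ids = []
    for line in lines:
        gene_id = line.strip()
        if not gene_id or gene_id in seen:
            continue
        seen.add(gene_id)
        ids.append(gene_id)
    return ",".join(ids)


# Placeholder column of LocusLink IDs (not the real cluster 1 list).
column = ["24147", "", " 1017 ", "24147", "7157"]
print(ids_to_comma_list(column))
```

To read the column from a file instead, pass an open file object, e.g. `ids_to_comma_list(open("ids.txt"))`, where `ids.txt` is a hypothetical file name.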
Next we specify the sequence region to retrieve, which is not a trivial task for eukaryotic promoters. As a first approximation, we choose 1 kb upstream of the first exon and 0.2 kb downstream, meaning that we also extract the first 200 bp of the transcribed region (first exon) of the gene. This region is included because important regulatory elements may be located in the 5'-UTR.
Finally, version 2 of TOUCAN makes it convenient to retrieve the orthologous sequences of multiple other species in the same run via the field "Add multiple orthologs". For simplicity, we skip this option in this tutorial.
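To make the choice of region concrete, the following sketch computes the genomic window for a "1 kb upstream, 0.2 kb downstream" setting. It is not how TOUCAN computes the region internally; it assumes the start of the first exon is given as a single coordinate (called `tss` here), ignores off-by-one conventions between databases, and uses a made-up position in the example.

```python
def promoter_window(tss, strand, upstream=1000, downstream=200):
    """Return (start, end) genomic coordinates, with start <= end.

    'Upstream' is relative to the direction of transcription, so on the
    reverse strand it lies at higher genomic coordinates than the TSS.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


# Made-up start position of a first exon, once per strand.
print(promoter_window(50_000, "+"))  # (49000, 50200)
print(promoter_window(50_000, "-"))  # (49800, 51000)
```

The two calls show why strand matters: the same setting selects different genomic intervals depending on the orientation of the gene.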
              
Get Sequences
             
A progress bar indicates the status of the sequence download. In some cases, a dialog window appears that lists the IDs of genes that could not be found in the database. In our example, this is the case for only one of the 53 genes (ID 24147).
               
Get Sequences Error ID
                      
If you wish, you can retrieve this promoter sequence manually from a genome browser such as the UCSC Genome Browser (Kent et al., 2002) and download the region of interest as a FASTA-formatted sequence file. A detailed description is beyond the scope of this tutorial. Briefly, first query the Entrez Gene database at NCBI with this accession number ("24147"), which returns the entry for the gene FJX1. Then choose "UCSC" from the dropdown list that appears when you click the "Links" item at the top of the page. This link shows the genomic region covered by FJX1 in the UCSC Genome Browser. At the top of that page you will see the genomic coordinates (chr11:35,596,631-35,598,988) of the region presented in the main window. Small arrowheads indicate the orientation of the gene, here on the forward strand ("from left to right"). Now type in the coordinates of the region you want to retrieve, e.g. 1 kb upstream and 0.2 kb downstream of the transcription start site (chr11:35,595,631-35,596,831). You can display this sequence in FASTA format using the "DNA" link at the top of the page, which also provides extended formatting options (such as reverse-complementing the sequence when the gene of interest is located on the reverse strand). Then copy and paste the sequence into a text editor and save the file in *.txt format. Finally, you can add the sequence in TOUCAN to an existing sequence set via the command "File", "Add Seq".
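The coordinate arithmetic for FJX1 can be checked with a short script. The sketch below parses a position string in the "chrom:start-end" form shown by the browser, shifts the gene start by 1 kb upstream and 0.2 kb downstream, and prints the result in the same form. The helper names are hypothetical, and the code assumes a forward-strand gene, so the gene start is taken as the transcription start site.

```python
import re


def parse_position(position):
    """Split 'chr11:35,596,631-35,598,988' into (chrom, start, end)."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"unrecognised position: {position!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def format_position(chrom, start, end):
    """Format coordinates with thousands separators, as in the browser."""
    return f"{chrom}:{start:,}-{end:,}"


chrom, gene_start, gene_end = parse_position("chr11:35,596,631-35,598,988")
# Forward strand: upstream means lower coordinates than the gene start.
window = format_position(chrom, gene_start - 1000, gene_start + 200)
print(window)  # chr11:35,595,631-35,596,831
```

For a gene on the reverse strand, the gene end would serve as the reference point and the two offsets would swap sides.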
   
In the main TOUCAN window ("Sequence Set"), each line represents one gene / promoter. To display the complete sequence regions within this window, use the "View", "Zoom Out" commands. All genes are visualized with their features (exon, 5'-UTR, CDS, gene) displayed as open rectangles. The left window ("Feature List") shows all features together with their display colors. Individual features are easy to select or de-select, either by highlighting them in the "Feature List" and pressing the "Enter" key, or by right-clicking a feature and choosing "Show" / "Don't show". Initially, genes are displayed in the orientation corresponding to their location in the genome. It is recommended to reverse-complement those genes whose coding sequence is located on the reverse strand (exon boxes below the black line) by right-clicking a gene and selecting "RevCompl" from the context menu. In version 2 of TOUCAN, you can conveniently reverse-complement all sequences that lie on the reverse strand at once via "Tools", "RevCompl Negatives".

Reverse Seqs
                  
This leads to a "unified" sequence set.

Seq revcompl
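Conceptually, unifying the orientation means reverse-complementing every sequence that lies on the reverse strand and leaving the others untouched. The sketch below illustrates that idea outside TOUCAN; it is not TOUCAN's implementation, and the record layout (name, strand, sequence) and the example sequences are assumptions made for illustration.

```python
# Translation table for DNA bases; N stays N, case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def unify(records):
    """Put all (name, strand, sequence) records on the forward strand."""
    unified = []
    for name, strand, seq in records:
        if strand == "-":
            seq = reverse_complement(seq)
        unified.append((name, "+", seq))
    return unified


# Made-up records: one forward-strand and one reverse-strand sequence.
records = [("geneA", "+", "AACG"), ("geneB", "-", "AACG")]
print(unify(records))
# [('geneA', '+', 'AACG'), ('geneB', '+', 'CGTT')]
```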
                 
File Saving and Export
     
Using "File", "Save List" you can save the active sequence set in various formats. If you choose the ".embl" format, annotations such as exon positions and features such as TF sites from MotifScanner (see later) are saved along with the sequences and can be recovered in TOUCAN later using "File", "Load Seq". Note that when a sequence set is re-opened, the displayed IDs are always Ensembl gene IDs, regardless of the ID format originally used to retrieve the sequences. You can also save the sequence set in FASTA format.
The command "File", "Export" provides several options for data export. "Export without Ns" generates a FASTA file of all sequences. "Export Figure" lets you save the active figure as a JPEG or PNG image. "Export Frequencies" creates a tab-delimited *.txt file with the names and frequencies of the current features. "Export GFF" generates a GFF file, which is easily opened in e.g. MS Excel and represents all features in table format (including, for example, the sequences of TF binding sites). "File", "Export Matrix" generates a *.txt file with all features in columns and all sequences in rows. "File", "Export Separate FASTA" generates a folder containing all sequences individually in FASTA format.
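Besides a spreadsheet program, an exported GFF file can also be read with a short script. The sketch below assumes the common nine-column, tab-delimited GFF layout (seqname, source, feature, start, end, score, strand, frame, attribute); the exact content TOUCAN writes into each column is not described here, and the sample line is made up rather than real TOUCAN output.

```python
import io

GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def read_gff(handle):
    """Parse tab-delimited GFF lines into a list of dictionaries."""
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != len(GFF_COLUMNS):
            raise ValueError(f"expected 9 columns, got {len(fields)}")
        row = dict(zip(GFF_COLUMNS, fields))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


# Made-up example content, not real TOUCAN output.
SAMPLE = (
    "# hypothetical feature line\n"
    "seq1\tMotifScanner\tmisc_feature\t120\t131\t0.9\t+\t.\tsite ACGTACGTACGT\n"
)

for row in read_gff(io.StringIO(SAMPLE)):
    print(row["seqname"], row["feature"], row["start"], row["end"])
```

For a real export, replace the `io.StringIO(SAMPLE)` argument with an open file, e.g. `open("features.gff")`, where the file name is hypothetical.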

