Promoter Sequence Retrieval from Ensembl
One of the features of TOUCAN is the automated batch extraction of
promoter sequences from the Ensembl database (Birney et al., 2004). As
an example, we want to retrieve the promoter sequences of the 53 genes
that make up K-means cluster 1, as seen in the section "cluster
analysis" of the page "Start at bench". From the TOUCAN menu, choose
"Get_Seq", "From Ensembl", "New/Add". In the next window you can either
paste a comma-separated ID list or browse to a local file containing a
column of IDs. Note that many different ID types are accepted, such as
LocusLink, RefSeq, HUGO, Interpro, Affymetrix, Ensembl, and many more.
In our example, we use a list of human LocusLink IDs, separated by
commas. This process is supported not only for human IDs but also for
many other species, such as mouse, rat, chicken, zebrafish, or fruit fly.
Then we specify the sequence region to retrieve, which is not a trivial
task for eukaryotic promoters. As a first approximation, we choose 1 kb
upstream of the first exon and 0.2 kb downstream, meaning that we also
extract the first 200 bp of the transcribed region (first exon) of the
gene. The reason is that important regulatory elements may be located
in 5'-UTR regions. Finally, it should be mentioned that in version 2 of
TOUCAN it is convenient to retrieve the orthologous sequences of
multiple other species in the same run via the field "Add multiple
orthologs". For simplicity, we skip this option in this tutorial.
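The comma-separated ID list used for the Ensembl query can be prepared
from a column of IDs with a few lines of Python. The following is only a
minimal sketch: the function name ids_to_comma_list is hypothetical, and
apart from 24147 (the ID mentioned in this tutorial) the IDs are
arbitrary placeholders, not the actual cluster 1 list.

```python
def ids_to_comma_list(column_text: str) -> str:
    """Turn a column of IDs (one per line) into a comma-separated string.

    Blank lines and surrounding whitespace are dropped, and duplicate IDs
    are removed while keeping the original order.
    """
    seen = set()
    ids = []
    for line in column_text.splitlines():
        gene_id = line.strip()
        if not gene_id or gene_id in seen:
            continue
        seen.add(gene_id)
        ids.append(gene_id)
    return ",".join(ids)


# Placeholder column: only 24147 comes from the tutorial text.
example_column = "24147\n\n 7157 \n24147\n672\n"
print(ids_to_comma_list(example_column))
```

The resulting string can be pasted into the ID field; alternatively, the
column file itself can be selected via the browse option.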
A progress bar indicates the status of the sequence download. In some
cases, a dialog window appears that lists the IDs of genes which could
not be found in the database. In our example, this is the case for only
one of the 53 genes (ID 24147).
If you wish, you can manually retrieve this promoter sequence from a
genome browser such as the UCSC Browser (Kent et al., 2002) and download
the region of interest as a FASTA-formatted sequence file. A detailed
description is beyond the scope of this tutorial. Briefly, first query
the Entrez Gene database of NCBI with this accession number ("24147"),
which returns the entry for the gene FJX1. Then choose "UCSC" from the
drop-down list that is displayed when you click the "Links" item at the
top of this page. This link shows the genomic region covered by the gene
FJX1 in the UCSC Genome Browser. At the top of this page, you will see
the genomic coordinates (chr11:35,596,631-35,598,988) of the region
presented in the main window. Small arrowheads indicate the orientation
of the gene, here on the forward strand ("from left to right"). Now type
in the coordinates you want to retrieve, e.g. 1 kb upstream and 0.2 kb
downstream of the transcription start site
(chr11:35,595,631-35,596,831). You can display this sequence in FASTA
format using the "DNA" link at the top of the page, which also provides
extended formatting options (such as reverse-complementing sequences
where the gene of interest is located on the reverse strand). Then copy
and paste the sequence into a word-processing program and save the file
in plain-text (*.txt) format. Finally, you can add the sequence to an
existing sequence set in TOUCAN via the command "File", "Add Seq".
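The coordinate arithmetic used for FJX1 above can be written down as a
small Python helper. This is a sketch under stated assumptions, not part
of TOUCAN or the UCSC Browser: the function name promoter_window and its
parameters are hypothetical, and for a gene on the reverse strand the
transcription start site is assumed to be the larger of the two gene
coordinates.

```python
def promoter_window(chrom, tss, strand, upstream=1000, downstream=200):
    """Return a browser-style position string around a transcription start site.

    "Upstream" is relative to the direction of transcription, so the window
    is mirrored for genes on the reverse strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 1)  # do not run past the chromosome start
    return f"{chrom}:{start:,}-{end:,}"


# FJX1: forward strand, transcription start at the first gene coordinate.
print(promoter_window("chr11", 35_596_631, "+"))
```

For the FJX1 values this reproduces the region typed into the browser
above (chr11:35,595,631-35,596,831).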
In the main TOUCAN window ("Sequence Set"), each line represents one
gene / promoter. First, you may want to display the complete sequence
regions within this window, which can be achieved via the "View", "Zoom
Out" commands. All genes are visualized with their features (exon,
5'-UTR, CDS, gene) displayed as open rectangles. The left window
("Feature List") lists all features together with their visualization
colors. Individual features are easily selected or de-selected, either
by highlighting them in the "Feature List" and pressing the "Enter" key,
or by right-clicking a feature and choosing "Show" / "Don't show".
Initially, genes are displayed in the orientation corresponding to their
location in the genome. It is recommended to reverse-complement those
genes whose coding sequence is located on the reverse strand (exon boxes
below the black line) by right-clicking a gene and selecting "RevCompl"
from the context menu. Note that in version 2 of TOUCAN, all sequences
that lie on the reverse strand can be reverse-complemented at once via
"Tools", "RevCompl Negatives". This leads to a "unified" sequence set.
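To make clear what reverse-complementing does to a sequence, here is a
minimal Python sketch. It is an illustration only and not TOUCAN's
implementation; the names reverse_complement and unify, and the example
sequences, are hypothetical.

```python
# Translation table for the four bases plus N, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def unify(records):
    """Reverse-complement only the sequences annotated on the '-' strand.

    `records` maps a name to a (sequence, strand) pair.
    """
    return {
        name: reverse_complement(seq) if strand == "-" else seq
        for name, (seq, strand) in records.items()
    }


example = {"geneA": ("ATGCCN", "-"), "geneB": ("ATGCCN", "+")}
print(unify(example))
```

After this step every sequence reads 5' to 3' in the direction of
transcription, which is what the "unified" set refers to.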
File Saving and Export
Using "File", "Save List" you can save the active sequence set in
various formats. When you choose the ".embl" format, annotations such as
exon positions and features such as TF sites from MotifScanner (see
later) are saved along with the sequences and can be recovered in TOUCAN
at a later time using "File", "Load Seq". Note that when a sequence set
is re-opened, the indicated IDs are always Ensembl GeneIDs, regardless
of the ID format that was initially used to retrieve the sequences. You
may also save the sequence set in FASTA format.
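A sequence set saved in FASTA format can be checked outside TOUCAN, for
instance to confirm that the expected number of sequences was written.
The reader below is a minimal sketch; the function name read_fasta and
the example records are hypothetical, and the exact header layout that
TOUCAN writes is not assumed beyond the standard ">" line.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the identifier.
            header = line[1:].split()
            current = header[0] if header else f"unnamed_{len(records) + 1}"
            records[current] = []
        elif current is not None:
            records[current].append(line)
    # Join the wrapped sequence lines of each record.
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical two-record example, not real TOUCAN output.
example = """>geneA hypothetical promoter
ACGTNNACGT
ACGT
>geneB
TTGACA
"""
sequences = read_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

For a real file, the text would be read with open(...).read() and
len(sequences) compared with the number of genes in the set.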
The command "File", "Export" provides several options for data export.
"Export without Ns" generates a FASTA file of all sequences. "Export
Figure" lets you save the active figure as a JPEG or PNG image. "Export
Frequencies" creates a tab-delimited *.txt file with the names and
frequencies of the current features. "Export GFF" generates a GFF file,
a tab-delimited format that is easily opened in e.g. MS Excel and
represents all features in table form (including e.g. the sequences of
TF binding sites). "File", "Export Matrix" generates a *.txt file with
all features in columns and all sequences in rows. "File", "Export
Separate FASTA" generates a folder containing each sequence as an
individual FASTA file.
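Besides a spreadsheet program, a GFF export can also be read with a
short script. The sketch below assumes the standard nine tab-separated
GFF columns; which values TOUCAN actually writes into each column is not
described here, so the column names, the function name read_gff and the
example lines are assumptions for illustration only.

```python
import csv
import io
from collections import Counter

# Standard GFF column layout (assumed to apply to the exported file).
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def read_gff(text):
    """Parse GFF text into a list of dicts, skipping comments and short lines."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) < 8:
            continue  # not a complete feature line
        fields = (fields + [""])[:9]  # the attribute column may be missing
        row = dict(zip(GFF_COLUMNS, fields))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


# Hypothetical lines, not real TOUCAN output.
example = (
    "# hypothetical export\n"
    "geneA\tMotifScanner\tmisc_feature\t120\t131\t0.92\t+\t.\tid \"M_example_1\"\n"
    "geneA\tEnsembl\texon\t1001\t1200\t.\t+\t.\t\n"
    "geneB\tMotifScanner\tmisc_feature\t45\t56\t0.88\t-\t.\tid \"M_example_1\"\n"
)
features = read_gff(example)
per_sequence = Counter(row["seqname"] for row in features)
print(per_sequence)
```

Counting rows per sequence in this way gives a quick overview of how
many features were annotated on each promoter.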