MotifScanner - Search for Transcription Factor Binding Sites
We now want to scan our dataset of 53 human promoter sequences for the presence of transcription factor binding sites (TFBS). MotifScanner is one of the programs integrated in TOUCAN that can be used for this purpose. Choose "Motifs", then "MotifScanner", which brings up a window like the following one:
Essentially, you have to make three selections here.
1. "PWM database" is the database of transcription factor binding site profiles you want to use. For human sequences, for example, you may choose "TRANSFAC 6.0 public - Vertebrates" (Wingender et al., 2001) or the independent JASPAR database (Sandelin et al., 2004). For comparison, it can be useful to perform several runs using different PWM databases. A PWM (position weight matrix) represents a TFBS as a matrix that gives the experimentally determined frequency of each of the four nucleotides at each position. The last column gives the deduced consensus in IUPAC code. The following matrix shows the binding profile of the factor NF-kappaB (p50) as an illustrative example, taken and modified from the TRANSFAC 6.0 public database. Note that IUPAC letters are used at ambiguous positions (such as Y, which stands for C or T).
AC M00051
XX
ID V$NFKAPPAB50_01
XX
DE NF-kappaB (p50)
XX
BF T00593 NF-kappaB1; Species: human, Homo sapiens.
XX
PO A C G T
01 0 0 18 0 G
02 0 0 18 0 G
03 0 0 18 0 G
04 2 0 16 0 G
05 16 1 0 1 A
06 0 0 3 15 T
07 0 7 1 10 Y
08 0 16 0 2 C
09 0 18 0 0 C
10 0 17 1 0 C
2. The "Background Model" describes the average promoter composition within the species of interest, against which the sequences in the active sequence set are compared. If you are scanning human promoter sequences, you should use a background model calculated from human promoter sequences as well, such as "EPD Human", which is based on human promoter sequences stored in the Eukaryotic Promoter Database, EPD (Schmid et al., 2004). "Background Model" lists Markov models of different orders; 3rd order models are fine in most cases. In a 1st order background model, the genomic frequencies are calculated for each dinucleotide (AA, AT, etc.), so the 1 bp preceding the bp being scored is taken into account when that bp is scored with the background model and the matrix model. In a 2nd order background model, the background score of a nucleotide is the frequency of the trinucleotide that ends with it (e.g. AAT if T is being scored).
3. The "Prior" value sets the stringency level that determines whether a sequence motif is accepted as corresponding to a TFBS consensus. The higher the "prior", the more instances of each motif will be found, so a lower "prior" (such as 0.1) is more stringent than a higher one (such as 0.9). In general, the size of the sequences analyzed should also be taken into consideration, as reflected in the following example "prior" values: 0.1-0.2 for sequences smaller than 300 bp, 0.9 for sequences larger than 1500 bp.
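To make the roles of the PWM (item 1) and the Markov background model (item 2) more concrete, the following Python sketch scores a candidate site as a log-odds ratio between the matrix model and a background model of a chosen order. It only illustrates the general idea and is not MotifScanner's actual implementation. The count matrix is the NF-kappaB (p50) matrix shown above; the pseudocount of 0.5, the add-one smoothing, the uniform fallback for unseen contexts, the toy training sequences and all function names are assumptions made for this example.

```python
import math
from collections import Counter

BASES = "ACGT"

# Counts for A, C, G, T at positions 01-10 of the NF-kappaB (p50) matrix above.
COUNTS = [
    (0, 0, 18, 0), (0, 0, 18, 0), (0, 0, 18, 0), (2, 0, 16, 0),
    (16, 1, 0, 1), (0, 0, 3, 15), (0, 7, 1, 10), (0, 16, 0, 2),
    (0, 18, 0, 0), (0, 17, 1, 0),
]


def pwm_probabilities(counts, pseudocount=0.5):
    """Convert counts to per-position probabilities (pseudocount is an assumption)."""
    rows = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        rows.append({b: (c + pseudocount) / total for b, c in zip(BASES, row)})
    return rows


def train_background(sequences, order):
    """Estimate P(base | preceding `order` bases) from training sequences."""
    context_counts = {}
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            context_counts.setdefault(context, Counter())[seq[i]] += 1
    model = {}
    for context, counter in context_counts.items():
        total = sum(counter.values()) + 4  # add-one smoothing (assumption)
        model[context] = {b: (counter[b] + 1) / total for b in BASES}
    return model


def background_prob(model, context, base):
    """Look up P(base | context); unseen contexts fall back to uniform 0.25."""
    return model.get(context, {}).get(base, 0.25)


def log_odds(site_start, seq, pwm, model, order):
    """Sum of log(P_matrix / P_background) over all positions of the site."""
    if site_start < order:
        raise ValueError("need `order` bases of context before the site")
    score = 0.0
    for offset, probs in enumerate(pwm):
        i = site_start + offset
        base = seq[i]
        context = seq[i - order:i]
        score += math.log(probs[base] / background_prob(model, context, base))
    return score


# Toy training data (invented); a real model would be built from many promoters.
TRAINING = ["AATAATAAT", "ACGTTGCAAATGCCGTAAT"]
pwm = pwm_probabilities(COUNTS)
background = train_background(TRAINING, order=2)
candidate = "TT" + "GGGGATTCCC" + "TT"
print(log_odds(2, candidate, pwm, background, order=2))
```

With `order=2`, the background probability of a T preceded by AA is estimated from how often the trinucleotide AAT occurs relative to all trinucleotides starting with AA, which mirrors the 2nd order description in item 2.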
Results are returned in GFF format. To annotate them on your sequences, simply choose "YES".
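If you want to inspect GFF results outside TOUCAN, a minimal Python sketch such as the following can read them. It relies only on the general GFF layout of tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, attributes). The example lines, the attribute text and the function names are invented for illustration and are not real MotifScanner output.

```python
from collections import defaultdict, namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff(text):
    """Parse tab-separated GFF lines into GffRecord tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete GFF record
        attributes = fields[8] if len(fields) > 8 else ""
        score = None if fields[5] == "." else float(fields[5])
        records.append(GffRecord(fields[0], fields[1], fields[2],
                                 int(fields[3]), int(fields[4]), score,
                                 fields[6], fields[7], attributes))
    return records


def hits_per_sequence(records):
    """Group records by the sequence they were found on."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.seqname].append(rec)
    return dict(grouped)


# Invented example lines (hypothetical, NOT real MotifScanner output).
SAMPLE = "\n".join([
    "# hypothetical example",
    "\t".join(["promoter_1", "MotifScanner", "misc_feature",
               "120", "129", "0.93", "+", ".", 'id "V$NFKAPPAB50_01"']),
    "\t".join(["promoter_1", "MotifScanner", "misc_feature",
               "310", "319", "0.71", "-", ".", 'id "V$NFKAPPAB50_01"']),
    "\t".join(["promoter_2", "MotifScanner", "misc_feature",
               "45", "54", ".", "+", ".", 'id "V$NFKAPPAB50_01"']),
])

for name, hits in hits_per_sequence(parse_gff(SAMPLE)).items():
    print(name, [(h.start, h.end, h.strand) for h in hits])
```

The attributes column is kept as raw text here, because its exact layout depends on the program that wrote the file.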
The output is a color-coded list of TFBS in the "Feature list" on the left side of the TOUCAN window, and a visualization of these sites along the input sequences in the main window ("Sequence set"). Within the "Feature list", you can right-click on individual TFs to show, hide, or re-color them, or you can visualize features simply by selecting them and hitting the "Enter" key (this also works with CDS, exons, ...). Alternatively, you may click on individual boxes (TFBS) in the main window to display their properties in the "Feature Info" window, as shown for the factor NF-kappaB (p50) in the following image (red). If you want to know which gene / promoter you are analyzing at the moment, simply click into the region of the first exon, which reveals this information in the "Feature Info" window (cyan).