Statistics - Find over-represented Features
After performing a MotifScanner analysis, we want to know which of the
predicted features (TFBS) are statistically over-represented, that is,
which TFBS might be biologically important for the regulation of gene
expression within our cluster of co-expressed genes. For this purpose,
choose "Motifs", "Stat. Over-rep." from the TOUCAN menu. You have to
provide a local "reference file" (in the field "Expected Frequencies
File"), calculated from a previous analysis of this species' promoters
stored in databases. Comparing it to your own dataset allows the
significance of single TF sites to be estimated. You may create your
own reference file, using "File", "Export Frequencies", from a
different promoter dataset of the same species, but it may be
advantageous to download a frequencies file from the TOUCAN FTP site.
For human promoters, for example, this is the file
"epd_homo-sapiens_prior0.1.freq", based on the Eukaryotic Promoter
Database, EPD (Schmid et al., 2004). Save this file to your local
machine and use it as input for the field "Expected Frequencies File".
Make sure that you download the file with the same prior you used for
the MotifScanner run! After you hit the "Start" button, the result is
displayed quickly.
The output is a table showing 3 values for each TFBS. The motifs are
sorted by descending sig value.
"n" denotes the number of times this feature (TF site) appears in your
sequence set. Note that this does not tell you the number of promoters
containing this TFBS, because a feature might appear more than once in
one sequence. This means that TOUCAN indicates over-representation of a
feature if it occurs in a certain number of base pairs, and not in a
certain number of sequences.
The "p" value indicates the probability of finding even more
occurrences than n in this number of base pairs. When analyzing only
one feature, a feature with a p-value smaller than 0.05 can be
considered over-represented. When analyzing multiple features, it is
better to use the "sig" (significance) value.
One expects to find at random one pattern with sig >= 1 every ten
families, one with sig >= 2 every 100 families, and one with sig >= s
every 10^s families. Generally, negative sig values mean that a feature
is not significant. Note that you can easily copy and paste the output
table into programs like MS EXCEL.
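The relation between "n", "p" and "sig" described above can be illustrated with a small Python sketch. It is not TOUCAN's implementation. It assumes a binomial model over base pairs for the p-value, and it assumes that sig is the negative base-10 logarithm of the p-value multiplied by the number of features tested, which is one definition consistent with the "one pattern with sig >= s every 10^s families" rule. All function names, the hit positions, the expected frequency and the number of features are hypothetical.

```python
import math


def count_occurrences(hits_per_sequence):
    """Return (n, sequences_with_hit) for one feature.

    hits_per_sequence maps a sequence name to the list of positions
    where the feature was predicted (hypothetical input format).
    """
    n = sum(len(positions) for positions in hits_per_sequence.values())
    sequences_with_hit = sum(1 for positions in hits_per_sequence.values() if positions)
    return n, sequences_with_hit


def binomial_upper_tail(n, total_bp, expected_freq):
    """P(X >= n) for X ~ Binomial(total_bp, expected_freq).

    Assumption: each base pair is an independent trial whose success
    probability is the expected frequency from the reference file.
    """
    if n <= 0:
        return 1.0
    log_f = math.log(expected_freq)
    log_not_f = math.log1p(-expected_freq)
    tail = 0.0
    for k in range(n, total_bp + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_comb = (math.lgamma(total_bp + 1)
                    - math.lgamma(k + 1)
                    - math.lgamma(total_bp - k + 1))
        tail += math.exp(log_comb + k * log_f + (total_bp - k) * log_not_f)
    return min(tail, 1.0)


def sig_value(p_value, n_features_tested):
    """Assumed definition: sig = -log10(p * number of features tested)."""
    expected_false_positives = p_value * n_features_tested
    if expected_false_positives == 0.0:
        return math.inf
    return -math.log10(expected_false_positives)


# Hypothetical example: one feature scanned in three promoters of 2000 bp each.
hits = {"promoter_A": [120, 340], "promoter_B": [], "promoter_C": [55]}
n, n_sequences = count_occurrences(hits)
p = binomial_upper_tail(n, total_bp=6000, expected_freq=1e-4)
sig = sig_value(p, n_features_tested=50)
print(n, n_sequences, p, sig)
```

In this toy input the feature occurs three times but in only two promoters, which is the difference between "n" and the number of sequences noted above. Under the assumed definition, a negative sig means that more than one such feature per analysis is expected by chance.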
In this case, we can clearly see that four of the seven TF matrices
predicted to be over-represented in our set of 53 promoters represent
binding sites for the transcription factor NF-kappaB (de Martin et al.,
2000). This finding is a very nice positive control for our analysis,
as it is well established that NF-kappaB is a central mediator of
inflammatory processes, like the one induced by interleukin-1. The fact
that binding sites for this factor are over-represented in our dataset
strongly supports the validity of this approach. The "top-scorer" in
this list is Sp-1 (Stimulating protein 1), a well-known general
activator within promoter sequences. Interestingly, another factor is
predicted to be over-represented in this sequence set: SREBP-1 (Sterol
regulatory element-binding protein 1), a transcriptional activator that
acts through sterol regulatory elements (SRE).
Now, you may want to display only the statistically over-represented
TFs in the TOUCAN window, in order to analyze the distribution of these
sites along the sequence set. For this purpose, first right-click on
all features in the "Feature list" and choose "Don't show", then
highlight the matrices you want to visualize and choose "Show".
Alternatively, highlight all matrices of interest and hit the "Enter"
key.
The image below shows that NF-kappaB and Sp-1 sites are quite "evenly"
distributed, whereas SREBP-1 sites are found only in a small subset of
promoters in this cluster. Note that, for clarity, all NF-kappaB
matrices are displayed in red. TOUCAN also lets you identify regulatory
"hot-spots" "by hand" ("by eye"), such as, in the example below, the
region containing two clustered NF-kappaB sites together with one
SREBP-1 site (red circle) within the promoter of the gene TNFAIP3.
These "hot-spots" may be candidates for further functional
characterization.