Statistics - Find over-represented Features

After running a MotifScanner analysis, we want to know which of the predicted features (TFBS) are statistically over-represented, that is, which TFBS might be biologically important for regulating gene expression within our cluster of co-expressed genes. To find out, choose "Motifs", "Stat. Over-rep." from the TOUCAN menu. You have to provide a local reference file (in the field "Expected Frequencies File") that was calculated from a previous analysis of promoters of the same species stored in databases; comparing it with your own dataset allows the significance of individual TF sites to be estimated. You can create your own reference file from a different promoter dataset of the same species using "File", "Export Frequencies", but it may be advantageous to download a frequencies file from the TOUCAN FTP site, for example "epd_homo-sapiens_prior0.1.freq" for human promoters, which is based on the Eukaryotic Promoter Database, EPD (Schmid et al., 2004). Save this file to your local machine and select it in the field "Expected Frequencies File". Make sure that you download the file with the same prior that you used for the MotifScanner run! After you hit the "Start" button, the result is displayed quickly.

The output is a table showing three values for each TFBS. The motifs are sorted by descending sig value. "n" denotes the number of times this feature (TF site) appears in your sequence set. Note that this is not the number of promoters containing this TFBS, because a feature may appear more than once in one sequence. In other words, TOUCAN assesses over-representation of a feature by how often it occurs in a given number of base pairs, not by how many sequences contain it. The "p" value indicates the probability of finding even more occurrences than n in this number of base pairs. When only one feature is analyzed, a p-value smaller than 0.05 can be taken to indicate over-representation. When multiple features are analyzed, it is better to use the "sig" (significance) value. One expects to find at random one pattern with sig >= 1 every 10 families, one with sig >= 2 every 100 families, and one with sig >= s every 10^s families. In general, negative sig values mean not significant. Note that you can easily copy and paste the output table into programs such as MS Excel.
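To make the relation between "n", "p" and "sig" more concrete, here is a minimal Python sketch. It is not TOUCAN's actual code. It assumes a binomial model in which each base pair is an independent trial with the expected per-base-pair frequency taken from the reference file, and it assumes that sig = -log10(p x number of features tested), which is consistent with the "one pattern with sig >= s every 10^s families" interpretation above. It also computes the probability of n or more occurrences; whether TOUCAN includes n itself in the tail is not stated here. All function and parameter names are hypothetical.

```python
import math


def binom_log_pmf(k, trials, prob):
    """Natural log of P(X = k) for X ~ Binomial(trials, prob), with 0 < prob < 1."""
    return (math.lgamma(trials + 1) - math.lgamma(k + 1) - math.lgamma(trials - k + 1)
            + k * math.log(prob) + (trials - k) * math.log1p(-prob))


def upper_tail(n_obs, total_bp, expected_freq):
    """P(X >= n_obs): chance of seeing at least n_obs sites in total_bp base pairs.

    Assumption: binomial model, one trial per base pair (hypothetical, for illustration).
    """
    if n_obs <= 0:
        return 1.0
    mean = total_bp * expected_freq
    total = 0.0
    for k in range(n_obs, total_bp + 1):
        term = math.exp(binom_log_pmf(k, total_bp, expected_freq))
        total += term
        # Past the mean the terms only shrink; stop once they no longer matter.
        if k > mean and term < total * 1e-16:
            break
    return min(total, 1.0)


def sig_value(p_value, n_features_tested):
    """Assumed definition: sig = -log10(p * number of features tested)."""
    e_value = p_value * n_features_tested
    if e_value == 0.0:
        return math.inf
    return -math.log10(e_value)


if __name__ == "__main__":
    # Made-up numbers: 40 hits in 53 promoters of 2000 bp each,
    # expected frequency 1 site per 10,000 bp, 100 matrices tested.
    p = upper_tail(40, 53 * 2000, 1e-4)
    print("p =", p, "sig =", sig_value(p, 100))
```

Under these assumptions, a feature with p = 0.001 among 100 tested features gets sig = 1, while p = 0.05 among 100 features gets a negative sig and would not be considered significant, even though it passes the single-feature 0.05 threshold.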
     
[Image: Statistics output table]
    
In this case, we can clearly see that 4 of the 7 TF matrices predicted to be over-represented in our set of 53 promoters represent binding sites for the transcription factor NF-kappaB (de Martin et al., 2000). This finding is a very good positive control for our analysis, as it is well established that NF-kappaB is a central mediator of inflammatory processes such as the one induced by interleukin-1. The fact that binding sites for this factor are over-represented in our dataset strongly supports the validity of this approach. The top scorer in this list is Sp-1 (Stimulating protein 1), a well-known general activator within promoter sequences. Interestingly, another factor is predicted to be over-represented in this sequence set: SREBP-1 (Sterol regulatory element-binding protein 1), a transcriptional activator that acts through sterol regulatory elements (SRE).

Now you may want to display only the statistically over-represented TFs in the TOUCAN window, in order to analyze the distribution of these sites along the sequence set. To do this, first right-click on all features in the "Feature list" and choose "Don't show", then highlight the matrices you want to visualize and choose "Show". Alternatively, highlight all matrices of interest and hit the "Enter" key. The image below shows that NF-kappaB and Sp-1 are quite evenly distributed, whereas SREBP-1 is found only in a small subset of promoters in this cluster. Note that, for clarity, all NF-kappaB matrices are displayed in red. TOUCAN also makes it possible to identify regulatory "hot-spots" by eye, such as, in the example below, the region containing two clustered NF-kappaB sites together with one SREBP-1 site (red circle) within the promoter of the gene TNFAIP3. These "hot-spots" may be candidate sites for further functional characterization.
        
[Image: Statistics2_TopTFs, distribution of the over-represented TF sites along the sequence set]
          