TOUCAN FAQ
Where can I find expected frequency files (for the statistical analysis), and how were they made?
|
|
|
The statistical analysis that finds over-represented motifs in a set of sequences uses the binomial formula, as explained in this paper: http://nar.oupjournals.org/cgi/content/abstract/31/6/1753. Essentially, the observed frequency of each PWM instance in the set (the number of instances, as predicted with MotifScanner for example, divided by the total number of base pairs in the set) is compared with the expected frequency of the same motif. This expected frequency has to be calculated from a reference set, which can be a user-defined set (see next question) or a large collection of randomly selected sequences (or all sequences) from the genome of the same species. We have calculated expected frequencies for large sets of promoter sequences and of conserved non-coding regions (human-mouse, 10 kb upstream, 75% identity in 100 bp), scored with the TRANSFAC library of PWMs. This means that you can only use them for a statistical analysis if you have scored your sequence set with the TRANSFAC library (this can be chosen in the MotifScanner dialog within TOUCAN). They can be downloaded here (see also the README.txt file in this directory): ftp://ftp.esat.kuleuven.ac.be/pub/sista/aerts/software/freqfiles
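To make the comparison of observed and expected frequencies concrete, here is a minimal Python sketch of a binomial over-representation test. It is an illustration only, not TOUCAN's code. It assumes that the p-value is the upper tail P(X >= k) of a binomial distribution with n equal to the total number of base pairs and p equal to the expected frequency; the paper linked above is the authoritative description. The function names `observed_frequency` and `binomial_tail` are hypothetical.

```python
import math


def observed_frequency(instances, total_bp):
    """Number of predicted motif instances divided by the total base pairs."""
    return instances / total_bp


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability.

    Assumption for this sketch: k is the observed number of instances,
    n the total number of base pairs, p the expected frequency.
    """
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i)
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


if __name__ == "__main__":
    # Made-up numbers: 12 instances in 5000 bp, expected frequency 0.001.
    print(observed_frequency(12, 5000))
    print(binomial_tail(12, 5000, 0.001))
```

A small tail probability suggests that the motif occurs more often in the set than the reference frequency would predict.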
|
|
|
|
First, score your sequence set with a library of PWMs (e.g., JASPAR or TRANSFAC), using MotifScanner or MotifLocator. Then export the expected frequencies to a file (File - Export menu). You can point to this file when you perform the statistical analysis. If you wish to use a very large reference set (e.g., all human promoters retrieved from EnsMart), it is best to download the MotifScanner algorithm (here: http://www.esat.kuleuven.ac.be/~thijs/download.html) and score the sequences locally, because TOUCAN will run out of memory when trying to visualize the result. The resulting GFF file can be transformed into an expected frequency file using this Perl script: GFF2Freq.pl (provide the FASTA file and the GFF file as command line arguments).
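The idea behind such a conversion can be sketched in Python as follows. This is not GFF2Freq.pl and does not reproduce its output format. It only shows the calculation described above: count the instances of each motif in the GFF file and divide by the total number of base pairs in the FASTA file. The function names are hypothetical. The sketch also assumes that the motif name sits in the third tab-separated GFF column; adjust `name_column` if your MotifScanner output stores it elsewhere.

```python
from collections import Counter


def total_base_pairs(fasta_lines):
    """Sum the lengths of all sequence lines, skipping '>' header lines."""
    return sum(len(line.strip()) for line in fasta_lines
               if line.strip() and not line.startswith(">"))


def motif_frequencies(gff_lines, fasta_lines, name_column=2):
    """Return {motif name: instances per base pair}.

    Assumption: the motif name is in tab-separated column `name_column`
    (0-based) of each GFF line.
    """
    total = total_base_pairs(fasta_lines)
    if total == 0:
        raise ValueError("FASTA input contains no sequence")
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) > name_column:
            counts[fields[name_column]] += 1
    return {name: count / total for name, count in counts.items()}
```

With files on disk, the two arguments can be open file handles, for example `motif_frequencies(open("set.gff"), open("set.fasta"))`, where both file names are placeholders.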
|
|
|
|
Yes, because the scores are calculated by the same method. Just select both the MotifScanner and MotifSampler sources in the ModuleSearcher dialog (hold down the Ctrl key).
|
|
|
|
Ctrl + click on a conserved region (or right click and choose cut). This takes you straight to the Sequence Select window. There, select the conserved region and choose to select all features in the set with the same source. This puts all subsequences that are annotated as conserved regions (e.g., because their source is AVID) in the sublist. When you now run MotifScanner, choose to run it "only on the sublist". This way, only the conserved regions are scored. After you have annotated the results, you can run ModuleSearcher. Since the non-conserved regions were not included in the MotifScanner scoring, they do not contain predictions of binding sites, so ModuleSearcher will not take these regions into account. The result of ModuleSearcher is then a module that is always located within a conserved region.
|
|
|
|
The background models available within TOUCAN are n-th order Markov chains that were trained on large sequence sets (e.g., all promoters in a genome, all conserved non-coding sequences, etc.). If your preferred background model is not listed in TOUCAN, you can build one yourself using the program CreateBackgroundModel, which can be downloaded here: http://www.esat.kuleuven.ac.be/~thijs/download.html. You can upload this model in the MotifScanner, MotifLocator, or MotifSampler dialog windows (choose Browse instead of Get).
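For readers unfamiliar with the term, the following Python sketch shows what training an n-th order Markov chain on a set of sequences amounts to. It is a conceptual illustration only. It does not reproduce CreateBackgroundModel or its file format, and the function name `train_markov_background` and the pseudocount smoothing are assumptions made for this example.

```python
from collections import defaultdict


def train_markov_background(sequences, order=1, alphabet="ACGT",
                            pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns {context: {base: probability}} for every context that was seen.
    Positions containing symbols outside `alphabet` (e.g. N) are skipped.
    """
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in alphabet and all(c in alphabet for c in context):
                counts[context][base] += 1
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + pseudocount * len(alphabet)
        model[context] = {base: (count + pseudocount) / total
                          for base, count in base_counts.items()}
    return model
```

For example, `train_markov_background(promoter_sequences, order=3)` would estimate a third-order model, where `promoter_sequences` is a placeholder for your own list of sequence strings.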
|
|
|
|
ModuleSearcher searches for the best similar cis-regulatory modules that are present in all sequences. If you run ModuleSearcher only on the sublist, different conserved regions on one sequence are considered together. Thus, the best cis-regulatory module present in one of the two (or in both, if it is large) is returned. If you open the sublist in a new window and then run ModuleSearcher, a (putative) cis-regulatory module is found for each conserved region. Of course, the first way is the most biologically relevant.
|