In this tutorial we use Toucan as a workbench to run several motif-finding tools. To illustrate the different steps we use a set of upstream sequences from 11 genes in S. cerevisiae that are regulated by the Cbf1-Met4p-Met28p complex and by Met31p or Met32p in response to methionine. These sequences were obtained from RSA tools; the upstream regions between -800 and -1 were selected. The consensus binding sites are TCACGTG for the Cbf1-Met4p-Met28p complex and AAAACTGTGG for Met31p or Met32p.
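As a quick sanity check before running any tool, the two consensus strings can be searched for directly in an upstream sequence. The following is a minimal sketch, not part of Toucan: the function names and the toy sequence are illustrative assumptions, and only exact matches on either strand are reported.

```python
# Minimal sketch (not part of Toucan): exact consensus search on both strands.
# Function names and the toy sequences below are illustrative assumptions.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an ACGT string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_consensus(seq, consensus):
    """Return (start, strand) pairs for exact matches.

    start is a 0-based offset on the forward strand; '-' means the
    reverse complement of the consensus was found at that offset.
    """
    seq = seq.upper()
    consensus = consensus.upper()
    rc = reverse_complement(consensus)
    width = len(consensus)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if window == consensus:
            hits.append((i, "+"))
        if window == rc and rc != consensus:
            hits.append((i, "-"))
    return hits


if __name__ == "__main__":
    toy_upstream = "GGTCACGTGACC"  # toy sequence, not from the tutorial data set
    print(find_consensus(toy_upstream, "TCACGTG"))
```

Exact matching is far stricter than the probabilistic models used by the tools below, so this is only useful as a first look at the data.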
You can download the sequences used in this tutorial here.
To load the raw sequences in Toucan we use the File -> Load Seq utility. Select the sequence file with the file browser.
After the sequences are loaded in Toucan, they are represented as lines in the application window. The main window consists of six subwindows that contain:
A. Created sublists of sequences
B. List of features annotated on the sequences
C. Visualization of sequences and selected annotated features
D. Information about a clicked gene
E. Information about the feature that has been selected on the sequence
F. Status information about submitted tasks
In the first step we search for unknown overrepresented motifs using MotifSampler. For more information about MotifSampler you can check the web site.
To start MotifSampler, select the MotifSampler application from the Motifs menu. This will launch the parameter window.
Set the parameters of MotifSampler. Here we have set the number of runs to 10 and the number of motifs per run to 3. This will result in a list of at most 30 distinct motifs. If you would like to do more runs, you can download the stand-alone version. Information about how to use MotifSampler and how to interpret the results can be found here.
To select the appropriate background model, we can either get the list of available models from the remote server or load a self-defined model. After clicking on 'Get', a list of precomputed models pops up. Here we select the 3rd-order model from yeast.
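To make the notion of a 3rd-order background model concrete: such a model gives the probability of each nucleotide conditioned on the three preceding nucleotides. The sketch below estimates these probabilities from a set of sequences. It is an illustration under stated assumptions (the function name, the pseudocount, and the dictionary layout are ours) and does not reproduce the file format or the estimation procedure of the precomputed models on the server.

```python
# Minimal sketch: estimate a k-th order Markov background model.
# The function name, pseudocount, and output layout are assumptions made
# for illustration; they are not the format used by the remote server.
from collections import defaultdict

ALPHABET = "ACGT"


def train_background(sequences, order=3, pseudocount=1.0):
    """Return {context: {base: P(base | context)}} for contexts of length `order`."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous symbols such as N.
            if set(context + base) <= set(ALPHABET):
                counts[context][base] += 1.0
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + len(ALPHABET) * pseudocount
        model[context] = {
            base: (base_counts[base] + pseudocount) / total for base in ALPHABET
        }
    return model


if __name__ == "__main__":
    model = train_background(["ACGTACGTACGT"])  # toy input
    print(model["ACG"])
```

A real background model would be trained on a large set of intergenic sequences from the organism of interest rather than on a toy string.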
Once all parameters are set, press OK to submit the data to the remote server, where the computations are done. Be patient here, especially if you have set the number of runs to a large value. Also be aware that you should not close Toucan while waiting; otherwise your results will be lost.
When the results are ready, a popup window appears. There are two options to save the results. The first is to annotate the retrieved features directly on the sequences, which is done by clicking on 'yes'. When you select 'no', the retrieved features are saved to files: the first file contains the instances in GFF and the other is a specific matrix file. When you select 'yes', the matrix file is not saved.
In this example just click on 'yes' to annotate the computed features directly on the sequences.
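If you choose 'no' instead and want to process the saved instances outside Toucan, the GFF file can be read with a few lines of code. The sketch below assumes the standard tab-separated GFF layout (seqname, source, feature, start, end, score, strand, frame, attributes); the example line is hypothetical, and the exact values the server writes into each column may differ.

```python
# Minimal sketch: parse tab-separated GFF lines into records.
# The sample line is hypothetical and only illustrates the column layout.
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Return a GffRecord, or None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated GFF columns")
    fields += [""] * (9 - len(fields))  # the attributes column is optional
    seqname, source, feature, start, end, score, strand, frame, attrs = fields[:9]
    return GffRecord(
        seqname,
        source,
        feature,
        int(start),
        int(end),
        None if score == "." else float(score),
        strand,
        frame,
        attrs,
    )


if __name__ == "__main__":
    sample = "YER091C\tMotifSampler\tmisc_feature\t412\t419\t1.23\t+\t.\tid \"box_1\""
    print(parse_gff_line(sample))
```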
This window shows the annotated features. On the left there is a list with the identifiers of all the motifs found. You can manipulate the appearance (color, fill) of one or more features by right-clicking on a name in this list.
The next step in the analysis is the search for known motifs. For this we first use the MotifScanner tool. The algorithm is based on a probabilistic sequence model in which motifs are assumed to be hidden in a noisy background sequence.
MotifScanner is started from the menu Motifs -> MotifScanner.
The important parameter here is the prior probability of finding one instance of the motif in a sequence. The lower the prior, the lower the number of motifs found. The default value is set to 0.1, which is rather conservative. We leave it as it is for the moment.
The next step is the selection of the appropriate set of matrices. A list of models is available on the server. You can access it by clicking the Get button. In this example we use the set of matrices from SCPD. If you would like to select only one or a few motifs from the list, you can specify that as well.
Once all parameters are set, press OK to submit the data to the remote server, where the computations are done. The computation time depends on the size of the data set and the number of matrices.
To save the results, the same procedure applies as described for MotifSampler, except that in this case no matrix file is saved when 'no' is selected.
Notice that in the status window (bottom) the full results, received from the remote server, appear before they are annotated. If something went wrong when processing the data this information will also appear in this window.
This window shows all the instances annotated on the respective sequences. To view only the newly added instances, use the control to select the SCPD feature in the left subwindow and then press 'enter' or right-click and select 'show'. The number of instances is rather low because we have chosen a rather conservative prior of 0.1. If you would like to see more instances, you should increase the prior.
Another tool to find instances of known motifs is MotifLocator. This tool works like a classical position weight matrix scoring scheme. Unlike MotifScanner, it does not take the sequence content into account. MotifLocator assigns a score to each individual site in the sequence, normalizes these scores between 0 and 1, and outputs the instances with a score greater than the threshold.
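The scoring scheme described above can be sketched as follows. This is an illustration of a generic position weight matrix scan, not MotifLocator's actual implementation: we assume log-odds scores against a uniform single-nucleotide background and a min-max normalization to the range 0 to 1, and all function names and the toy count matrix are ours.

```python
# Minimal sketch of a classical PWM scan with scores normalized to [0, 1].
# Assumptions (not confirmed for MotifLocator): log-odds against a uniform
# background, min-max normalization, and a strict "greater than" threshold.
import math

ALPHABET = "ACGT"


def log_odds_matrix(counts, background=None, pseudocount=0.5):
    """Convert per-position base counts into log-odds scores."""
    background = background or {base: 0.25 for base in ALPHABET}
    matrix = []
    for column in counts:
        total = sum(column[base] for base in ALPHABET) + len(ALPHABET) * pseudocount
        matrix.append({
            base: math.log((column[base] + pseudocount) / total / background[base])
            for base in ALPHABET
        })
    return matrix


def scan(seq, matrix, threshold=0.9):
    """Return (offset, site, normalized_score) for sites scoring above threshold."""
    lowest = sum(min(column.values()) for column in matrix)
    highest = sum(max(column.values()) for column in matrix)
    if highest == lowest:  # flat matrix: every site scores the same
        return []
    width = len(matrix)
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - width + 1):
        site = seq[i:i + width]
        if set(site) - set(ALPHABET):
            continue  # skip sites containing ambiguous symbols
        raw = sum(column[base] for column, base in zip(matrix, site))
        score = (raw - lowest) / (highest - lowest)
        if score > threshold:
            hits.append((i, site, score))
    return hits


if __name__ == "__main__":
    # Toy count matrix for CACGTG: 10 observations of the consensus base per column.
    toy_counts = [
        {base: (10 if base == consensus_base else 0) for base in ALPHABET}
        for consensus_base in "CACGTG"
    ]
    print(scan("TTCACGTGTT", log_odds_matrix(toy_counts), threshold=0.9))
```

With this normalization a threshold of 0.9 keeps only sites close to the best possible score for the matrix, which is why lowering it yields more, but less reliable, instances.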
Here we use the results of MotifLocator to quickly compare the results of MotifSampler with the examples in the Transfac database.
MotifLocator is started from the menu Motifs -> MotifLocator.
From the list of available motif models we select the fungi matrices from the Transfac database. Since we do not want to be too conservative in this test, we lower the threshold from 0.9 to 0.8. The background model is again the 3rd-order yeast model.
When the results are ready, we immediately annotate them on the sequences. To explore the results we have set the color of all instances found with MotifSampler to light gray and we have toggled the fill state of the Transfac instances (right-click on a motif identifier). Next, we select in the left subwindow all MotifSampler motifs and one of the Transfac matrices (ctrl and left-click) and display them (press enter after selection). We are now interested in the positions where the blue square overlaps with a gray box. By clicking on such an instance, we can view the related information in the feature window.
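The same comparison can be made programmatically once the instances are available as coordinates. The sketch below lists the pairs of instances that overlap on the same sequence; the record layout (dictionaries with seq, start, end, and name keys) and the example coordinates are hypothetical, and start and end are treated as inclusive.

```python
# Minimal sketch: find MotifSampler instances that overlap known-motif instances.
# The record layout and the example coordinates are hypothetical.


def overlaps(a, b):
    """True if two instances lie on the same sequence and share at least one position."""
    return a["seq"] == b["seq"] and a["start"] <= b["end"] and b["start"] <= a["end"]


def overlapping_pairs(sampler_hits, known_hits):
    """Return all (sampler_instance, known_instance) pairs that overlap."""
    return [(s, k) for s in sampler_hits for k in known_hits if overlaps(s, k)]


if __name__ == "__main__":
    sampler_hits = [
        {"seq": "geneA", "start": 100, "end": 107, "name": "box_1"},
        {"seq": "geneB", "start": 300, "end": 307, "name": "box_1"},
    ]
    known_hits = [
        {"seq": "geneA", "start": 104, "end": 111, "name": "known_1"},
        {"seq": "geneB", "start": 500, "end": 507, "name": "known_1"},
    ]
    for s, k in overlapping_pairs(sampler_hits, known_hits):
        print(s["seq"], s["name"], "overlaps", k["name"])
```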
Statistical overrepresentation and visualization
Finally, we would like to assess the statistical significance of the motifs found. For this we use the binomial statistic that is available in Toucan. Start this analysis by selecting Motifs -> Stat. Over.rep.
The statistical overrepresentation of the number of instances is measured by comparing this number with the expected number of instances based on a reference set. This can be either an annotated sequence set or an expected frequencies file. The annotated reference set is a file of sequences in EMBL format. Such a set can, for instance, be made by screening a set of random sequences with the same parameter settings as used in the performed analysis. The other option is to use a file with expected frequencies. Here we have precomputed the expected frequency of the SCPD matrices from a screening of all upstream promoter sequences in the yeast genome. The expected frequency set for the motifs used in this example can be found here.
When you use this statistic to measure overrepresentation, be sure that you select the reference file with the right parameter settings. If the parameters of the reference data do not correspond to the ones chosen to find the instances here, the results will be strongly biased.
When the computation is done, the different motifs are sorted according to their respective significance coefficient. Those with a coefficient greater than 0 are considered significant. In this example only the PHO4 motif, with 5 instances, is statistically overrepresented.
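To give an idea of what a binomial overrepresentation test computes, the sketch below evaluates the probability of observing at least k instances when n positions are scanned and each position has expected frequency p. The significance coefficient shown, -log10(p-value × number of motifs tested), is an assumption on our part: it is one common definition that makes "greater than 0" the significance cut-off, but the exact formula used by Toucan is not stated in this tutorial. All function names and the numbers in the usage example are illustrative.

```python
# Minimal sketch of a binomial overrepresentation test.
# Assumption: significance = -log10(p_value * number_of_motifs_tested).
# This is one common definition; Toucan's exact formula may differ.
import math


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that large n does not overflow.
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def significance(p_value, motifs_tested):
    """Assumed significance coefficient: positive when the corrected p-value is below 1."""
    return -math.log10(p_value * motifs_tested)


if __name__ == "__main__":
    # Illustrative numbers only: 5 observed instances in 8800 scanned positions,
    # a hypothetical expected frequency of 1e-4 per position, 20 motifs tested.
    p_value = binomial_tail(5, 8800, 1e-4)
    print(p_value, significance(p_value, 20))
```

Whatever the exact formula, the comparison is only meaningful when the expected frequency was derived with the same parameter settings as the scan being tested.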
Finally, we have highlighted the overrepresented motif, SCPD-PHO4, together with the instances found with MotifSampler that overlap with this motif. As you can see, the consensus of the motif found with MotifSampler is TCACGTGA, which corresponds to one of the known motifs stated in the introduction of this tutorial.
The reason that we find only such a small number of instances is that the prior of 0.1 might have been too stringent. To select more instances, we should increase the prior.
To save the results, use File -> Save List. The format in which the results are saved depends on the extension you give to the file. To save only the sequences, in FASTA format, use the extension .tfa or .fasta. When you also want to save the annotated features, use the EMBL format and give the file the extension .embl.
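The extension rule above can be summarized in a small lookup. This sketch only restates the mapping given in the text; the function name is hypothetical, and raising an error for other extensions is our assumption, since the tutorial does not say how Toucan treats them.

```python
# Minimal sketch: map a file name to the save format described in the text.
# The function name and the error for unknown extensions are assumptions.
import os

FORMAT_BY_EXTENSION = {
    ".tfa": "fasta",    # sequences only
    ".fasta": "fasta",  # sequences only
    ".embl": "embl",    # sequences plus annotated features
}


def save_format(filename):
    """Return 'fasta' or 'embl' depending on the file extension."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in FORMAT_BY_EXTENSION:
        raise ValueError("unrecognized extension: %r" % extension)
    return FORMAT_BY_EXTENSION[extension]


if __name__ == "__main__":
    print(save_format("met_genes.embl"))
```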