Endeavour Web Server Manual
The candidate gene prioritization problem
In the last decade, many different high-throughput technologies have been developed and they are now extensively used in biology.
Several in-silico methods have recently been proposed to prioritize candidate genes. Here, we present ENDEAVOUR (article), a system based on gene similarity that uses a data-fusion algorithm.
ENDEAVOUR is based on gene similarity.
The underlying assumption is that genes involved in the same disease (or in the same biological pathway) are more similar to each other than to the rest of the genome. In our case, 'similar' can mean, for instance, that their protein sequences are similar, that their proteins physically interact within cells, or that they share functional regulation motifs. Thus, ENDEAVOUR considers the candidate gene that is most similar to the known genes to be the best candidate.
ENDEAVOUR uses a data-fusion algorithm.
Several biological databases contain relevant information about genes (or their products). They differ in the type of data they collect (microarray data, interaction data, sequence data, ...) and in the quality of the data (manually curated data, in-silico derived data, predicted data, ...). The information is sometimes redundant, sometimes contradictory, and never complete enough to be used alone. The idea of the data-fusion algorithm is to use many different data-sources and combine them so that the combination is more powerful than any single data-source alone.
We are currently using the following data-sources. Unless otherwise specified, the data were downloaded on September 30, 2007.
A three-step algorithm.
- In the first step, we build the model, with one sub-model per source of information. To do so, we collect all the information about the disease (or pathway) under study using the genes already known to be involved in the process. For each data-source in turn, the data are collected and analyzed, and the resulting information is used to build a sub-model.
- In the second step, we use the model built in the first step to score the candidates. For each source, the candidates are ranked according to their similarities to the appropriate sub-model. At this point, we obtain a set of r rankings, one per data-source.
- In the third step, we combine the r rankings into one global ranking using order statistics.
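The third step can be illustrated with a short Python sketch. It uses a standard recursive formula for the joint cumulative distribution of order statistics: each gene's rank ratios (rank divided by the number of ranked genes, one per data-source) are sorted, and the statistic Q gives the probability of observing rank ratios at least this small by chance. Whether the server computes exactly this quantity, and how it treats genes that are missing from a data-source, is an assumption here; the function names are hypothetical.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Joint cumulative probability of the order statistics of N rank ratios.

    Assumes each rank ratio is uniform on [0, 1] under the null hypothesis.
    Smaller values indicate a better candidate.
    """
    r = sorted(rank_ratios)              # r[0] <= r[1] <= ... <= r[n-1]
    n = len(r)
    v = [1.0]                            # V_0 = 1
    for k in range(1, n + 1):
        x = r[n - k]                     # r_{N-k+1} in 1-based notation
        v_k = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
        v.append(v_k)
    return factorial(n) * v[n]           # Q = N! * V_N


def fuse_rankings(rank_ratios_per_gene):
    """Return gene names ordered by increasing Q (best candidate first)."""
    return sorted(
        rank_ratios_per_gene,
        key=lambda gene: q_statistic(rank_ratios_per_gene[gene]),
    )


if __name__ == "__main__":
    # Hypothetical rank ratios for three genes over two data-sources.
    example = {"A": [0.1, 0.2], "B": [0.5, 0.5], "C": [0.9, 0.05]}
    for gene in fuse_rankings(example):
        print(gene, round(q_statistic(example[gene]), 4))
```

With a single data-source, Q reduces to the rank ratio itself, so the global ranking equals that data-source's ranking.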
Description of the functionalities
A standard prioritization consists of five steps:
- Building of a training set, which consists of genes already known to play a role in the process under study.
- Selection of the data-sources to use.
- Building of a candidate set, the list to prioritize.
- Launching the prioritization.
- Analyzing the produced results.
How to build a training/candidate set?
First, the user needs to define the organism to work with by selecting the correct organism name in the menu.
Then browse to the "Training genes" panel and fill in the text area at the bottom with Ensembl gene ids, gene symbols, chromosomal regions or bands, KEGG pathway ids, Gene Ontology ids and/or OMIM ids. Some types must be prefixed with a keyword so that the program can recognize them; the following table lists the prefixes together with examples.
|Input type||Prefix||Homo sapiens||Mus musculus||Rattus norvegicus||Caenorhabditis elegans||Genes added|
|Gene identifier||(none)||ENSG00000184895||ENSMUSG00000071964||ENSRNOG00000012772||WBGene00000966||The gene whose main identifier exactly matches the input|
|Gene symbol||(none)||OPTN||Tmem58||TAGL||T09E11.8||The gene whose symbol exactly matches the input|
|Chromosomal region||chrX:||chr1:0-100000||chr11:10000-500000||chr5:100000-300000||chrII:0-200000||All genes located in the chromosomal region|
|Chromosomal band||chr:||chr:1p36||chr:11A4||chr:3q22||Not supported||All genes located in the chromosomal band|
|Kegg||kegg:||kegg:05211||kegg:04540||kegg:00230||kegg:00624||All genes involved in the given KEGG pathway|
|Gene Ontology||go:||go:0019321||go:0005747||go:0004114||go:0006421||All genes annotated with the given GO term|
|Omim||omim:||omim:parkin||Not supported||Not supported||Not supported||All genes involved in a disease that partially matches the input|
- If the user changes the species when genes are already loaded then these genes are removed.
- A gene is not added to a set in which it's already present.
- A gene present in the training set can also be present in the candidate set, which introduces some bias.
- The recycling bin should be used to remove a gene from the set.
- When no genes are added, the console (at the bottom) contains crucial information to help find the problem.
- The case of the input does not matter.
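The prefix rules in the table and the notes above can be sketched as a small Python classifier. This is only an illustration of how an input item could be recognized from its prefix; the function name, the returned labels, and the regular expression for chromosomal regions are assumptions, not the server's actual parsing code.

```python
import re

# Keyword prefixes from the table above; matching ignores case.
_PREFIXES = [
    ("chr:", "chromosomal band"),
    ("kegg:", "KEGG pathway"),
    ("go:", "Gene Ontology term"),
    ("omim:", "OMIM disease"),
]

# Assumed shape of a region: chr<name>:<start>-<end>, e.g. chr1:0-100000.
_REGION = re.compile(r"^chr[0-9a-z]+:\d+-\d+$")


def classify(item):
    """Return (input type, value without its prefix) for one input item."""
    raw = item.strip()
    lowered = raw.lower()
    for prefix, kind in _PREFIXES:
        if lowered.startswith(prefix):
            return kind, raw[len(prefix):]
    if _REGION.match(lowered):
        return "chromosomal region", raw[len("chr"):]
    # No prefix: the item is looked up as a gene identifier or gene symbol.
    return "gene identifier or symbol", raw


if __name__ == "__main__":
    for item in "OPTN chr:1p36 chr1:0-100000 KEGG:05211 go:0019321".split():
        print(item, "->", classify(item))
```

The band prefix is tested before the region pattern so that chr:1p36 is never mistaken for a region on a chromosome.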
How to select the data-sources?
Within the "Models" panel, the user should tick the boxes in front of the data-sources to use.
- A short description of each data-source is displayed when hovering over the blue question mark.
- Some data-sources are available for all organisms (Gene Ontology, Swissprot, Text,...) while some others are species-specific (Ouzounis, ProspectR, Cis-regulatory module,...).
How to launch the prioritization?
The "Launch prioritization" button located in the 4th step of the wizard should be used. The prioritization itself can take several minutes depending on the number of candidates and the number of data-sources. The prioritization is done when the results panels are displayed.
- If the "Launch prioritization" button is disabled, the required steps (selection of training genes, candidate genes and data-sources) have not all been completed yet.
How to analyze the results?
Three panels are displayed in the results (below the wizard and the console) as soon as the algorithm returns; two of them can also be found in the Java GUI (see Endeavour project main page).
The first panel - Sprint plot - presents a graphical overview of the prioritization results. The first column corresponds to the global ranking of the genes, while the others give the ranking for each selected data-source. The top 16 genes are each assigned a color so that the rank of a given gene can easily be found for each model (selected data-source). When hovering over a box, the corresponding gene description is displayed in a pop-up, giving the user more information about that gene. A gene name shown in red means that the gene obtained the maximum dissimilarity score (it is the most dissimilar to the training genes). A gene name shown in strikethrough font means that the gene was not scored for that data-source.
The second panel - Table - presents the prioritization results in a table. For each candidate gene and each data-source, the score, rank and rank ratio are displayed, together with the overall rank and rank ratio. The ranks and rank ratios are calculated from the scores. Please refer to the Endeavour paper for further information. The candidate genes are sorted according to the overall ranking by default, but the user can sort them according to any data-source.
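As an illustration of how ranks and rank ratios can be derived from scores for one data-source, here is a minimal Python sketch. It assumes that the rank ratio is the rank divided by the number of genes actually ranked for that data-source, that unscored genes are left out of the ranking, and that the direction of the score (higher or lower is better) is a parameter; none of these details are confirmed by this manual, and the function name is hypothetical.

```python
def rank_and_ratio(scores, higher_is_better=True):
    """Return {gene: (rank, rank_ratio)} for one data-source.

    `scores` maps a gene name to its score, or to None when the gene
    was not scored. Unscored genes are skipped (assumption). Ties keep
    the order in which the genes were given.
    """
    scored = {gene: s for gene, s in scores.items() if s is not None}
    ordered = sorted(scored, key=scored.get, reverse=higher_is_better)
    total = len(ordered)
    return {gene: (i, i / total) for i, gene in enumerate(ordered, start=1)}


if __name__ == "__main__":
    # Hypothetical scores; gene C was not scored for this data-source.
    print(rank_and_ratio({"A": 0.9, "B": 0.1, "C": None, "D": 0.5}))
```

Using ratios rather than raw ranks keeps data-sources comparable when they score different numbers of genes.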
The third panel - Export - (previously XML) allows the user to save the results, either in XML format, so that the prioritization results can be saved and reused, or as CSV, so that the results can be processed further, for example with spreadsheet software.
Example: discovery of a novel Usher gene.
This section describes a small prioritization procedure: finding a novel Usher syndrome gene within the chromosomal band 9q32. A recent publication by Ebermann et al. has shown that DFNB31 causes Usher syndrome when mutated.
Usher syndrome is a genetic disorder that involves both retinitis pigmentosa (a disease of the eye) and hearing impairment (a disease of the ear). More information can be found on the NIDCD website or on Wikipedia. The genes already known to be involved in Usher syndrome when altered are collected in the OMIM database. PubMed also collects the scientific publications and thus contains information about the syndrome. According to OMIM and PubMed, eight genes are already known to be involved in Usher syndrome.
The species panel
The first thing to do is choose an organism to work with. By selecting Homo sapiens in the menu, I make sure that I work with human genes.
The training genes panel
Then, by typing CLRN1 ENSG00000042781 USH1G ENSG00000006611 PCDH15 ENSG00000137474 CDH23 ENSG00000164199 and pressing the "Add" button, I start the loading of the 8 already known Usher genes. When loading is complete, the 8 rows are displayed in the table area.
The data-sources panel
I select the data-sources I want to use based on my expertise and on the short description displayed when hovering over a data-source.
The candidate genes panel
By typing chr:9q32 and pressing the "Add" button, I start the loading of the 32 genes located on the human chromosomal band 9q32. As before, when loading is complete, the 32 rows are displayed in the table area.
The sprint plot panel
Accessible only when the prioritization is done, this panel shows the graphical version of the results.
The results panel
Accessible only when the prioritization is done, this panel shows the final results of the algorithm. Global as well as per data-source rankings and p-values are shown. Here, we observe that DFNB31 ranks first out of the 32 genes of the region.