Endeavour Web Server Manual



The candidate gene prioritization problem

Biological context

    In the last decade, many different high-throughput technologies have been developed and they are now extensively used in biology.

  • Using microarray technology, we can measure the expression level of thousands of genes simultaneously. By running several experiments, we can measure the expression levels within a diseased cell and compare them to the levels obtained within a 'normal' cell. The genes found to be over-expressed (or under-expressed) in the disease sample, compared to the normal sample, are thought to be involved in the disease under study. The major problem is that such a list can contain hundreds of genes, so wet-lab validation of every gene remains too expensive.
  • In contrast to microarray technology, CGH arrays look at the DNA level of the gene and thus allow us to detect small chromosomal aberrations. This technology is usually used when the gene underlying the disorder is not yet known. After analysis, the list of genes lying within the altered region is thought to contain the disease-causing gene. As before, the main problem is that this list contains hundreds of genes. There is therefore a need for a method that can pick the best candidates from the list.
Biological context of gene prioritization

Several in silico methods have recently been proposed to perform that task. Here, we present ENDEAVOUR (article), a system based on gene similarity that uses a data-fusion algorithm.

The concept

ENDEAVOUR is based on gene similarity.

    The underlying assumption is that genes involved in the same disease (or in the same biological pathway) are more similar to each other than to the rest of the genome. Here, 'similar' can mean, for instance, that their protein sequences are similar, that their proteins physically interact within cells, or that they share functional regulatory motifs. Thus, ENDEAVOUR considers the candidate in the list that is most similar to the known genes to be the best candidate.

ENDEAVOUR uses a data-fusion algorithm.

    There are several biological databases that contain relevant information about genes (or their products). They differ in the type of data they collect (microarray data, interaction data, sequence data, ...) and in the quality of the data (manually curated data, in silico derived data, predicted data, ...). The information is sometimes redundant, sometimes contradictory, and never complete enough to be used alone. The idea of the data-fusion algorithm is to combine many different data-sources so that the combination is more powerful than any single data-source.

We are currently using the following data-sources. Unless specified otherwise, the data were downloaded on September 30, 2007.
  • Annotation sources – a gene is annotated with several terms that describe its functions, domains, ...
  • Interaction sources – a gene network is built, each gene having a number of neighbors.
  • Microarray datasets – each gene is represented by an expression profile for numerous conditions/tissues:
    • Su et al microarray data (article).
    • Son et al microarray data (article).
    • Baugh et al microarray data (article).
    • Hovatta et al microarray data (article).
    • Lindsley et al microarray data (dataset).
    • Walker et al microarray data (article).
  • Disease probabilities – a gene is linked to a probability of being involved in a disease in general:
    • Ouzounis et al probabilities (article).
    • ProspectR probabilities (article).
  • Other sources:
    • Motif data – putative motifs in the upstream region of the gene (website).
    • Cis-regulatory modules – combination of n motifs that can potentially co-regulate a set of genes (website).
    • Blast scores – protein sequence similarities using BLAST (website).
    • Text-mining data – keywords found in abstracts of scientific publications (website).
Concept of data fusion.

A three-step algorithm.

  • In the first step, we build the model, with one sub-model per source of information. To do so, we collect all the information about the disease (or pathway) under study using the genes already known to be involved in the process. For each data-source in turn, the data are collected and analyzed, and the resulting information is used to build a sub-model.
  • In the second step, we use the model built in the first step to score the candidates. For each source, the candidates are ranked according to their similarities to the appropriate sub-model. At this point, we obtain a set of r rankings, one per data-source.
  • In the third step, we combine the r rankings into one global ranking using order statistics.
Training step.
Scoring step.
Fusion step.
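
The fusion step can be illustrated with a short Python sketch. The manual only states that order statistics are used, so the details here are assumptions made for illustration: each ranking is assumed to be expressed as a rank ratio (rank divided by the number of ranked genes), and the ratios are combined with the joint order-statistics Q value computed by a standard recursive formula. The function name q_statistic, the gene names and the values are hypothetical.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Joint order-statistics Q value for one gene (illustrative sketch).

    rank_ratios: one value in (0, 1] per data-source, assumed to be
    rank / number of ranked genes. A smaller Q means the gene is ranked
    consistently near the top across the data-sources.
    """
    r = sorted(rank_ratios)          # ascending: r[0] is the best ratio
    n = len(r)
    v = [1.0]                        # V_0 = 1
    for k in range(1, n + 1):
        # V_k = sum_{i=1..k} (-1)^(i-1) * V_(k-i) / i! * r_(n-k+1)^i
        total = 0.0
        for i in range(1, k + 1):
            total += (-1) ** (i - 1) * v[k - i] / factorial(i) * r[n - k] ** i
        v.append(total)
    return factorial(n) * v[n]       # Q = n! * V_n


# Hypothetical rank ratios of three candidates over three data-sources.
candidates = {
    "geneA": [0.05, 0.10, 0.20],
    "geneB": [0.50, 0.40, 0.90],
    "geneC": [0.10, 0.80, 0.30],
}

# Global ranking: candidates ordered by increasing Q value.
global_ranking = sorted(candidates, key=lambda g: q_statistic(candidates[g]))
print(global_ranking)
```

With a single data-source the Q value reduces to the rank ratio itself, so the global ranking then equals that source's ranking.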

Description of the functionalities

    A standard prioritization consists of five steps:

  • Building of a training set, which consists of genes already known to play a role in the process under study.
  • Selection of the data-sources to use.
  • Building of a candidate set, the list to prioritize.
  • Launching the prioritization.
  • Analyzing the produced results.

How to build a training/candidate set?

    First, the user needs to define the organism to work with by selecting the correct organism name in the menu. Then, in the "Training genes" panel, fill in the text area at the bottom with Ensembl gene ids, gene symbols, chromosomal regions or bands, KEGG pathway ids, Gene Ontology ids and/or OMIM ids. Some types must be prefixed with a keyword so that the program can recognize them; the prefixes and examples are given below.

  • Gene identifier (no prefix) – the gene whose main identifier matches the input exactly. Examples: ENSG00000184895 (human), ENSMUSG00000071964 (mouse), ENSRNOG00000012772 (rat), WBGene00000966 (worm).
  • Gene symbol (no prefix) – the gene whose symbol matches the input exactly. Examples: OPTN (human), Tmem58 (mouse), TAGL (rat), T09E11.8 (worm).
  • Chromosomal region (prefix chrX:) – all genes located in the chromosomal region. Examples: chr1:0-100000 (human), chr11:10000-500000 (mouse), chr5:100000-300000 (rat), chrII:0-200000 (worm).
  • Chromosomal band (prefix chr:) – all genes located in the chromosomal band. Examples: chr:1p36 (human), chr:11A4 (mouse), chr:3q22 (rat); not supported for worm.
  • Kegg (prefix kegg:) – all genes involved in the given Kegg pathway. Examples: kegg:05211 (human), kegg:04540 (mouse), kegg:00230 (rat), kegg:00624 (worm).
  • Gene Ontology (prefix go:) – all genes annotated with the given GO term. Examples: go:0019321 (human), go:0005747 (mouse), go:0004114 (rat), go:0006421 (worm).
  • Omim (prefix omim:) – all genes involved in a disease that partially matches the input. Example: omim:parkin (human); not supported for mouse, rat or worm.
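
To make the prefix rules concrete, here is a minimal Python sketch that guesses the type of an input token from the prefixes and examples listed above. It is not the server's actual parser, which this manual does not document: the regular expressions, the labels and the function name classify are assumptions, and the gene identifier pattern only covers the four identifier styles shown in the examples.

```python
import re

# Hypothetical patterns derived from the documented prefixes and examples.
# Matching is case-insensitive, since the case of the input does not matter.
PATTERNS = [
    ("chromosomal band", re.compile(r"^chr:(.+)$", re.IGNORECASE)),
    ("chromosomal region", re.compile(r"^chr(\w+):(\d+)-(\d+)$", re.IGNORECASE)),
    ("kegg pathway", re.compile(r"^kegg:(\d+)$", re.IGNORECASE)),
    ("gene ontology term", re.compile(r"^go:(\d+)$", re.IGNORECASE)),
    ("omim entry", re.compile(r"^omim:(.+)$", re.IGNORECASE)),
    ("gene identifier", re.compile(r"^(ENS[A-Z]*G\d+|WBGene\d+)$", re.IGNORECASE)),
]


def classify(token):
    """Return the guessed input type of one token."""
    token = token.strip()
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    # Anything without a recognized prefix or identifier shape is
    # treated as a gene symbol.
    return "gene symbol"


for token in ["OPTN", "ENSG00000184895", "chr:9q32", "chr1:0-100000",
              "kegg:05211", "go:0019321", "omim:parkin"]:
    print(token, "->", classify(token))
```

Testing the band pattern (chr: followed directly by the band) before the region pattern (chr, a chromosome name, then a start-end range) keeps the two chromosomal types apart.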

Important notes:

  • If the user changes the species when genes are already loaded, these genes are removed.
  • A gene is not added to a set in which it is already present.
  • A gene present in the training set can also be present in the candidate set, which introduces some bias.
  • The recycling bin should be used to remove a gene from the set.
  • When no genes are added, the console (at the bottom) contains the information needed to find out what the problem could be.
  • The case of the input does not matter.
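
Since a gene shared by the training and candidate sets introduces bias, it can be worth checking the two lists before submitting them. The following Python sketch uses a hypothetical helper, shared_genes, that compares the typed identifiers case-insensitively. It only detects identical strings, so a gene entered as a symbol in one set and as an identifier in the other would go unnoticed.

```python
def shared_genes(training, candidates):
    """Return the identifiers present in both lists, ignoring case."""
    training_keys = {gene.strip().upper() for gene in training}
    candidate_keys = {gene.strip().upper() for gene in candidates}
    return sorted(training_keys & candidate_keys)


# Hypothetical input lists.
overlap = shared_genes(["OPTN", "CDH23"], ["cdh23", "DFNB31"])
print(overlap)
```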

How to select the data-sources?

    Within the "Models" panel, the user should tick the boxes in front of the data-sources to use.

Important notes:

  • A short description of each data-source is displayed when pointing at the blue question mark.
  • Some data-sources are available for all organisms (Gene Ontology, Swissprot, Text,...) while some others are species-specific (Ouzounis, ProspectR, Cis-regulatory module,...).

How to launch the prioritization?

    Use the "Launch prioritization" button located in the 4th step of the wizard. The prioritization itself can take several minutes, depending on the number of candidates and the number of data-sources. It is finished when the results panels are displayed.

Important notes:

  • If the "Launch prioritization" button is disabled, the required steps (selection of training genes, candidate genes and data-sources) have not all been completed yet.

How to analyze the results?

    Three panels are displayed in the results (below the wizard and the console) as soon as the algorithm returns. Two of them are also available in the Java GUI (see the Endeavour project main page).

    The first panel - Sprint plot - presents a graphical overview of the prioritization results. The first column corresponds to the global ranking of the genes, while the others give the ranking for each selected data-source. Each of the top 16 genes is assigned a color so that its rank in each model (selected data-source) is easy to find. When pointing at a box, a pop-up displays the description of the corresponding gene, giving the user more information about it. A gene name shown in red means that the gene obtained the maximum dissimilarity score (it is the most dissimilar to the training genes). A gene name shown in strikethrough font means that the gene was not scored for that data-source.

    The second panel - Table - presents the prioritization results in a table. For each candidate gene and each data-source, the score, rank and rank ratio are displayed, together with the overall rank and rank ratio. The ranks and rank ratios are calculated from the scores. Please refer to the Endeavour paper for further information. By default, the candidate genes are sorted according to the overall ranking, but the user can sort them according to any data-source.
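
The manual does not give the formula that turns scores into ranks and rank ratios, so the Python sketch below rests on two stated assumptions: a rank ratio is the rank divided by the number of genes scored for that data-source, and a lower score means a gene is more similar to the training genes (the lower_is_better flag reverses this). The function name, gene names and values are hypothetical, and ties are simply broken by input order.

```python
def rank_candidates(scores, lower_is_better=True):
    """Compute (rank, rank ratio) per gene for one data-source.

    scores: dict mapping gene -> score, or None when the gene was not
    scored for this data-source. Unscored genes are left out of the result.
    """
    scored = {gene: s for gene, s in scores.items() if s is not None}
    ordered = sorted(scored, key=scored.get, reverse=not lower_is_better)
    n = len(ordered)
    # Rank 1 is the best candidate; the rank ratio is assumed to be rank / n.
    return {gene: (rank, rank / n) for rank, gene in enumerate(ordered, start=1)}


# Hypothetical scores for one data-source; geneC was not scored.
table = rank_candidates({"geneA": 0.1, "geneB": 0.7, "geneC": None, "geneD": 0.4})
for gene, (rank, ratio) in table.items():
    print(gene, rank, round(ratio, 3))
```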

    The third panel - Export - (previously XML) lets the user save the results, either in XML format, so that the prioritization results can be stored and reused, or as CSV, so that they can be processed with spreadsheet software, for example.

Example: discovery of a novel Usher gene.

    This section walks through a small prioritization: finding a novel Usher syndrome gene within the chromosomal band 9q32. A recent publication by Ebermann et al. has shown that DFNB31 causes Usher syndrome when mutated.

    Usher syndrome is a genetic disorder that involves both retinitis pigmentosa (a disease of the eye) and hearing impairment (a disease of the ear). More information can be found on the NIDCD website or on Wikipedia. The genes already known to be involved in Usher syndrome when altered are collected in the OMIM database. PubMed also collects scientific publications and thus contains information about the syndrome. According to OMIM and PubMed, eight genes are already known to be involved in Usher syndrome.




The species panel

The species selection panel.
The first thing to do is choose an organism to work with. By selecting Homo sapiens in the menu, I make sure that I work with human genes.

The training genes panel

The training genes panel waiting for validation from the server.
Then, by typing CLRN1 ENSG00000042781 USH1G ENSG00000006611 PCDH15 ENSG00000137474 CDH23 ENSG00000164199 and pressing the "Add" button, I start loading the 8 already known Usher genes. Once loading is complete, the 8 rows are displayed in the table area.
The training genes panel after the validation from the server.

The data-sources panel

The data-sources panel.
I select the data-sources I want to use based on my expertise and on the short description displayed when pointing at a data-source.

The candidate genes panel

The candidate genes panel waiting for validation from the server.
By typing chr:9q32 and pressing the "Add" button, I start loading the 32 genes located on the human chromosomal band 9q32. As before, once loading is complete, the 32 rows are displayed in the table area.
The candidate genes panel.

The sprint plot panel

The sprint plot panel after the validation from the server.
Accessible once the prioritization is done, this panel shows the graphical version of the results.

The results panel

The results panel.
Accessible only once the prioritization is done, this panel shows the final results of the algorithm. Global as well as per data-source rankings and p-values are shown. Here, we observe that DFNB31 ranks first out of the 32 genes of the region.