Perform gene prioritization using our Web server

 Perform gene prioritization using our Java server




Presentation

The identification of key genes involved in health and disease remains a formidable challenge. We develop novel bioinformatics methods to prioritize candidate genes underlying biological processes or diseases. Currently, our prioritization strategies are based on how similar a candidate gene is to a profile derived from genes already known to be involved in the process. Data from multiple heterogeneous sources (coding sequence, gene expression, annotation, literature, regulatory information, etc.) are integrated, or fused, into a global ranking of the candidates. Ongoing research addresses two questions: how to extend these strategies to data from multiple organisms (cross-species data fusion) using more elegant machine learning strategies (kernel methods), and how to perform prioritization in the absence of a training set.

Endeavour is a software application for the computational prioritization of candidate genes, based on a set of training genes. It works in three stages: training, scoring and fusion. In the first stage, information about the training genes (genes already known to play a role in the process under study) is retrieved from numerous data sources in order to build models. These sources include functional annotations, protein-protein interactions, regulatory information, expression data, sequence-based data and literature mining data. In the second stage, the models are used to score the candidate genes and to rank them according to their scores, yielding one ranking per data source. In the third stage, the per-source rankings are fused into a global ranking using order statistics. Endeavour is available for human, mouse, rat, fruit fly and worm.
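The training and scoring stages for a single data source can be illustrated with a minimal sketch. It rests on assumptions that are not taken from Endeavour itself: each gene is represented by a numeric feature vector, the model is simply the average vector of the training genes, and similarity is measured with the cosine. The function names and the toy data are hypothetical, and the fusion stage is not shown.

```python
from math import sqrt

def train_profile(training_vectors):
    """Training (assumed model): average the feature vectors of the training genes."""
    n = len(training_vectors)
    dim = len(training_vectors[0])
    return [sum(vec[i] for vec in training_vectors) / n for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(profile, candidates):
    """Scoring: rank candidates by similarity to the profile (rank 1 = most similar)."""
    ordered = sorted(candidates,
                     key=lambda gene: cosine(profile, candidates[gene]),
                     reverse=True)
    return {gene: position for position, gene in enumerate(ordered, start=1)}

# Hypothetical toy data for one data source (for example, expression profiles).
training = [[1.0, 0.0, 1.0], [0.8, 0.2, 0.9]]
candidates = {
    "geneA": [0.9, 0.1, 1.0],
    "geneB": [0.0, 1.0, 0.0],
    "geneC": [0.5, 0.5, 0.5],
}

ranks = rank_candidates(train_profile(training), candidates)
print(ranks)
```

Repeating this for every data source would give one ranking per source, which is the input expected by the fusion stage.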

Starting from a locus reported to be associated with DiGeorge syndrome and using Endeavour, we were able to propose YPEL1 as an interesting candidate. We further showed that YPEL1 knock-out zebrafish embryos exhibit features that are compatible with the human DiGeorge phenotypes. More recently, we have used Endeavour to optimize a genetic screen in Drosophila melanogaster in which we aimed to discover novel in vivo interactions with the developmental gene atonal. Starting from 180 deficiency lines, we identified 12 positive loci harboring more than 1100 genes in total. These loci were prioritized using Endeavour and only the genes in the top 30% were assayed, resulting in the identification of 12 positive genes. Researchers have also used Endeavour to look for genes involved in cleft lip / cleft palate from aCGH data, and to analyze the proteome of adipocytes. Please browse our reference section to find a list of Endeavour-related publications.

Data

Data from multiple heterogeneous sources are collected and integrated in our databases in order to perform gene prioritization. These include sequence data (genomic sequences of the genes and protein sequences of their products); expression data (usually EST data or large data sets covering the expression of thousands of genes over a wide range of different tissues/samples); functional annotations (usually from ontologies designed to describe the function of the gene products, their cellular localization, and the biomolecular pathways they are involved in); protein-protein interaction networks (describing which products interact with which other products, either physically or genetically); text mining data (information extracted from the scientific literature, which complements the functional annotations already gathered); regulatory information (a combination of known and predicted binding sites for transcription factors); and a priori probabilities (based on physical properties such as the length of the coding sequence or the number of introns). Data are collected regularly for human as well as for five major model organisms and stored locally for better performance.

Algorithms

We have so far developed two distinct computational approaches that use the same data. Our first algorithm uses basic machine learning techniques to model the biological process under study and then to score and rank the candidate genes using that model. The rankings corresponding to the different data sources are then fused using order statistics. Our second approach uses advanced machine learning methods (namely kernel methods) to prioritize the genes. We have evaluated the two approaches on the same benchmark, which is based on genetic disorders, and conclude that the kernel-based algorithm performs slightly better than our previous approach.
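To make the fusion step concrete, here is a minimal sketch of one standard order-statistics formulation, the Q statistic computed with a recursive formula. Whether the first algorithm uses exactly this formulation is an assumption; the function names, data source names and toy ranks are hypothetical. Each gene's rank in a data source is first converted to a rank ratio (rank divided by the number of ranked genes), and Q is the probability of obtaining rank ratios at least this good by chance. A smaller Q means the gene is ranked consistently near the top.

```python
from math import factorial

def q_statistic(rank_ratios):
    """Joint order-statistics probability Q = N! * V_N for a list of rank ratios.

    Uses the recursion V_0 = 1,
    V_k = sum_{i=1..k} (-1)^(i-1) * V_{k-i} / i! * r_{N-k+1}^i,
    with the rank ratios sorted in ascending order (r_1 <= ... <= r_N).
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0
    for k in range(1, n + 1):
        x = r[n - k]  # r_{N-k+1} in 1-based notation
        v.append(sum((-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
                     for i in range(1, k + 1)))
    return factorial(n) * v[n]

def fuse_rankings(per_source_ranks):
    """Fuse per-source rankings (1 = best) into one global ordering, best first."""
    genes = next(iter(per_source_ranks.values())).keys()
    q = {}
    for gene in genes:
        ratios = [ranks[gene] / len(ranks) for ranks in per_source_ranks.values()]
        q[gene] = q_statistic(ratios)
    return sorted(q, key=q.get)

# Hypothetical per-source rankings of three candidate genes.
per_source_ranks = {
    "expression": {"geneA": 1, "geneB": 3, "geneC": 2},
    "literature": {"geneA": 1, "geneB": 3, "geneC": 2},
}
fused = fuse_rankings(per_source_ranks)
print(fused)
```

This sketch assumes that every data source ranks every candidate; handling genes that are missing from a source would need an extra convention.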

Software

We have implemented the basic algorithm in an application named Endeavour. It is a Java-based client that can be started via Java Web Start. More recently, we have developed a web version that is more user-friendly but does not include all the options available in the Java client. Both tools use the same core and thus give exactly the same results when running the same prioritization. The development of the kernel-based application (with improved performance) is under way, and it should be made available during fall this year.

Research

Our current approach relies on a training set, which can be a limitation when almost nothing is known about the disease under study. We are circumventing this problem with an alternative approach that uses a gene expression data set specific to the genetic disorder under study. Ideally, this data set contains the gene expression levels in both the disease tissue and a reference tissue. The method also uses a gene network derived from multiple data sources. The differential expression levels are mapped onto the network, and our algorithm picks the candidates whose network neighborhood is highly differentially expressed. A second axis of research applies advanced machine learning techniques (e.g., kernel-based methods, Bayesian methods) to the gene prioritization problem in order to discover which methods lead to the best performance. A third focus is the development of a cross-species version, with which one could prioritize human genes using not only human data but also data from model organisms such as rat, mouse, fruit fly and worm. As with the kernel-based version, the development of the cross-species Endeavour is under way, and a beta version should also be made available during fall this year.
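The training-free idea described above, mapping differential expression onto a gene network and favoring candidates with a strongly deregulated neighborhood, can be illustrated with a minimal sketch. The scoring rule used here (mean absolute differential expression of the direct neighbors) is an assumption chosen for simplicity and not necessarily the rule used in our method. The network, the expression values and all names are hypothetical.

```python
def neighborhood_scores(network, diff_expr, candidates):
    """Score each candidate by the mean absolute differential expression
    of its direct neighbors in the network (assumed scoring rule)."""
    scores = {}
    for gene in candidates:
        neighbors = network.get(gene, ())
        values = [abs(diff_expr[n]) for n in neighbors if n in diff_expr]
        # Candidates without measured neighbors get a neutral score of 0.0.
        scores[gene] = sum(values) / len(values) if values else 0.0
    return scores

# Hypothetical network: candidate gene -> list of neighboring genes.
network = {
    "cand1": ["g1", "g2"],
    "cand2": ["g3"],
    "cand3": [],
}
# Hypothetical differential expression (e.g., log2 fold change, disease vs. reference).
diff_expr = {"g1": 2.0, "g2": -3.0, "g3": 0.5}

scores = neighborhood_scores(network, diff_expr, ["cand1", "cand2", "cand3"])
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Taking the absolute value treats up-regulation and down-regulation alike. A fuller version could also weight more distant neighbors or the confidence of each network edge.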