
The aim is to analyze the structure and evolution of genomes and to use this information to identify regulatory interactions among genes. The work will be organized along two tracks: the first focuses on genome structure and evolution (annotation; organization and comparative genomics; evolution), and the second aims to identify regulatory interactions, in particular by developing regulatory sequence analysis tools for the identification of cis-regulatory elements and of the regulatory functions of microRNAs.
This subworkpackage covers annotation (i.e., the correct identification of genes and their structure) in the genome sequence. Besides protein-encoding genes, many other entities, such as genes encoding RNAs, transposable elements, and pseudogenes, may interfere with genome replication and expression. Their correct identification is crucial to any further genome-based analysis.
Apart from the activity of transposable elements (TEs), it is mainly gene and genome duplications that shape the structure and evolution of eukaryotic genomes. However, many questions remain as to how exactly these duplications (alongside the evolution of cis-regulation, see further) have contributed to the evolution of developmental processes and the general increase in biological complexity. Therefore, to understand how gene duplication and the subsequent divergence of genes and their interactors have enhanced biological complexity, genetic networks need to be inferred (see also network inference).
Besides the detection of putative genes, another essential aspect of genome annotation is the detection of regulatory motifs. Existing bioinformatics techniques for this problem, based either on word-counting methods or on probabilistic methods, will be further developed in this workpackage. Because these motifs correspond to the genomic signatures of transcriptional regulation, they form an important input for WP3 (data integration).
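As an illustration of the word-counting family of methods mentioned above, the following is a minimal sketch, not the method to be developed in this workpackage. It counts overlapping k-letter words in a set of candidate regulatory sequences and ranks them by how strongly they are over-represented relative to a background set. The function names, the pseudocount, and the binomial z-like score are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def count_words(sequences, k):
    """Count every overlapping k-letter word across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented_words(foreground, background, k, pseudocount=1.0):
    """Rank foreground words by a z-like over-representation score.

    The expected count of each word is derived from its (smoothed)
    background frequency, using a binomial approximation.
    """
    fg = count_words(foreground, k)
    bg = count_words(background, k)
    n_fg = sum(fg.values())
    n_bg = sum(bg.values())
    n_possible = 4 ** k  # number of distinct DNA words of length k
    scores = {}
    for word, observed in fg.items():
        # Smoothed background probability, so unseen words never get p = 0.
        p = (bg.get(word, 0) + pseudocount) / (n_bg + pseudocount * n_possible)
        expected = n_fg * p
        scores[word] = (observed - expected) / sqrt(expected * (1.0 - p))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Called as `overrepresented_words(promoters, background, k=6)`, where both arguments are lists of sequence strings, the sketch returns (word, score) pairs with the most over-represented word first. A realistic implementation would also have to handle reverse complements, overlapping self-similar words, and multiple testing.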
High-throughput genomics and proteomics create huge amounts of data that need to be analyzed to extract biologically meaningful information. Specific methods and algorithms will be developed for the analysis of large amounts of data from genotyping, transcriptomics, proteomics, interactome, and metabolome analyses, in order to identify or localize the genes, proteins, and metabolites involved in specific phenotypes or cellular processes of interest.
The majority of medically important phenotypes (e.g., susceptibility to cancer, cardiovascular diseases, diabetes, or Crohn's disease in human medicine) are complex traits, meaning that they are influenced by multiple, often interacting, environmental and genetic risk factors. Identifying the genes that influence complex traits has become possible thanks to recent advances in marker genotyping technology and whole-genome association studies (e.g., Hirschhorn and Daly, 2005). The rate-limiting step in applying these techniques is the lack of appropriate statistical methods for analysis. Developing these methods, building upon recent developments in the fields of supervised learning, graphical probabilistic methods, and independent component analysis, will be the topic of this WP. This WP is closely connected with WP3.1 (gene prioritization).
Since the introduction of microarrays for expression profiling, microarray technology has grown explosively. Beyond the refinement of expression microarrays (e.g., from cDNA arrays to short and long oligonucleotide microarrays), microarrays have also expanded into many other applications. Most notable are Chromatin ImmunoPrecipitation chips for the detection of transcription factor binding to genomic DNA, tiling arrays that measure expression beyond classical protein-coding sequences, SNP chips for genotyping, and Comparative Genomic Hybridization arrays for the detection of chromosomal aberrations in congenital anomalies and cancer. All these microarray technologies require careful development of optimal data analysis procedures, which is an objective of this workpackage.
High-throughput proteomics is one of the most active research fields in the context of systems biology. However, high-throughput technologies such as proteomics mass spectrometry (e.g., surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF)) introduce artifacts that complicate data interpretation and exploitation (for example, because of sensitivity to instrument calibration and sample preparation protocols). To take full advantage of these high-throughput proteomics approaches, a significant further research effort is required to adapt and enhance existing bioinformatics methods.
High-throughput methods are an appropriate starting point to identify genes or proteins that belong to a pathway, process, or disease of interest, for example in the analysis of regulatory networks or in medical applications. However, these methods often produce a high number of false positives and false negatives. Data integration can mitigate these difficulties by overlapping different data types to eliminate inconsistencies. In particular, the research teams will develop gene prioritization methods that integrate multiple data sources to identify the best candidate proteins or genes associated with a disease or biological process of interest. They will also develop methods to simultaneously analyze coupled systems biology data sets at the level of the transcriptome, proteome, metabolome, or interactome, with the goal of identifying candidate components of biological networks.
Present high-throughput technologies and genetic studies (see WP2.1) often fail to deliver immediate biological clues that contribute to an improved understanding of biological systems. Instead, researchers are confronted with large lists of candidate genes from genetic, genomic, transcriptomic, or proteomic approaches that require further filtering and validation. This further processing involves extensive and cumbersome manual browsing of the literature and of many databases. There is thus a need for a computational methodology to prioritize candidate genes based on multiple, heterogeneous data sources.
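To make the idea of prioritizing candidates across heterogeneous sources concrete, here is a minimal sketch based on plain rank averaging. It is a deliberately simple stand-in, not the methodology this workpackage will develop: each data source is assumed to provide a numeric score per gene (higher meaning more promising), scores are converted to rank ratios within each source, and the ratios are averaged. The function names and the handling of missing genes are assumptions made for the example.

```python
def rank_ratios(scores):
    """Map each gene to its rank ratio in (0, 1] within one data source.

    The best-scoring gene gets the smallest ratio (1 / number of genes).
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}


def prioritize(sources):
    """Combine several {gene: score} dicts into one ranked gene list.

    Rank ratios are averaged over the sources in which a gene appears;
    a gene missing from a source is simply skipped for that source.
    """
    totals, counts = {}, {}
    for scores in sources.values():
        for gene, ratio in rank_ratios(scores).items():
            totals[gene] = totals.get(gene, 0.0) + ratio
            counts[gene] = counts.get(gene, 0) + 1
    mean_ratio = {gene: totals[gene] / counts[gene] for gene in totals}
    return sorted(mean_ratio, key=mean_ratio.get)


# Toy input with hypothetical gene names and scores.
sources = {
    "expression": {"G1": 0.9, "G2": 0.4, "G3": 0.1},
    "literature": {"G1": 5, "G2": 7, "G3": 1},
    "interaction": {"G1": 0.8, "G2": 0.5},
}
ranking = prioritize(sources)
```

Working on ranks rather than raw scores sidesteps the fact that the sources use incomparable scales. Skipping missing values is the simplest possible choice and can favor poorly covered genes, so a practical method would need a more principled treatment.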
Top-down inference aims at reconstructing structural networks of interacting genetic entities from high-throughput data. This data-demanding approach is, given current data availability, often underdetermined. Moreover, most efforts in integrative modeling focus on reconstructing network structures at a single molecular level at a time (the transcriptional, protein interaction, or metabolic network). Reconstructing a comprehensive network model that includes all molecular levels and their interactions is, however, an essential first step towards building a complete dynamical pathway model (bottom-up systems biology, WP.4). In this WP we will formalize the problem and devise algorithms to cope with the intricacies of the separate data sources.
A first goal will be to analyze and compare deterministic vs. stochastic models, as well as continuous-time vs. discrete-time models, for the modeling and simulation of biological networks. Particular focus will be on regulatory networks underlying circadian rhythms. A complementary focus will be the link between the topology of regulatory networks and their dynamical properties, as well as metabolic pathway inference methods. Threshold behavior and oscillatory phenomena in metabolic networks involving phosphorylation-dephosphorylation cascades will be investigated, with a focus on the network of cyclin-dependent kinases controlling the cell cycle.
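The threshold behavior of a phosphorylation-dephosphorylation cycle can be illustrated with the classical Goldbeter-Koshland steady-state expression for a single covalent modification cycle. The sketch below is a textbook example under Michaelis-Menten assumptions, not one of the cell cycle models targeted by the project. The parameter names are chosen for the example: v1 and v2 are the maximal kinase and phosphatase rates, and j1 and j2 are the Michaelis constants divided by the total substrate concentration.

```python
from math import sqrt


def goldbeter_koshland(v1, v2, j1, j2):
    """Steady-state phosphorylated fraction of a covalent modification cycle.

    Solves v1 * (1 - w) / (j1 + 1 - w) = v2 * w / (j2 + w) for w in [0, 1].
    """
    b = v2 - v1 + v2 * j1 + v1 * j2
    return 2.0 * v1 * j2 / (b + sqrt(b * b - 4.0 * (v2 - v1) * v1 * j2))


def net_flux(w, v1, v2, j1, j2):
    """Phosphorylation rate minus dephosphorylation rate at fraction w."""
    return v1 * (1.0 - w) / (j1 + 1.0 - w) - v2 * w / (j2 + w)


# Saturated enzymes (small j): a 20% change in kinase activity around
# v1 = v2 switches the substrate from mostly off to mostly on.
low_sharp = goldbeter_koshland(0.9, 1.0, 0.01, 0.01)
high_sharp = goldbeter_koshland(1.1, 1.0, 0.01, 0.01)

# Unsaturated enzymes (j = 1): the same change gives a graded response.
low_graded = goldbeter_koshland(0.9, 1.0, 1.0, 1.0)
high_graded = goldbeter_koshland(1.1, 1.0, 1.0, 1.0)
```

When both enzymes operate near saturation the response becomes switch-like (zero-order ultrasensitivity). Chaining such cycles into cascades or adding feedback is one way in which thresholds and oscillations can arise.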
Complex behaviors arise in biological systems because of their inherent nonlinearity. They include transitions from stable to unstable steady states, or from simple to complex modes of oscillatory behavior. Understanding these behaviors is a central question of systems biology but remains a notoriously difficult mathematical problem. In this subworkpackage, we will tackle one of the main current issues in modeling: whether deterministic models remain valid when the numbers of molecules involved are small, as may occur in cellular conditions. Indeed, in the presence of small amounts of mRNA or protein molecules, the effect of molecular noise on circadian rhythms may become significant and may compromise the emergence of coherent periodic oscillations.
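The dependence of molecular noise on copy number can be illustrated with a stochastic simulation of the simplest possible system: constant synthesis and first-order degradation of a single species. This is a minimal sketch using the Gillespie algorithm on a hypothetical birth-death process, not a circadian model. The rate constants, function names, and seed are assumptions made for the example. The deterministic model predicts a steady state of k_syn / k_deg regardless of scale, whereas the relative size of the stochastic fluctuations grows as that number shrinks.

```python
import random
from math import sqrt


def gillespie_birth_death(k_syn, k_deg, n0, t_end, seed=0):
    """Simulate synthesis (rate k_syn) and degradation (rate k_deg * n)."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_syn = k_syn          # propensity of the synthesis reaction
        a_deg = k_deg * n      # propensity of the degradation reaction
        a_total = a_syn + a_deg
        t += rng.expovariate(a_total)  # waiting time to the next event
        if t >= t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_syn:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


def time_weighted_stats(times, counts, t_end):
    """Return the time-averaged mean and coefficient of variation."""
    mean = 0.0
    mean_sq = 0.0
    for i, n in enumerate(counts):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        dt = t_next - times[i]
        mean += n * dt
        mean_sq += n * n * dt
    mean /= t_end
    mean_sq /= t_end
    variance = max(mean_sq - mean * mean, 0.0)
    return mean, sqrt(variance) / mean


def relative_noise(k_syn, k_deg=1.0, t_end=20.0, seed=0):
    """Start at the deterministic steady state and measure fluctuations."""
    n0 = int(k_syn / k_deg)
    times, counts = gillespie_birth_death(k_syn, k_deg, n0, t_end, seed)
    return time_weighted_stats(times, counts, t_end)
```

Comparing `relative_noise(10.0)` with `relative_noise(500.0)` shows noticeably larger relative fluctuations at the lower copy number, which is the regime in which the validity of deterministic models comes into question. Extending the reaction list to a negative feedback loop would be the natural next step towards an oscillator.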
A key challenge in the area of biological system modeling is to model the regulatory mechanisms that produce cellular rhythms. The interplay between a large number of variables coupled through multiple regulatory interactions makes it difficult to fully grasp the dynamics of oscillatory behavior without resorting to modeling and computer simulations. Our aim is to develop detailed molecular models for the regulatory networks that control (1) circadian rhythms and (2) the cell cycle clock. We will also investigate the coupling between these two networks and its potential effect on anticancer drugs.
There is a deep mismatch between the analysis of high-throughput data and the quantitative or qualitative modeling of cellular processes. Even with proper data integration, the gap remains large, in part because precise modeling is only feasible for well-characterized systems. Little research has been done on methods to bridge this gap. A significant objective of the project will be to make progress on this problem, and all partners will focus on developing methods towards this goal. Several directions will be explored, among them integrated refinement procedures, in which the current model is tightly matched to available genomic and high-throughput data to detect candidates for extension of the model.