ReClustering
Reclustering expression profiles with SVMs
People: Bart Hamers, Kathleen Marchal, Janick Mathys, Gert Thijs
Quality-based clustering yields clusters that satisfy a given quality criterion, i.e. the objects (genes) in each cluster have very similar expression profiles (1). Such clusters are ideal seeds for further analysis (motif finding, sequence annotation), because the quality criterion minimizes the number of false positives (i.e. non-coregulated genes that, from a biological point of view, do not belong to the cluster). Noise-sensitive methods such as motif finding benefit in particular from the reduced number of false positives.
The drawback of generating these sets of tightly coexpressed genes is loss of information. Depending on the stringency of the quality criterion, quality-based clustering rejects a number of true positives: genes that have a slightly different expression profile (possibly due to experimental error) but that, from a biological perspective, belong to the cluster. Therefore, after the initial clustering phase, the original high-quality seed clusters need to be extended.
To this end, an integrated method that combines a classification method (support vector machine, SVM) (2) with motif finding (Motif Sampler, 3) is being developed. For each seed cluster (i.e. a small cluster of tightly coexpressed genes), the promoter regions are searched for common regulatory elements. If a motif is found, it corroborates the validity of the cluster, since it points towards transcriptional coregulation of the genes in that cluster. The seed genes that contain the motif form the positive training set. The negative training set is assembled from genes belonging to seed clusters other than the cluster of interest; these negative samples are verified not to contain the motif. Both training sets are used to train an SVM, which subsequently classifies all genes of the dataset.
The input to the SVM consists of the expression-profile vectors of the genes. In this way, the SVM adds to the seed cluster new genes from the dataset that potentially belong to the same cluster. The output of the SVM is validated by a new round of motif finding: the new members of the cluster are scored for the presence of the motif. Genes in which the motif is present are used to extend the positive training set. The negative training set is similarly extended with negative examples. By iterating between SVM classification and motif finding, the original spherical seed clusters can be extended in a way that reflects both the geometrical distribution of the cluster and its biological relevance.
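The iteration between SVM classification and motif finding can be illustrated with a minimal sketch. It is not the implementation under development: the SVM is a simple linear one trained by Pegasos-style sub-gradient descent (the kernel and solver actually used are not stated here), and motif finding is replaced by a hypothetical lookup function `has_motif`, which stands in for a Motif Sampler run. The gene names, the toy profiles, the rule for extending the negative set, and the parameters `lam`, `epochs` and `max_rounds` are likewise assumptions made for illustration.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Linear SVM via Pegasos-style sub-gradient descent on the hinge loss.

    Assumption: a linear kernel is used only to keep the sketch short.
    """
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)  # last weight acts as the bias term
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            xi = list(X[i]) + [1.0]
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # sample violates the margin: hinge-loss step
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w


def decision(w, x):
    """Signed score of expression profile x; positive means 'in cluster'."""
    return sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))


def recluster(profiles, seed_genes, other_seed_genes, has_motif, max_rounds=5):
    """Alternate SVM classification and motif validation to extend a seed."""
    # Positive set: seed genes containing the motif.
    positives = {g for g in seed_genes if has_motif(g)}
    # Negative set: genes from other seed clusters verified to lack the motif.
    negatives = {g for g in other_seed_genes if not has_motif(g)}
    for _ in range(max_rounds):
        genes = sorted(positives | negatives)
        X = [profiles[g] for g in genes]
        y = [1 if g in positives else -1 for g in genes]
        w = train_linear_svm(X, y)
        new_pos, new_neg = set(), set()
        for g, x in profiles.items():
            if g in positives or g in negatives:
                continue
            predicted_member = decision(w, x) > 0
            if predicted_member and has_motif(g):
                new_pos.add(g)  # validated by motif: extend positive set
            elif not predicted_member and not has_motif(g):
                new_neg.add(g)  # assumed rule for extending negatives
        if not new_pos and not new_neg:
            break  # nothing changed, stop iterating
        positives |= new_pos
        negatives |= new_neg
    return positives


# Toy data (hypothetical): three-point expression profiles per gene.
profiles = {
    "seedA1": (2.0, 1.8, 2.2), "seedA2": (1.9, 2.1, 2.0),
    "seedA3": (2.1, 2.0, 1.9),
    "seedB1": (-2.0, -1.9, -2.1), "seedB2": (-1.8, -2.2, -2.0),
    "cand1": (1.4, 1.2, 1.6),     # similar profile, carries the motif
    "cand2": (1.3, 1.5, 1.1),     # similar profile, no motif
    "cand3": (-1.5, -1.2, -1.4),  # dissimilar profile, no motif
}
motif_genes = {"seedA1", "seedA2", "seedA3", "cand1"}


def has_motif(gene):
    """Hypothetical stand-in for scoring a promoter with the motif model."""
    return gene in motif_genes


extended = recluster(profiles, ["seedA1", "seedA2", "seedA3"],
                     ["seedB1", "seedB2"], has_motif)
print(sorted(extended))
```

In this sketch a gene joins the cluster only when both criteria agree: the SVM places its expression profile on the positive side and the motif check succeeds. A gene such as `cand2`, which is coexpressed but lacks the motif, is therefore left out of both training sets.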
Copyright © 1998 Katholieke Universiteit Leuven. Design: Gert Thijs. Last update: 2001/03/13