PGE Manual




Positional Gene Enrichment




Abstract

Searching for feature enrichment is now a widely used way to characterize a set of genes or proteins. While several tools are available for nominal features such as Gene Ontology annotations or KEGG pathways, very little has been proposed for numerical features such as gene coordinates on the chromosomes. For instance, microarray studies typically generate lists of genes that are differentially expressed in the sample subgroups under investigation. When studying diseases caused by genome alterations, such as cancer, it is of great interest to delineate the chromosomal regions that are significantly enriched in differentially expressed genes. In this article, we present a positional gene enrichment analysis method for identifying chromosomal regions that are significantly enriched in a given set of genes. The strength of our method relies (i) on an original query optimization approach that makes it possible to consider virtually all possible chromosomal regions for enrichment, and (ii) on the multiple testing correction, which discriminates truly enriched regions from those that can occur by chance. We have developed a Web tool, PGE (positional gene enrichment), implementing this method for the human genome. PGE allows one to submit a set of gene identifiers and visualize the enriched regions. For validation, we used PGE on published lists of differentially expressed genes observed in B-cell chronic lymphocytic leukemia, neuroblastoma tumors and tissues of Down syndrome patients. These analyses showed significant overrepresentation of known aberrant chromosomal regions.

Query Form

Image:Pge query.png

Entering a set of gene IDs, probeset IDs...

To search for regions enriched in a set of genes, first provide the genes of interest. You can either copy and paste the gene identifiers into the text box under Paste your IDs here, or upload a raw text file containing the gene identifiers (identifiers only, i.e. no descriptions or other text).

All gene identifiers must be of the same type: you cannot mix Ensembl IDs with Symbols, for example. No particular layout is required: IDs only need to be separated by white space characters (space, tabulation, new line), and they are case insensitive.

Once the genes of interest have been provided, specify the type of identifiers you are submitting (see next section).
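The input rules above (white space as the only separator, case-insensitive IDs) can be illustrated with a short Python sketch. This is not PGE's own parsing code: the function name `parse_ids` is hypothetical, and both the upper-casing and the removal of duplicates are assumptions made for the illustration.

```python
def parse_ids(text):
    """Split pasted text into identifiers (illustrative sketch, not PGE's code).

    Any run of white space (space, tabulation, new line) separates two IDs.
    IDs are upper-cased because they are case insensitive; dropping
    duplicates is an assumption of this sketch.
    """
    seen = set()
    ids = []
    for token in text.split():  # str.split() splits on any white space
        key = token.upper()
        if key not in seen:
            seen.add(key)
            ids.append(key)
    return ids


pasted = "ENSG00000139618  ensg00000141510\nENSG00000139618\tENSG00000012048"
print(parse_ids(pasted))
```

A list prepared this way can be pasted into the text box or saved as a raw text file for upload.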

Selecting the reference dataset and the mapping of IDs

PGE does not automatically detect the type of gene identifier you submit, so you have to specify it in order for the genes to be located in the genome. Currently, the following identifiers can be used:

  • Ensembl
  • Symbols
  • RefSeq (DNA or peptide)
  • Entrez
  • probe set IDs from Affymetrix (chip hgu133a and hgu95av2)

An additional parameter must be specified when submitting probe set IDs because multiple probe sets can map to the same gene. Currently, you can choose to map the probe set IDs to Ensembl IDs or Symbols. The mapping is performed by using the Affymetrix Human Genome Array Plate Set annotation files provided by Affymetrix.

Restricting the search to a particular chromosome

If you are specifically interested in the enrichment on one chromosome, you can restrict the calculation to that chromosome.

Bear in mind that when you use this option, the results will differ from those you would obtain by searching all chromosomes. This is because of the statistic used to measure the enrichment of a region. This statistic (based on the hypergeometric distribution) depends on the total number of genes in the reference set: when the search is restricted to one chromosome, that total is the number of genes on the chromosome rather than the number of genes in the genome, so the outcome changes. For example, if you submit a query set consisting of all the genes located on chromosome 21, searching all chromosomes will report the whole of chromosome 21 as enriched, whereas restricting the search to chromosome 21 will not return any enriched region.
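The chromosome 21 example can be reproduced with a small hypergeometric calculation. The sketch below is only an illustration under stated assumptions: the function name `hypergeom_upper_tail` and its parameter names are hypothetical, the gene counts (2000 genes in the genome, 25 on the chromosome) are toy numbers chosen to keep the arithmetic small, and the upper-tail probability is assumed to be the enrichment p-value.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) for a hypergeometric draw (illustrative sketch).

    N: genes in the reference set (genome or single chromosome)
    K: genes located in the region being tested
    n: genes in the query set
    k: query genes that fall in the region
    """
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Toy numbers (assumptions): 2000 genes in the genome, 25 of them on the
# chromosome; the query set is exactly those 25 genes.
p_genome = hypergeom_upper_tail(N=2000, K=25, n=25, k=25)
p_restricted = hypergeom_upper_tail(N=25, K=25, n=25, k=25)

print(f"all chromosomes:       p = {p_genome:.3g}")
print(f"restricted chromosome: p = {p_restricted:.3g}")
```

With the whole genome as reference, finding every query gene on one chromosome is extremely unlikely by chance, so the p-value is tiny. With the chromosome itself as reference, every possible draw lands on it, so the p-value is 1 and nothing is reported as enriched.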

Choosing a multiple testing adjustment method

By default, the "minPi" method is selected. The name stands for the cumulative distribution function of the minimum p-value: it gives the probability of obtaining a p-value at least as good (lower or equal) by chance, i.e. by submitting a random set of the same size. This is the preferred method because it discriminates truly enriched regions from those that could occur by chance. Unfortunately, it is practically impossible to model this distribution, so we approximate it by sampling to obtain an empirical function. The sampling is done by simulation: submitting 1 query set actually implies performing 501 queries. Because of the computational cost of these simulations (time complexity in O(q^2), with q the number of genes submitted), the minPi method is available only for query sets of 500 genes or fewer. Otherwise, the False Discovery Rate is applied.
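The idea of an empirical minimum p-value distribution can be sketched in Python as follows. This is a simplified illustration, not PGE's implementation: it assumes a single toy chromosome on which genes are identified by their rank, it takes every interval bounded by two query genes as a candidate region (a plain quadratic enumeration rather than the tool's query optimization), and all function names are hypothetical.

```python
import random
from math import comb


def upper_tail(N, K, n, k):
    """Hypergeometric P(X >= k): n draws from N genes, K of them in the region."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)


def min_p_value(query_ranks, n_genes):
    """Smallest p-value over all regions bounded by two query genes."""
    ranks = sorted(query_ranks)
    q = len(ranks)
    best = 1.0
    for i in range(q):
        for j in range(i + 1, q):              # O(q^2) candidate regions
            region_size = ranks[j] - ranks[i] + 1
            hits = j - i + 1
            best = min(best, upper_tail(n_genes, region_size, q, hits))
    return best


def min_pi_adjust(query_ranks, n_genes, n_sim=500, seed=0):
    """Return (observed min p-value, empirical probability of doing as well by chance)."""
    rng = random.Random(seed)
    observed = min_p_value(query_ranks, n_genes)
    q = len(query_ranks)
    # One minimum p-value per random query set of the same size.
    null = [min_p_value(rng.sample(range(n_genes), q), n_genes)
            for _ in range(n_sim)]
    return observed, sum(p <= observed for p in null) / n_sim


# Toy example (assumption): 300 genes, query = 8 adjacent genes.
observed, adjusted = min_pi_adjust(list(range(100, 108)), n_genes=300)
print(observed, adjusted)
```

With 500 simulated sets plus the real query, the sketch runs 501 queries in total, which matches the count given above. The quadratic loop over pairs of query genes is also why the cost grows quickly with the size of the query set.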

Results tabs

Image:Pge results.png
Different tabs are available for the results:

  • Query: lets you review the set of identifiers submitted.
  • Unmapped IDs: lists the submitted IDs that could not be mapped against the reference dataset.
  • Raw results: gives the raw text output, with one region per line and tab-separated fields (chromosome name, start position in base pairs, end position, p-value, adjusted p-value, number of genes of interest in that region, number of genes in that region).
  • BED: gives the results in BED format so that you can display them in the Ensembl Genome Browser; more information at http://www.ensembl.org/Homo_sapiens/karyoview
  • Chromosome view: a graphical display of the results. Significantly enriched regions are displayed in blue, and the depth of the blue reflects the significance of the p-value. The regions can be plotted according to p-values (-10log p-value) or by percentage of genes of interest in the region (% enriched). The width and the scale (height) of the drawings can be tuned.
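The raw results format described above is easy to read back into a script. The following Python sketch assumes exactly the seven tab-separated fields listed for the Raw results tab, in that order and without a header line; the class name, field names and the sample line are invented for illustration and are not actual PGE output.

```python
from dataclasses import dataclass


@dataclass
class Region:
    chromosome: str
    start: int             # start position in base pairs
    end: int               # end position in base pairs
    p_value: float
    adjusted_p_value: float
    genes_of_interest: int  # query genes located in the region
    genes_in_region: int    # all genes located in the region


def parse_raw_results(text):
    """Parse one region per line, fields separated by tabulations."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        chrom, start, end, p, adj_p, hits, total = line.split("\t")
        regions.append(Region(chrom, int(start), int(end),
                              float(p), float(adj_p), int(hits), int(total)))
    return regions


# Invented sample line (assumption), not real PGE output.
sample = "21\t14500000\t46900000\t1.2e-08\t0.002\t15\t230\n"
regions = parse_raw_results(sample)
for r in regions:
    percent = 100.0 * r.genes_of_interest / r.genes_in_region
    print(r.chromosome, r.start, r.end, r.adjusted_p_value, f"{percent:.1f}% enriched")
```

The percentage computed in the loop corresponds to the "% enriched" value that the Chromosome view can plot.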

Mapping of gene identifiers

  • Ensembl, Symbols, Entrez, RefSeq: To map these identifiers to their chromosome locations, we use the Ensembl 42 release.
  • Affymetrix probe set IDs: These identifiers are mapped via the Affymetrix Human Genome Array Plate Set annotation files provided by Affymetrix. We use those files to map probe set IDs to Ensembl or Symbol identifiers and to genome locations. Probe set IDs that map to more than one location or to more than one Ensembl ID (or Symbol) are discarded.
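The rule for probe sets (discard any ID that maps to more than one target, then collapse probe sets that share a gene) can be sketched in Python. The function name and the probe set and gene names below are hypothetical placeholders, and the input is assumed to be a list of (probe set ID, gene ID) pairs already extracted from an annotation file.

```python
from collections import defaultdict


def unambiguous_mapping(pairs):
    """Keep only probe sets that map to exactly one gene (illustrative sketch)."""
    targets = defaultdict(set)
    for probeset, gene in pairs:
        targets[probeset].add(gene)
    return {ps: next(iter(genes))
            for ps, genes in targets.items() if len(genes) == 1}


# Hypothetical annotation pairs (placeholders, not real Affymetrix IDs).
annotation = [
    ("probe_A", "GENE1"),
    ("probe_B", "GENE1"),   # a second probe set for the same gene
    ("probe_C", "GENE2"),
    ("probe_C", "GENE3"),   # ambiguous: maps to two genes, so discarded
]

mapping = unambiguous_mapping(annotation)
genes = sorted(set(mapping.values()))  # several probe sets may share one gene
print(mapping)
print(genes)
```

In this example probe_C is dropped, and probe_A and probe_B collapse to the single gene GENE1, so each gene is counted once in the enrichment test.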