CHD:Gene hunting example
 Prioritization of candidate genes for CHD loci identified by array CGH
Fifteen well-delineated genome regions that were found to be imbalanced in CHD patients were selected for further analysis: 10 from a previously published study (Thienpont et al., 2007) and 5 imbalances detected during a follow-up study. Seven regions that contain a known CHD gene served as positive controls (Figure 1). Of the remaining 8 regions, we excluded 2 patients carrying multiple chromosomal imbalances from further analysis, while the 6 other chromosomal imbalances were used to identify novel putative CHD genes. This strategy has previously been successful, for example, in studies of the recurrent 22q11 deletion that led to the identification of TBX1 mutations causing CHDs and DiGeorge syndrome (Yagi et al., 2003). The same example also demonstrates the challenges associated with this strategy, since many other candidate genes within the deletion interval had to be investigated in model organisms before identifying TBX1, thus requiring much time and resources for the identification (Scambler et al., 2000). Therefore, to enhance candidate gene selection in our regions, we applied the gene prioritization strategy (Endeavour) which is incorporated in CHDWiki.
 Screening of candidate genes for CHD by expression analysis in zebrafish development
The 6 known CHD genes and 25 high-ranking genes selected from the 6 regions containing no known CHD gene were further analyzed in zebrafish (Figures 1 and 2). Their expression in the developing zebrafish heart was evaluated at 3 to 6 developmental stages that are key for cardiac development: 12 to 18 hours post fertilization (hpf), heart cell specification and initiation of migration to the midline; 22 hpf, heart cone stage at the midline; 30 to 36 hpf, heart elongation and looping, initiation of chamber specification and valve formation; and 48 hpf, valve formation. Expression was analyzed by standard colorimetric whole-mount in situ hybridization using DIG-labeled antisense RNA probes. RNA probes were produced by in vitro transcription from Sp6- or T7-tagged DNA that was PCR-amplified from a cDNA pool. This pool was reverse transcribed from an RNA mix extracted using TRIzol (Invitrogen, Paisley, UK) from zebrafish embryos at the 6 defined developmental stages (12, 18, 22, 30, 36 and 48 hpf). Nested PCR was applied when PCR using a single primer pair did not yield a single product. Primer pairs, including the external primers used for nested PCR, are listed in supplementary table 2.
PCR products were sequenced to verify that the correct probes were generated. When no product was obtained after nested PCR, we assumed that the transcript was not present in the cDNA pool and that the gene was therefore not expressed at the relevant stages. TBX1 expression in the developing zebrafish has been described repeatedly (Stalmans et al., 2003; Zhang et al., 2006), so we did not reanalyze it in this study. All experiments were carried out using the wild-type AB zebrafish line.
 Prioritisation of CHD candidate genes
First, all genes from the selected regions were prioritized. In 6 regions containing a known CHD gene, the gene prioritization tool readily ranked at least one known CHD gene first among all other imbalanced genes (Figure 1). In a second step, genes affected by the six selected imbalances not containing known CHD genes were prioritized in an attempt to find novel CHD genes. For each patient, we analysed whether the imbalance affected a known CHD gene that could explain the CHD observed in that patient. If so, the patient was added to the group that serves as a positive control (Figure 1); otherwise, the patient was added to the group in which we attempted gene hunting (Figure 2).
Imbalances in patients from Figure 2 were further analysed using a more sophisticated version of the Endeavour algorithm than the one recently published. Since different developmental mechanisms contribute to cardiac development, a training set was constructed for each putative mechanism involved: left-right asymmetry establishment (LR), vascularisation, the primary and secondary heart field, and neural crest cells. A dataset containing genes shown to cause CHDs upon haploinsufficiency (both syndromic and non-syndromic, as of August 2006) was constructed to enrich our results for dosage-sensitive genes implicated in human cardiac defects and to enable scoring of developmental pathways that are not explicitly annotated in CHDs. A dataset containing all genes from these sets was assembled to accommodate potentially undiscovered crosstalk between the different developmental processes that could generate a significant signal.
As a negative control, we also constructed training sets for processes that are unrelated to cardiac development: Parkinson disease, atherosclerosis and inflammatory bowel disease. All datasets are available here.
Candidate genes were scored using all constructed training sets except for the LR set, which was only used when defects suggestive of LR abnormalities were reported in patients with similar imbalances, or when no patients with similar imbalances had been described (del22q12.2 and delXp22).
Using leave-one-out cross-validations (LOOCVs), receiver operating characteristic (ROC) curves were constructed. The area under the ROC curve (AUC) was measured to assess the performance of each training set (performance = 1 - AUC of ROC). An AUC around 0.5 is equivalent to random ranking of candidate genes, while an AUC approaching 1 is equivalent to first-place rankings for most training set genes.
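The relation between leave-one-out ranks and the AUC can be illustrated with a minimal sketch. It assumes that each left-out training gene is ranked once among a fixed number of candidates and that the AUC is estimated as the average fraction of other candidates ranked below the left-out gene. The function name `loocv_auc` and its parameters are hypothetical and are not part of Endeavour or CHDWiki.

```python
from statistics import mean


def loocv_auc(ranks, n_candidates):
    """Rank-based AUC estimate from leave-one-out runs (illustrative only).

    ranks: rank (1 = best) obtained by each left-out training gene
    n_candidates: number of genes ranked in each run, including the
                  left-out gene itself (assumed constant across runs)
    """
    if n_candidates < 2:
        raise ValueError("need at least one competing candidate per run")
    # For one run, the fraction of competitors ranked below the left-out
    # gene equals the probability that it outranks a random competitor.
    return mean((n_candidates - r) / (n_candidates - 1) for r in ranks)


# A training set whose genes are always ranked first scores 1.0;
# a mix of first and last places averages out to 0.5 (random-like).
print(loocv_auc([1, 1, 1], 100))
print(loocv_auc([1, 100], 100))
```

Under this assumption, ranks scattered uniformly over the candidate list give a value near 0.5, matching the interpretation of the AUC given in the text.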
We arbitrarily set a ROC AUC threshold of 0.6 for data sources to eliminate data sources that do not contain (sufficient) information and contribute merely noise to the predictions.
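As a rough illustration of this filtering step, the sketch below keeps only data sources whose AUC reaches the 0.6 threshold. The data source names and AUC values are invented placeholders, and treating a value exactly at the threshold as passing is an assumption, since the text does not specify it.

```python
AUC_THRESHOLD = 0.6  # threshold stated in the text


def informative_sources(auc_by_source, threshold=AUC_THRESHOLD):
    """Return the data sources whose ROC AUC reaches the threshold.

    auc_by_source: mapping of data source name -> AUC from the LOOCV.
    Assumption: a source exactly at the threshold is kept.
    """
    return {name: auc for name, auc in auc_by_source.items() if auc >= threshold}


# Hypothetical example values, not results from the study.
example = {"source_a": 0.82, "source_b": 0.55, "source_c": 0.60}
print(sorted(informative_sources(example)))
```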
Extra data sources were added and the algorithm structure was adapted, as described in a separate manuscript.
We demonstrated that all training sets were able to efficiently prioritize genes known to be involved in the related developmental field or process. We also calculated the median of the p-values of the left-out genes from the LOOCV (supplementary table 1).
| Primary heart field
| Secondary heart field
| Dosage-sensitive CHD genes
| Cardiac Neural crest cells
| Left-right asymmetry establishment
| All genes in the above sets
A mixed set of criteria was used to select genes for further analysis:
- From each candidate list, the 2 highest ranking genes from the overall prioritization were selected.
- If the estimated penetrance of CHDs in similar aberrations was high (>25% or >50%), the 3rd-ranking gene, or the 3rd- and 4th-ranking genes respectively, from the overall prioritization were also selected.
- If the p-value for the highest-ranking gene in any prioritisation using an individual training set was higher than the median p-value of the LOOCV of this training set (supplementary table 1), this first-ranked gene was also selected.
In this way, we obtained a list of 25 genes. From this list, we eliminated 1 gene (TP73) because it also ranked in the top 10% in every prioritization that was based on the negative control training set. We moreover eliminated 3 genes that were shown not to cause CHDs upon mutation in at least 2 independent reports (CHEK2, RS1 and CDKL5).
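The selection criteria and the negative-control filter described above can be sketched as follows. This is a simplified illustration, not the code used in the study: the function names, argument layout and example gene labels are hypothetical, penetrance is expressed as a fraction, and the direction of the p-value comparison follows the wording of the text.

```python
def select_candidates(overall_ranking, penetrance, per_set_top, median_loocv_p):
    """Apply the mixed selection criteria to one candidate region.

    overall_ranking: genes ordered by the overall prioritization (best first)
    penetrance: estimated CHD penetrance in similar aberrations (0..1)
    per_set_top: training set name -> (first-ranked gene, its p-value)
    median_loocv_p: training set name -> median LOOCV p-value
    """
    n_top = 2                      # always take the 2 highest-ranking genes
    if penetrance > 0.50:
        n_top = 4                  # also the 3rd and 4th ranking genes
    elif penetrance > 0.25:
        n_top = 3                  # also the 3rd ranking gene
    selected = list(overall_ranking[:n_top])
    for set_name, (gene, p_value) in per_set_top.items():
        # Comparison direction as stated in the text ("higher than").
        if p_value > median_loocv_p[set_name] and gene not in selected:
            selected.append(gene)
    return selected


def in_top_fraction_everywhere(gene, rankings, top_fraction=0.10):
    """True if the gene is in the top fraction of every given ranking.

    Used here to flag genes that also rank highly with every
    negative-control training set.
    """
    if not rankings:
        return False
    for ranking in rankings:
        cutoff = max(1, int(len(ranking) * top_fraction))
        if gene not in ranking[:cutoff]:
            return False
    return True


# Hypothetical region with placeholder gene labels.
ranking = ["G1", "G2", "G3", "G4", "G5"]
print(select_candidates(ranking, 0.30, {"setA": ("G5", 0.05)}, {"setA": 0.01}))
```

A gene returned by `select_candidates` would then be dropped if `in_top_fraction_everywhere` is true for the prioritizations based on the negative-control training sets.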
In order to further analyze these rankings we investigated the expression of all genes in the developing zebrafish heart.
|Region||HGNC name||Human Ensembl ID||Zebrafish Ensembl ID||forward||reverse||external forward||external reverse||probe|
|14q22q23.1||BMP4||ENSG00000125378||ENSDARG00000019995|| a probe was already available
|17q25.3||no ID||ENSG00000187603||NA||GGCCGCGGtgtgcctggttattccgaac||CATTTAGGTGACACTATAGAAgcattcactggttatgctggt||atctgatcgcaaggaggttg||cgtcagcctttccctgatac|| NO|
 Expression screening of highly ranked CHD candidate genes
We next set up a gene expression screen in developing zebrafish, hypothesizing that genes expressed in the developing zebrafish heart are more likely to be involved in human heart developmental disorders. Analyzing the expression of the 8 aforementioned known CHD genes as a positive control demonstrated that three of them, NKX2-5, TBX1 and NOTCH1, have a specific expression in the developing zebrafish heart (i.e., a restricted expression pattern that includes the developing heart). EHMT1, ATRX, CREBBP, NSD1 and FBN2 are genes that cause syndromic CHDs, and they appeared to be expressed throughout the developing zebrafish without a specific expression restricted to the developing heart. This pattern is consistent with the multisystem involvement of the syndromes caused by mutations in these genes. Nevertheless, genes that cause syndromic CHDs when mutated (such as TBX1, which causes DiGeorge syndrome) can have specific expression in the developing heart.
In 4 of the 6 regions, no gene showed specific expression in the developing heart, while some showed ubiquitous expression. This suggests that the unknown genes in these regions belong to the syndromic subset: they cause CHDs as well as the other abnormalities that make up the syndromic phenotype observed in the patients. In the 2 remaining regions, a single gene (the highest-ranking) was found to be specifically expressed in the developing heart. As illustrated on its CHDWiki gene page, HAND2 in 4q34.1 is expressed at 12 hours post fertilization (hpf) in the lateral plate mesoderm, where the cardiac cells become specified, and later in development (22-48 hpf) in the primitive heart and the pharyngeal arches that give rise to the outflow tract and large vessels in humans. BMP4 in 14q22 is specifically expressed at 18 hpf in the specified cardiac cells (Figure 2) as they migrate towards the midline, later in the developing heart, and at 36 hpf in the regions where valve development initiates.
In the CHDWiki, advanced analysis capabilities were exploited for gene hunting: genes located in indels found in CHD patients (Figure 1 and Figure 2) were prioritized using the integrated prioritization tool. Genes linked to CHDs and affected by an indel were effectively prioritized. For regions not containing known CHD genes, high-ranking genes were further analyzed for their expression in developing zebrafish embryos. Amongst those genes, BMP4 and HAND2 showed a specific expression in the developing zebrafish heart, suggesting that reduced dosage of these genes causes the CHD observed in the patients. Additionally, both genes are known to be required for cardiac development in a dosage-dependent manner in mice (Goldman et al., 2006; McFadden et al., 2005).
Figure 1: patients carrying imbalances encompassing known CHD genes and expression of these CHD genes in the developing zebrafish heart
Figure 2: patients carrying imbalances not encompassing known CHD genes and expression of high-ranked genes in the developing zebrafish heart