Calculation of the higher order background models |
---|
All GenBank files corresponding to the listed accession numbers were downloaded. Intergenic regions were delineated using the modules of INCLUSive (Thijs et al., 2002), and higher-order background models were calculated from these intergenic sequences. To build a model of order m, the occurrences of all oligonucleotides of length m and m+1 are counted. These counts yield, respectively, the frequency table of all oligonucleotides and the transition matrix of the background model. The transition matrix stores the probability of each nucleotide given the m preceding nucleotides in the sequence. The following table shows part of the third-order background model based on the intergenic sequences of Escherichia coli (K12). The first element of the matrix thus represents the probability of finding an A in the intergenic sequences of E. coli given that the three preceding nucleotides are all A.
Context (three preceding nucleotides) | A | C | G | T |
---|---|---|---|---|
AAA | 0.3788 | 0.1715 | 0.1655 | 0.2842 |
AAC | 0.3530 | 0.2052 | 0.2405 | 0.2014 |
AAG | 0.2678 | 0.2447 | 0.2461 | 0.2414 |
AAT | 0.3142 | 0.2007 | 0.2009 | 0.2842 |
ACA | 0.3625 | 0.1596 | 0.2208 | 0.2571 |
... | ... | ... | ... | ... |
TTC | 0.3325 | 0.2136 | 0.1680 | 0.2859 |
TTG | 0.2333 | 0.3207 | 0.1180 | 0.3279 |
TTT | 0.2301 | 0.1907 | 0.1801 | 0.3992 |
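The counting procedure described above can be sketched in Python. This is a minimal illustration, not the INCLUSive implementation: the function names `count_oligos` and `transition_matrix`, the skipping of words that contain ambiguous symbols, and the uniform fallback for contexts that never occur are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def count_oligos(sequences, length):
    """Count all overlapping oligonucleotides of the given length."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - length + 1):
            word = seq[i:i + length]
            # Assumption: words containing N or other symbols are skipped.
            if set(word) <= set(ALPHABET):
                counts[word] += 1
    return counts


def transition_matrix(sequences, order):
    """Return {context: {nucleotide: P(nucleotide | context)}}.

    The probabilities are estimated from counts of oligonucleotides
    of length order + 1.
    """
    counts = count_oligos(sequences, order + 1)
    matrix = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        row = {b: counts[context + b] for b in ALPHABET}
        total = sum(row.values())
        # Assumption: contexts that never occur get a uniform row.
        matrix[context] = {
            b: (row[b] / total if total else 0.25) for b in ALPHABET
        }
    return matrix
```

Each row of the resulting matrix sums to 1, as do the rows of the table above. For a third-order model, `transition_matrix(intergenic_sequences, 3)` returns 64 rows, one per context of three nucleotides.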
Overview of the computed models
The table below lists the accession numbers for which a background model was calculated, a description of each entry, and the single-nucleotide frequencies of its intergenic sequences. Note first that for Vibrio cholerae, Brucella melitensis and Deinococcus radiodurans, the intergenic sequences of both chromosomes were combined to calculate a single genome background model. Second, separate background models were calculated for genomes and plasmids.
Accession Number | Description of the organism | Frequency A | Frequency C | Frequency G | Frequency T |
---|---|---|---|---|---|
NC_000854 | Aeropyrum pernix | 0.241861 | 0.255022 | 0.260723 | 0.242394 |
NC_002147 | Agrobacterium tumefaciens (plasmid pTi-SAKURA) | 0.240459 | 0.259457 | 0.26361 | 0.236475 |
NC_003306 | Agrobacterium tumefaciens (C58 Dupont plasmid AT) | 0.245357 | 0.258775 | 0.261389 | 0.234479 |
NC_003304 | Agrobacterium tumefaciens (C58 U.Wash circular chromosome) | 0.242733 | 0.26016 | 0.26263 | 0.234478 |
NC_003305 | Agrobacterium tumefaciens (C58 U.Wash linear chromosome) | 0.243302 | 0.258823 | 0.261405 | 0.23647 |
NC_003062 | Agrobacterium tumefaciens (C58 circular chromosome) | 0.243589 | 0.259362 | 0.258827 | 0.238222 |
NC_003063 | Agrobacterium tumefaciens (C58 linear chromosome) | 0.245508 | 0.259408 | 0.257901 | 0.237183 |
NC_003064 | Agrobacterium tumefaciens (C58 plasmid AT) | 0.243119 | 0.259851 | 0.262967 | 0.234063 |
NC_003065 | Agrobacterium tumefaciens (C58 plasmid Ti) | 0.263051 | 0.239196 | 0.241812 | 0.255941 |
NC_000918 | Aquifex aeolicus | 0.30665 | 0.194991 | 0.204513 | 0.293847 |
NC_000917 | Archaeoglobus fulgidus | 0.317611 | 0.180168 | 0.187041 | 0.315179 |
NC_003995 | Bacillus anthracis (A2012) | 0.359211 | 0.131324 | 0.169541 | 0.339925 |
NC_001496 | Bacillus anthracis (virulence plasmid PX01) | 0.361732 | 0.141994 | 0.161688 | 0.334586 |
NC_002570 | Bacillus halodurans | 0.319638 | 0.172334 | 0.207455 | 0.300574 |
NC_000964 | Bacillus subtilis | 0.328008 | 0.167476 | 0.193845 | 0.310671 |
NC_003278 | Bacteriophage phi CTX | 0.190154 | 0.339068 | 0.308041 | 0.162737 |
NC_001318 | Borrelia burgdorferi | 0.395511 | 0.089595 | 0.11843 | 0.396465 |
NC_002528 | Buchnera sp. (APS) | 0.4282 | 0.077825 | 0.081557 | 0.412418 |
NC_002163 | Campylobacter jejuni | 0.396611 | 0.102881 | 0.1212 | 0.379308 |
NC_002696 | Caulobacter crescentus | 0.193431 | 0.30843 | 0.30567 | 0.192469 |
NC_002620 | Chlamydia muridarum | 0.321674 | 0.166279 | 0.177119 | 0.334928 |
NC_002179 | Chlamydia pneumoniae (AR39) | 0.333693 | 0.160117 | 0.160409 | 0.345781 |
NC_000117 | Chlamydia trachomatis | 0.313897 | 0.172219 | 0.183978 | 0.329906 |
NC_000922 | Chlamydophila pneumoniae (CWL029) | 0.328908 | 0.165686 | 0.167276 | 0.33813 |
NC_003030 | Clostridium acetobutylicum (ATCC824) | 0.382556 | 0.109831 | 0.153613 | 0.354 |
NC_003366 | Clostridium perfringens | 0.408966 | 0.080243 | 0.124114 | 0.386676 |
NC_001895 | Enterobacteria phage (P2) | 0.296663 | 0.220681 | 0.21143 | 0.271226 |
NC_002371 | Enterobacteria phage (P22) | 0.298211 | 0.217804 | 0.217362 | 0.266622 |
NC_000913 | Escherichia coli (K12) | 0.295326 | 0.205887 | 0.20227 | 0.296517 |
NC_002655 | Escherichia coli (O157:H7 EDL933) | 0.295658 | 0.204987 | 0.206437 | 0.292918 |
NC_002695 | Escherichia coli (O157:H7) | 0.295608 | 0.204037 | 0.20508 | 0.295275 |
NC_002142 | Escherichia coli (plasmid pB171) | 0.284904 | 0.217931 | 0.219671 | 0.277494 |
NC_002525 | Escherichia coli (plasmid R721) | 0.291134 | 0.209497 | 0.21193 | 0.287439 |
NC_003295 | Ralstonia solanacearum | 0.255556 | 0.246667 | 0.24 | 0.257778 |
NC_000907 | Haemophilus influenzae (Rd) | 0.353833 | 0.148793 | 0.153898 | 0.343475 |
NC_002607 | Halobacterium sp. (NRC-1) | 0.188925 | 0.313247 | 0.310605 | 0.187223 |
NC_000915 | Helicobacter pylori (26695) | 0.3593 | 0.137307 | 0.154089 | 0.349304 |
NC_000921 | Helicobacter pylori (J99) | 0.351909 | 0.148488 | 0.162548 | 0.337055 |
NC_002137 | Lactococcus cremoris (plasmid pNZ4000) | 0.355004 | 0.145747 | 0.173055 | 0.326193 |
NC_002662 | Lactococcus lactis | 0.378586 | 0.124909 | 0.160222 | 0.336283 |
NC_003212 | Listeria innocua (Clip11262) | 0.351615 | 0.142858 | 0.176998 | 0.328529 |
NC_003210 | Listeria monocytogenes (EGD) | 0.346104 | 0.147406 | 0.177076 | 0.329414 |
NC_002682 | Mesorhizobium loti (plasmid pMLb) | 0.214231 | 0.284752 | 0.291247 | 0.20977 |
NC_002678 | Mesorhizobium loti | 0.218228 | 0.284727 | 0.285867 | 0.211177 |
NC_000916 | Methanobacterium thermoautotrophicum (delta H) | 0.319494 | 0.18352 | 0.19296 | 0.304026 |
NC_000909 | Methanococcus jannaschii | 0.383845 | 0.11403 | 0.129283 | 0.372843 |
NC_003551 | Methanopyrus kandleri (AV19) | 0.200434 | 0.297155 | 0.307352 | 0.195059 |
NC_003552 | Methanosarcina acetivorans (C2A) | 0.333612 | 0.169457 | 0.174811 | 0.322119 |
NC_003901 | Methanosarcina mazei (Goe1) | 0.335876 | 0.164747 | 0.16769 | 0.331688 |
NC_002677 | Mycobacterium leprae (TN) | 0.187086 | 0.27649 | 0.321192 | 0.215232 |
NC_002755 | Mycobacterium tuberculosis (CDC1551) | 0.196068 | 0.302233 | 0.31504 | 0.186659 |
NC_000962 | Mycobacterium tuberculosis (H37Rv) | 0.196432 | 0.304403 | 0.313149 | 0.186016 |
NC_000908 | Mycoplasma genitalium | 0.367269 | 0.141328 | 0.14336 | 0.348044 |
NC_000912 | Mycoplasma pneumoniae | 0.336135 | 0.168198 | 0.168528 | 0.327139 |
NC_002771 | Mycoplasma pulmonis | 0.422614 | 0.08577 | 0.093849 | 0.397767 |
NC_003116 | Neisseria meningitidis (serogroup A strain Z2491) | 0.287045 | 0.224549 | 0.220981 | 0.267426 |
NC_003112 | Neisseria meningitidis (serogroup B strain MC58) | 0.293869 | 0.220158 | 0.215051 | 0.270922 |
NC_003272 | Nostoc sp. (PCC 7120) | 0.328164 | 0.176426 | 0.176844 | 0.318566 |
NC_002663 | Pasteurella multocida | 0.332998 | 0.163356 | 0.172938 | 0.330708 |
NC_002122 | Plasmid ColIb-P9 | 0.263065 | 0.244201 | 0.23187 | 0.260863 |
NC_002483 | Plasmid F | 0.293327 | 0.199083 | 0.210064 | 0.297526 |
NC_002134 | Plasmid R100 | 0.28753 | 0.217654 | 0.223334 | 0.271481 |
NC_002516 | Pseudomonas aeruginosa | 0.203574 | 0.30486 | 0.293744 | 0.197822 |
NC_003350 | Pseudomonas putida (plasmid pWW0) | 0.221135 | 0.283861 | 0.288276 | 0.206728 |
NC_003364 | Pyrobaculum aerophilum | 0.277496 | 0.220089 | 0.2264 | 0.276015 |
NC_003296 | Pyrobaculum aerophilum | 0.277496 | 0.220089 | 0.2264 | 0.276015 |
NC_000868 | Pyrococcus abyssi | 0.303426 | 0.190595 | 0.203458 | 0.302522 |
NC_003413 | Pyrococcus furiosus (DSM 3638) | 0.31836 | 0.175281 | 0.188468 | 0.317891 |
NC_000961 | Pyrococcus horikoshii | 0.318463 | 0.175196 | 0.189504 | 0.316837 |
NC_003103 | Rickettsia conorii (Malish 7) | 0.369363 | 0.138443 | 0.148575 | 0.343619 |
NC_000963 | Rickettsia prowazekii (Madrid E) | 0.385694 | 0.112105 | 0.120806 | 0.381395 |
NC_002638 | Salmonella Choleraesuis (50k virulence plasmid) | 0.269815 | 0.220568 | 0.224913 | 0.284704 |
NC_003384 | Salmonella Typhi (plasmid pHCM1) | 0.290802 | 0.202909 | 0.21092 | 0.295369 |
NC_003385 | Salmonella Typhi (plasmid pHCM2) | 0.301451 | 0.210195 | 0.20977 | 0.278584 |
NC_003198 | Salmonella Typhi | 0.289873 | 0.208486 | 0.210412 | 0.291228 |
NC_002305 | Salmonella typhi (plasmid R27) | 0.292833 | 0.200335 | 0.210706 | 0.296126 |
NC_003277 | Salmonella typhimurium (LT2 plasmid pSLT) | 0.250919 | 0.24901 | 0.251732 | 0.248339 |
NC_003197 | Salmonella typhimurium (LT2) | 0.290432 | 0.20943 | 0.209792 | 0.290345 |
NC_002698 | Shigella flexneri (virulence plasmid pWR501) | 0.299489 | 0.206613 | 0.214861 | 0.279037 |
NC_003047 | Sinorhizobium meliloti (1021) | 0.221659 | 0.279783 | 0.284867 | 0.213692 |
NC_003037 | Sinorhizobium meliloti (plasmid pSymA) | 0.22263 | 0.279295 | 0.28446 | 0.213615 |
NC_003078 | Sinorhizobium meliloti (plasmid pSymB) | 0.224171 | 0.279466 | 0.285155 | 0.211207 |
NC_002758 | Staphylococcus aureus (Mu50) | 0.373193 | 0.121018 | 0.151865 | 0.353924 |
NC_002745 | Staphylococcus aureus (N315) | 0.373372 | 0.1203 | 0.149601 | 0.356727 |
NC_003098 | Streptococcus pneumoniae (R6) | 0.350542 | 0.144933 | 0.177469 | 0.327057 |
NC_003028 | Streptococcus pneumoniae (R6) | 0.350542 | 0.144933 | 0.177469 | 0.327057 |
NC_003485 | Streptococcus pyogenes (MGAS8232) | 0.343958 | 0.15266 | 0.187757 | 0.315626 |
NC_002737 | Streptococcus pyogenes | 0.341701 | 0.154167 | 0.186625 | 0.317507 |
NC_003888 | Streptomyces coelicolor (A32) | 0.158916 | 0.349382 | 0.341986 | 0.149716 |
NC_003106 | Sulfolobus tokodaii | 0.359505 | 0.138665 | 0.142451 | 0.359379 |
NC_000911 | Synechocystis sp. (PCC 6803) | 0.288694 | 0.210883 | 0.206234 | 0.294189 |
NC_003869 | Thermoanaerobacter tengcongensis (MB4T) | 0.348676 | 0.140259 | 0.196223 | 0.314841 |
NC_002689 | Thermoplasma volcanium | 0.343795 | 0.157582 | 0.160852 | 0.337771 |
NC_000853 | Thermotoga maritima | 0.311903 | 0.17974 | 0.213007 | 0.29535 |
NC_000919 | Treponema pallidum | 0.222977 | 0.236436 | 0.296936 | 0.243651 |
NC_002488 | Xylella fastidiosa (9a5c) | 0.267285 | 0.227061 | 0.226993 | 0.278661 |
NC_003131 | Yersinia pestis plasmid (pCD1) | 0.30731 | 0.187034 | 0.196966 | 0.30869 |
NC_003134 | Yersinia pestis plasmid (pMT1) | 0.288638 | 0.219575 | 0.215738 | 0.27605 |
NC_003143 | Yersinia pestis strain (CO92) | 0.300678 | 0.196472 | 0.199165 | 0.303685 |
Bmelitensis | Brucella melitensis (chr I and II) | 0.25082 | 0.24512 | 0.253735 | 0.250325 |
Vcholerae | Vibrio cholerae (chr I and II) | 0.296128 | 0.200417 | 0.203164 | 0.300292 |
NC_004041 | Rhizobium etli (symbiotic plasmid p42d) | 0.227587 | 0.272952 | 0.277418 | 0.222043 |
NC_003450 | Corynebacterium glutamicum | 0.26862 | 0.230742 | 0.235683 | 0.264956 |
Dradiodurans | Deinococcus radiodurans (R1) | 0.191602 | 0.313635 | 0.300836 | 0.193927 |
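The single-nucleotide frequencies in the table can be reproduced in principle with a simple count over the intergenic sequences. The sketch below is an assumption about how such a count might look, not the code used to generate the table; the function name `single_nucleotide_frequency` is hypothetical. Pooling the sequences of several chromosomes, as was done for the three two-chromosome genomes, amounts to passing all of them in one list.

```python
from collections import Counter


def single_nucleotide_frequency(sequences):
    """Return the relative frequency of A, C, G and T over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Assumption: symbols other than A, C, G, T are ignored.
        counts.update(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A, C, G or T found in the input sequences")
    return {b: counts[b] / total for b in "ACGT"}
```

For a genome with two chromosomes, one would call it as `single_nucleotide_frequency(intergenics_chr1 + intergenics_chr2)`, where both arguments are lists of intergenic sequences.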
Statistics |
---|
Figure 1 compares the third-order transition probabilities of two pairs of genomes. In panel A, the third-order transition matrices of E. coli and S. typhimurium are compared. The relatedness of the two species is reflected in their almost identical transition probabilities: the dots scatter closely around the 1-1 line. Panel B shows the same plot for two species with different GC content: E. coli K12 (AT-rich) and S. coelicolor A32 (GC-rich). From these plots it is clear that exchanging the background models of E. coli and S. typhimurium will not deteriorate the results obtained by the Motif Sampler. Exchanging background models between organisms with strongly different base composition, however, will clearly increase the number of false positives.
Figure 1.A | Figure 1.B |
---|---|
Third-order transition probabilities: E. coli versus S. typhimurium | Third-order transition probabilities: E. coli K12 versus S. coelicolor A32 |
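A comparison of this kind pairs every entry of one transition matrix with the corresponding entry of the other. The sketch below assumes each matrix is stored as a dictionary `{context: {nucleotide: probability}}`; the helper names and the choice of the mean absolute deviation as a summary of the scatter around the 1-1 line are illustrative assumptions, not part of the original analysis.

```python
def paired_probabilities(matrix_a, matrix_b):
    """Return (p_a, p_b) pairs for every context and nucleotide."""
    return [
        (matrix_a[context][base], matrix_b[context][base])
        for context in sorted(matrix_a)
        for base in "ACGT"
    ]


def mean_abs_deviation(pairs):
    """Average vertical distance of the points from the 1-1 line."""
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


# Toy one-row matrices with made-up values, for illustration only.
model_a = {"AAA": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}}
model_b = {"AAA": {"A": 0.3, "C": 0.3, "G": 0.2, "T": 0.2}}
pairs = paired_probabilities(model_a, model_b)
deviation = mean_abs_deviation(pairs)
```

The pairs can be passed directly to a plotting library as x and y coordinates. A small deviation corresponds to points lying close to the 1-1 line.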
The tree in Figure 2 shows the relatedness between organisms based on their oligonucleotide frequencies of length 4 (third-order transition probabilities). To construct this tree, each genome was characterized by its transition probability vector of order 3. The Euclidean distance was used to calculate pairwise distances between all 106 probability vectors, and a tree was constructed using complete linkage clustering (TREECON; Van de Peer and De Wachter, 1994). For organisms or entries that cluster together in this tree, background models can safely be exchanged (e.g. the base composition of the E. coli strains and the Salmonella strains is almost identical). If, however, two organisms are far apart in the tree, the use of a distinct background model is essential (see also Figure 1.B). This information can be useful for, e.g., phylogenetic footprinting: when searching for a motif in the intergenic regions of orthologs from distinct organisms with the Motif Sampler, it is advisable to check the background distributions of these species.
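Figure 2 |
---|
Tree of the organisms, built by complete linkage clustering of the Euclidean distances between their third-order transition probability vectors. |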
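The distance and clustering steps can be illustrated with a naive pure-Python sketch. The tree in the text was built with TREECON; the code below is not that program but a simple stand-in that applies the same two ideas (Euclidean distance, complete linkage). The function names and the dictionary layout `{context: {nucleotide: probability}}` for a transition matrix are assumptions made for this example.

```python
import math
from itertools import combinations


def flatten(matrix):
    """Turn {context: {base: p}} into a vector with a fixed ordering."""
    return [matrix[c][b] for c in sorted(matrix) for b in "ACGT"]


def complete_linkage(vectors):
    """Naive complete-linkage clustering of {name: vector}.

    Returns the merge history as (cluster_a, cluster_b, distance)
    tuples, where each cluster is a tuple of names.
    """
    # Pairwise Euclidean distances between all input vectors.
    dist = {
        frozenset((a, b)): math.dist(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }

    def linkage(pair):
        # Complete linkage: distance between the two farthest members.
        x, y = pair
        return max(dist[frozenset((a, b))] for a in x for b in y)

    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=linkage)
        merges.append((x, y, linkage((x, y))))
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
    return merges
```

Entries that merge early, at a small distance, correspond to organisms whose background models are likely to be interchangeable. Entries that only join at the top of the tree call for distinct models. This naive version recomputes linkages at every step and is meant for small inputs such as the roughly one hundred vectors discussed here.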
References |
---|