SkewDB: A comprehensive database of GC and 10 other skews for over 28,000 chromosomes and plasmids
Data files
Oct 04, 2021 version files 5.36 GB
Abstract
GC skew denotes the relative excess of G nucleotides over C nucleotides on the leading versus the lagging replication strand of eubacteria. While the effect is small, typically around 2.5%, it is robust and pervasive. GC skew and the analogous TA skew are a localized deviation from Chargaff’s second parity rule, which states that G and C, and T and A occur with (mostly) equal frequency even within a strand.
Most bacteria also show the analogous TA skew. Different phyla show different kinds of skew and differing relations between TA and GC skew.
This article introduces an open access database (https://skewdb.org) of GC and 10 other skews for over 28,000 chromosomes and plasmids. Further details like codon bias, strand bias, strand lengths and taxonomic data are also included.
The SkewDB database can be used to generate or verify hypotheses. Since the origins of both the second parity rule, as well as GC skew itself, are not yet satisfactorily explained, such a database may enhance our understanding of microbial DNA.
Methods
The SkewDB analysis relies exclusively on the tens of thousands of FASTA and GFF3 files available through the NCBI download service, which covers both GenBank and RefSeq. The database includes bacteria, archaea and their plasmids. Furthermore, to ease analysis, the NCBI Taxonomy database is sourced and merged so output data can quickly be related to (super)phyla or specific species. No other data is used, which greatly simplifies processing. Data is read directly in the compressed format provided by NCBI.
All results are emitted as standard CSV files. In the first step of the analysis, for each organism the FASTA sequence and the GFF3 annotation file are parsed. Every chromosome in the FASTA file is traversed from beginning to end, while a running total is kept for cumulative GC and TA skew. In addition, within protein coding genes, such totals are also kept separately for these skews on the first, second and third codon position. Furthermore, separate totals are kept for regions which do not code for proteins. In addition, to enable strand bias measurements, a cumulative count is maintained of nucleotides that are part of a positive or negative sense gene. The counter is increased for positive sense nucleotides, decreased for negative sense nucleotides, and left alone for non-genic regions.
A separate counter is kept for non-genic nucleotides. Finally, G and C nucleotides are counted, regardless of if they are part of a gene or not. These running totals are emitted at 4096 nucleotide intervals, a resolution suitable for determining skews and shifts. In addition, one line summaries are stored for each chromosome. These line includes the RefSeq identifier of the chromosome, the full name mentioned in the FASTA file, plus counts of A, C, G and T nucleotides. Finally five levels of taxonomic data are stored.
Chromosomes and plasmids of fewer than 100 thousand nucleotides are ignored, as these are too noisy to model faithfully. Plasmids are clearly marked in the database, enabling researchers to focus on chromosomes if so desired. Fitting Once the genomes have been summarised at 4096-nucleotide resolution, the skews are fitted to a simple model. The fits are based on four parameters. Alpha1 and alpha2 denote the relative excess of G over C on the leading and lagging strands. If alpha1 is 0.046, this means that for every 1000 nucleotides on the leading strand, the cumulative count of G excess increases by 46. The third parameter is div and it describes how the chromosome is divided over leading and lagging strands. If this number is 0.557, the leading replication strand is modeled to make up 55.7% of the chromosome. The final parameter is shift (the dotted vertical line), and denotes the offset of the origin of replication compared to the DNA FASTA file. This parameter has no biological meaning of itself, and is an artifact of the DNA assembly process.
The goodness-of-fit number consists of the root mean squared error of the fit, divided by the absolute mean skew. This latter correction is made to not penalize good fits for bacteria showing significant skew. GC skew tends to be defined very strongly, and it is therefore used to pick the div and shift parameters of the DNA sequence, which are then kept as a fixed constraint for all the other skews, which might not be present as clearly. The fitting process itself is a downhill simplex method optimization over the three dimensions, seeded with the average observed skew over the whole genome, and assuming there is no shift, and that the leading and lagging strands are evenly distributed. The simplex optimization is tuned so that it takes sufficiently large steps so it can reach the optimum even if some initial assumptions are off.
Usage notes
This is an archival copy of the SkewDB.
Details of this database can be found on:
- https://skewdb.org/
- https://berthub.eu/articles/posts/skewdb-an-open-database-of-gc-and-other-microbial-skews/
- https://doi.org/10.1101/2021.09.09.459602 ("SkewDB: A comprehensive database of GC and 10 other skews for over 28,000 chromosomes and plasmids")
In this document you'll find an abstract with a high-level description, an explanation of data sources & regeneration details, followed by a per-field description of the files in the distribution.
Abstract
GC skew denotes the relative excess of G nucleotides over C nucleotides on the leading versus the lagging replication strand of eubacteria. While the effect is small, typically around 2.5%, it is robust and pervasive. GC skew and the analogous TA skew are a localized deviation from Chargaff's second parity rule, which states that G and C, and T and A occur with (mostly) equalfrequency even within a strand. Most bacteria also show the analogous TA skew.
Different phyla show different kinds of skew and differing relations between TA and GC skew. This article introduces an open access database (https://skewdb.org) of GC and 10 other skews for over 28,000 chromosomes and plasmids.
Further details like codon bias, strand bias, strand lengths and taxonomic data are also included. The SkewDB database can be used to generate or verify hypotheses. Since the origins of both the second parity rule, as well as GC skew itself, are not yet satisfactorily explained, such a database may enhance our understanding of microbial DNA.
Sources & Regeneration
As explained more fully in the preprint, all data is sourced from the NCBI genome repository. No further data is required. This whole database can be recreated using the open source Antonie software and the 'repro.sh' script from https://github.com/berthubert/skewdb-articles/blob/master/repro.sh
Contents
This distribution contains the following files:
- gcskewdb.csv: one line per DNA sequence, containing a high-level description of skews and biases, plus phylogenetic data
- skplot.csv: raw skew data for all DNA sequences, at 4096-nucleotide intervals
- ...fit.csv: one file per DNA sequence, containing the data from skplot.csv, but also plotted fits of all the skews
gcskewdb has the following defined fields:
name Name of DNA sequence (symbolic, like NC_123234.1) fullname Full name of sequence, often including strain. Sourced from FASTA a/c/g/tcount Number of 'A/C/G/T' nucleotides in sequence plasmid Set to 1 if this is a plasmid realm1/2/3/4/5 Phylogenetic information at 5 levels protgenecount Total nucleotides found in coding regions stopTAG Number of stop codons that are TAG stopTAA Number of stop codons that are TAA stopTGA Number of stop codons that are TGA stopXXX Number of stop codons that are something else startATG Number of start codons that are ATG startGTG Number of start codons that are GTG startTTG Number of start codons that are TTG startXXX Number of start codons that are something else dnaApos Locus of the dnaA gene in the DNA sequence, -1 if not found dnaAsense Sense of the dnaA gene siz Size of DNA sequence in nucleotides gccount Equal to gcount+ccount ngcount Number of nucleotides outside of protein coding regions a/c/g/tcounts2 Number of A/C/G/T nucleotides in the final codon position alpha1gc GC excess ratio per nucleotide, leading strand alpha2gc CG excess ratio per nucleotide, lagging strand shift Position in DNA sequence there the leading strand starts div Relative length of the leading strand versus genome length alpha1/2ta AT/TA excess ratio per nucleotide, leading/lagging trand alpha1/2sb Excess ratio of coding nucleotides, leading/lagging strand alpha1gc0/1/2 Excess ratio of GC on 1st, 2nd, 3rd codon positions, leading strand alpha2gc0/1/2 Excess ratio of CG on 1st, 2nd, 3rd codon positions, lagging strand alpha1ta0/1/2 Excess ratio of TA on 1st, 2nd, 3rd codon positions, leading strand alpha2ta0/2/2 Excess ratio of AT on 1st, 2nd, 3rd codon positions, lagging strand alpha1gcNG Excess ratio of GC on non-protein coding nucleotides, leading strand alpha2gcNG Excess ratio of CG on non-protein coding nucleotides, lagging strand alpha1taNG Excess ratio of TA on non-protein coding nucleotides, lagging strand alpha2taNG Excess ratio of AT on non-protein coding nucleotides, lagging strand rmsGC,TA,SB Root mean squared error of fits rmsGC0/1/2 Root mean squared error of fits rmsTA0/1/2 Root mean squared error of fits rmsGC/TANG Root mean squared error of fits gccontent GC% of DNA sequence - equal to (gcount+ccount)/siz a/c/g/tfrac Fraction of nucleotides that are A, C, G or T leada/c/g/tfrac Fraction of leading strand coding nucleotides that are A, C, G or T laga/c/g/tfrac Fraction of lagging strand coding nucleotides that are A, C, G or T
For historical reasons some other fields are also present, these should not be used until they are defined here.
The raw, unmodelled, skews are available in skplot.csv, at 4096 nucleotide resolution, with the following fields:
name Name of this DNA sequence relpos Relative position in sequence abspos Absolute position in sequence gc/taskew Cumulative GC/TA skews, in raw nucleotides gcskew0/1/2 Cumulative GC skew on 1st, 2nd and 3rd codon positions of coding nucleotides gcskewNG Cumulative GC skew on non-protein coding nucleotides taskew0/1/2 Cumulative TA skew on 1st, 2nd and 3rd codon positions of coding nucleotides taskewNG Cumulative TA skew on non-protein coding nucleotides pospos Cumulative excess of positive sense genes (for strand bias) gccount Cumulative count of GC nucleotides ngcount Cumulative count of non-protein coding nucleotires a/c/g/tcounts0 Counts of A/C/G/T nucleotides on first codon position of protein coding nucleotides a/c/g/tcounts1 Counts of A/C/G/T nucleotides on second codon position of protein coding nucleotides a/c/g/tcounts2 Counts of A/C/G/T nucleotides on third codon position of protein coding nucleotides
Per DNA sequence, there is a fit.csv file. Its name corresponds to the 'name' field in gcskewdb.csv file. The .fit csv files contain the following fields:
pos Position in genome (as relative to the FASTA). Data is provided at 4096 nucleotide intervals. gcskew Cumulative GC skew up to this point predgcskew Predicted cumulative GC skew based on the fit, up to this point taskew Same, but for TA predtaskew Same, but for TA sbskew Same, but for Strand Bias predsbskew Same, but for Strand Bias gc0/1/2skew Same but for GC skew on 0/1/2 codon position predgc0/1/2skew Same but for GC skew on 0/1/2 codon position ta0/1/2skew Same but for TA skew on 0/1/2 codon position predta0/1/2skew Same but for TA skew on 0/1/2 codon position gcNGskew Same but for GC skew non non-protein coding nucleotides predgcNGskew Same but for GC skew non non-protein coding nucleotides taNGskew Same but for TA skew non non-protein coding nucleotides predtaNGskew Same but for TA skew non non-protein coding nucleotides predleading Set to 1 if this position (locus) is modelled to be on the leading strand