Dual randomly barcoded transposon sequencing (Dual Tn-seq) data for Streptococcus pneumoniae D39
Data files
Sep 17, 2025 version files 21.52 GB
- README.md (6.20 KB)
- run1_filtered.tsv.gz (1.65 GB)
- run1_genepairs_min6.tsv.gz (29.60 MB)
- run2_filtered.tsv.gz (4.18 GB)
- run2_genepairs_min3.tsv.gz (32.18 MB)
- run3_filtered.tsv.gz (4.41 GB)
- run3_genepairs_min3.tsv.gz (32.98 MB)
- run4_filtered.tsv.gz (5.69 GB)
- run4_genepairs_min4.tsv.gz (32.08 MB)
- run5_filtered.tsv.gz (5.09 GB)
- run5_genepairs_min4.tsv.gz (32.45 MB)
- small.zip (102 MB)
- table_S1_dual_tnseq_dataset_full.xlsx (170.44 MB)
- tableS1.tsv.gz (67.30 MB)
Abstract
Gene redundancy complicates systematic characterization of gene function, as single-gene deletions may not produce discernible phenotypes. Dual-TnSeq (dual transposon sequencing) couples random barcode transposon site sequencing with the Cre-lox system to enable the characterization of >1 billion double mutant strains. This data set reports Dual-TnSeq data for Streptococcus pneumoniae D39 in rich media. Large libraries of mutants with two different transposons, carrying two different resistance markers and each with their own barcodes, were constructed and sequenced. Then, two libraries of mutants were combined by transformation and selected on rich media; the Cre-lox system was induced to place the two barcodes into proximity; and the pairs of barcodes were sequenced. Pairs of genes with low rates of dual insertions often indicate synthetic lethality or a milder genetic interaction. The genetic interactions identified span a wide range of biochemical processes and reveal new factors in well-studied pathways, including a novel cytidine triphosphate synthase and an activator of cell wall biosynthesis.
https://doi.org/10.5061/dryad.7d7wm3840
Description of the data and file structure
The smaller files are all included within small.zip.
The table of adjusted statistics per pair of genes, when considering insertions in the central 10-90% of each gene, is in genepair_stats.tsv.gz. Note that these are unordered pairs (after combining the results for locusId1 disrupted in ML1 or ML3 and locusId2 disrupted in ML2, and vice versa), and are always shown with locusId1 < locusId2. Besides the two locus identifiers, the table reports the number of mutant strains for this pair of genes (nStrains), how many total reads these strains had (nReads), how many strains and reads were expected in the absence of a genetic interaction after adjusting for chromosomal location (expectStrainsAdj and expectReadsAdj), and the final summary statistics for each pair of genes (zStrains and readRatio). We consider pairs with zStrains <= -3 and readRatio <= 0.2 to be medium-confidence genetic interactions, and medium-confidence pairs with either zStrains <= -4 or readRatio <= 0.05 to be strong genetic interactions.
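The confidence thresholds above can be applied to each row of genepair_stats.tsv.gz with a small helper. This is a minimal sketch, not code from the dataset: the function name `classify_pair` and the label strings are hypothetical, and it returns a single label per pair even though strong interactions are, by definition, a subset of the medium-confidence ones.

```python
def classify_pair(z_strains: float, read_ratio: float) -> str:
    """Label a gene pair using the thresholds stated in this README.

    The labels ("none", "medium", "strong") are illustrative names only.
    """
    # Medium confidence requires BOTH criteria.
    is_medium = z_strains <= -3 and read_ratio <= 0.2
    if not is_medium:
        return "none"
    # Strong = medium-confidence pair that also meets EITHER stricter cutoff.
    if z_strains <= -4 or read_ratio <= 0.05:
        return "strong"
    return "medium"


# Example values (made up for illustration, not taken from the data).
labels = [
    classify_pair(-3.5, 0.10),  # medium: passes both basic cutoffs only
    classify_pair(-4.2, 0.15),  # strong: zStrains <= -4
    classify_pair(-5.0, 0.50),  # none: readRatio is too high
]
```

Note that a very low zStrains alone is not enough: a pair with readRatio above 0.2 is not called an interaction under these rules.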
Genome and annotation of Streptococcus pneumoniae D39: the genome sequence is in genome.fna (fasta format). genes.tab and genes.GC are tab-delimited tables of genes (the .GC file has an additional field for the GC content). aaseq has the protein sequences (fasta format).
The RB-TnSeq mappings for ML1, ML2, and ML3 are in ML1comb, ML2comb, and ML3comb. In these tables, barcode is the barcode sequenced by a TnSeq-like protocol, rcbarcode is its reverse complement, n is the number of reads for the barcode mapping to its primary location, nTot is the total number of reads for the barcode, scaffold/strand/pos describe the primary location (that is, the insertion location that we believe the barcode corresponds to), n2 is how many reads correspond to the secondary location (if any; a few reads at other locations may be due to chimeric PCR during TnSeq), scaffold2/strand2/pos2 describe the secondary location, and nPastEnd reports how many reads associate the barcode with intact vector instead of with an insertion in the genome. A tiny fraction of insertions have scaffold = "pastEnd", which means that the reads for this barcode continue from the barcode into the vector instead of into the genome.
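To make the relationship between the barcode and rcbarcode fields concrete, here is a minimal sketch of a reverse-complement helper. The function name is hypothetical and the example sequence is made up; it is not a barcode from the libraries.

```python
# Translation table for DNA bases; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Made-up 20-nt example, the same length as the random barcodes.
barcode = "AACGTTGCAAGGCTTACGAT"
rcbarcode = reverse_complement(barcode)
```

Applying the function twice returns the original sequence, which is a quick sanity check when matching barcodes read from opposite strands.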
The *.withgenes files (i.e., ML1comb.withgenes, etc.) have additional fields for which gene (locusId) the insertion lies within (if any) and what fraction of the way through the gene the insertion lies (the f field). See the *.stats files for library metrics.
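The f field is what the 10-90% analysis relies on. The sketch below shows one way to keep only insertions in the central part of a gene. The row layout (dictionaries keyed by barcode, locusId, and f), the function name, and the inclusive bounds are assumptions made for illustration.

```python
def central_insertions(rows, lo=0.1, hi=0.9):
    """Keep insertions that lie within a gene and within [lo, hi] of its length.

    Rows without a gene (empty locusId or f of None) are dropped.
    Whether the bounds are inclusive in the original analysis is an assumption.
    """
    return [
        r for r in rows
        if r["locusId"] and r["f"] is not None and lo <= r["f"] <= hi
    ]


# Made-up rows for illustration.
insertions = [
    {"barcode": "bc1", "locusId": "geneA", "f": 0.05},  # too close to the start
    {"barcode": "bc2", "locusId": "geneA", "f": 0.50},  # central: kept
    {"barcode": "bc3", "locusId": "", "f": None},       # intergenic: dropped
    {"barcode": "bc4", "locusId": "geneB", "f": 0.95},  # too close to the end
]
kept = central_insertions(insertions)
```

Passing `lo=0.0, hi=1.0` would correspond to using all insertions within each gene (the 0-100% analysis).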
For unique protein-coding genes (of at least 100 nt), a prediction as to whether they are essential or not is in esstable. Genes are assigned as essential (ess=TRUE) if they are unique (do not have a nearly-identical duplicate), the rate of TnSeq reads per nt is under 20% of the median, and the rate of insertion locations per nt is under 20% of the median.
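The essentiality rule above can be sketched as follows. This is an illustration only: the input layout, the function name, and the choice to take the medians over the eligible (unique, at least 100 nt) genes are assumptions, since the README does not say which gene set the medians are computed over.

```python
from statistics import median


def call_essential(genes, frac=0.2, min_len=100):
    """Return {locusId: ess} for unique genes of at least min_len nt.

    A gene is called essential if both its reads per nt and its insertion
    locations per nt are under `frac` of the respective medians.
    Assumption: medians are computed over the eligible genes only.
    """
    eligible = [g for g in genes if g["unique"] and g["length_nt"] >= min_len]
    med_reads = median(g["reads"] / g["length_nt"] for g in eligible)
    med_sites = median(g["sites"] / g["length_nt"] for g in eligible)
    return {
        g["locusId"]: (
            g["reads"] / g["length_nt"] < frac * med_reads
            and g["sites"] / g["length_nt"] < frac * med_sites
        )
        for g in eligible
    }


# Made-up genes for illustration.
genes = [
    {"locusId": "A", "length_nt": 1000, "reads": 1000, "sites": 100, "unique": True},
    {"locusId": "B", "length_nt": 1000, "reads": 2000, "sites": 200, "unique": True},
    {"locusId": "C", "length_nt": 1000, "reads": 10, "sites": 2, "unique": True},
    {"locusId": "D", "length_nt": 90, "reads": 0, "sites": 0, "unique": True},  # too short
]
ess = call_essential(genes)
```

In this toy example only gene C falls below 20% of both medians; gene D is not scored at all because it is shorter than 100 nt.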
Models for mapping the libraries are in modelRBlox.txt -- the first line has the expected structure of the read, and the second has the continuation of the read if it is from intact vector, instead of from an insertion in the genome. Each library was mapped 5-6 times and for each library, 2 of the runs used the "TnSeq3" protocol. For those mapping files, see modelTnSeq3RBlox*.
The results of the chimera test for Dual-TnSeq are in test*_pairs.tsv, i.e. test300_pairs.tsv for 300 ng, test500_pairs.tsv for 500 ng, and test1000_pairs.tsv for 1000 ng. These tables show the number of reads (n) for each pair of barcodes.
The raw results for each big Dual-TnSeq run are provided as separate files: a table of all pairs of mapped insertions as well as a table of #strains and #reads per pair of genes (with rare potentially-chimeric pairs excluded from the latter table). For example, for run1, run1_filtered.tsv.gz has all pairs of mapped insertions, with fields for the two barcodes, the locations of the two insertions, and the number of reads for that pair of barcodes (n). run1_genepairs_min6.tsv.gz tabulates the results for each pair of genes for the 10-90% analysis, with barcode pairs having fewer than 6 reads ignored. In that table, nStrains and nReads show the number of strains and reads for that pair of genes with the first library (ML1 or ML3) having an insertion in locusId1 and the second library (ML2) having an insertion in locusId2; totStrains1 is the total number of strains with the first library having an insertion in locusId1; and similarly for totReads1, totStrains2, and totReads2. revStrains and revReads give the number of strains or reads that have an insertion in locusId1 in the second library (ML2) and an insertion in locusId2 in the first library (ML1 or ML3).
Similarly, results for run2 are in run2_filtered.tsv.gz and run2_genepairs_min3.tsv.gz; results for run3 are in run3_filtered.tsv.gz and run3_genepairs_min3.tsv.gz; results for run4 are in run4_filtered.tsv.gz and run4_genepairs_min4.tsv.gz; and results for run5 are in run5_filtered.tsv.gz and run5_genepairs_min4.tsv.gz.
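The step from a table of barcode pairs to per-gene-pair counts can be sketched as below. This is a simplified illustration, not a port of bpFilter.pl or byGenePairs.pl: the function name and the input layout (one tuple of locusId1, locusId2, n per barcode pair) are hypothetical, and the real scripts do more, such as restricting to the central 10-90% of each gene.

```python
from collections import defaultdict


def tabulate_gene_pairs(pairs, min_reads):
    """Count strains and reads per (locusId1, locusId2).

    pairs: iterable of (locusId1, locusId2, n), one entry per barcode pair,
           where n is the number of reads for that pair.
    Barcode pairs with fewer than min_reads reads are ignored.
    Returns {(locusId1, locusId2): (nStrains, nReads)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for locus1, locus2, n in pairs:
        if n < min_reads:
            continue  # too rare; may be a chimera
        entry = counts[(locus1, locus2)]
        entry[0] += 1  # each surviving barcode pair is one strain
        entry[1] += n
    return {key: tuple(value) for key, value in counts.items()}


# Made-up barcode pairs for illustration.
barcode_pairs = [
    ("geneA", "geneB", 10),
    ("geneA", "geneB", 7),
    ("geneA", "geneB", 2),  # dropped when min_reads is 6
    ("geneA", "geneC", 6),
]
table = tabulate_gene_pairs(barcode_pairs, min_reads=6)
```

The `min_reads` argument plays the role of the cutoff encoded in the file names (min6 for run1, min3 for run2 and run3, min4 for run4 and run5).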
The final scores for each pair of genes, when considering either the central 10-90% of each gene or all insertions within each gene (column names ending in "0-100%"), are provided in table_S1_dual_tnseq_dataset_full.xlsx. The same information is also available in a gzipped tab-delimited format (tableS1.tsv.gz). The 10-90% results are the same as in genepair_stats.tsv.gz.
Code/software
The RB-TnSeq mapping was conducted using the MapTnSeq.pl and DesignRandomPool.pl scripts from the FEBA code base (https://bitbucket.org/berkeleylab/feba/).
Scripts to analyze barcode pairs: the first step uses barcodePairs.pl from the FEBA code base. The next steps, bpFilter.pl and byGenePairs.pl, are included in small.zip, along with a Perl library that they use, pbutils.pm.
Combining the runs and correcting for chromosomal position is implemented in R; see dblStats.R (in small.zip).
First, three RB-TnSeq (randomly barcoded transposon sequencing) libraries were constructed (named ML1, ML2, and ML3), and in each library of single mutants, the random 20-nucleotide barcodes were mapped to insert locations. Then, five runs of large collections of double mutants were made by transformation (either ML1 x ML2 or ML3 x ML2) and selected on blood agar plates with both antibiotics. The Cre-lox system was induced to move the two barcodes into proximity, and then the pairs of barcodes were amplified and sequenced with Illumina.
Barcode pairs were counted, and very rare pairs of barcodes were discarded, as these may be chimeras. Strains where both insertions were within the central 10-90% of a gene were tabulated to give the number of strains and the number of reads per gene pair. These were combined across runs (weighting the #reads lower for the first run, which had fewer strains).
We computed the expected #strains and #reads for each gene pair, based on the relative abundance of each gene in each library (i.e., expected is proportionate to count for gene 1 in library 1 x count for gene 2 in library 2). We then combined counts and expected for the two "directions" for each gene pair. Then, we examined the bias due to the chromosomal position of the two genes: we divided the genome into 30 bins and computed, for each pair of bins, the median #reads/expected and #strains/expected. The final expected #reads and #strains were scaled by these biases. Finally, for each pair of genes with sufficient coverage, we computed the read ratio (#reads / expected) and a z score for the number of strains, (#strains - expected)/sqrt(expected).
A similar analysis was also conducted using all insertions in each gene (0-100%) instead of 10-90%. The latter analysis increases the number of usable insertions but may include some non-disrupting insertions.
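The expected-count and scoring steps described above can be illustrated with a short sketch. It is not a translation of dblStats.R: all function names are hypothetical, the normalization of the expected count (scaling the product of relative abundances by the total number of double mutants) is an assumption, since the text only says that expected is proportionate to the product, and equal-width position bins are also an assumption.

```python
import math


def expected_count(count1, total1, count2, total2, n_double):
    """Expected count for a gene pair in the absence of an interaction.

    Proportional to (gene 1 abundance in library 1) x (gene 2 abundance in
    library 2). Scaling by n_double, the total number of double mutants,
    is an assumed normalization.
    """
    return n_double * (count1 / total1) * (count2 / total2)


def position_bin(pos, genome_length, n_bins=30):
    """Equal-width chromosomal bin (0-based) for a position (assumed scheme)."""
    return min(n_bins - 1, pos * n_bins // genome_length)


def adjust_expected(expected, bin_bias):
    """Scale an expected count by the median observed/expected of its bin pair."""
    return expected * bin_bias


def z_strains(n_strains, expected_strains):
    """z score for the number of strains: (#strains - expected)/sqrt(expected)."""
    return (n_strains - expected_strains) / math.sqrt(expected_strains)


def read_ratio(n_reads, expected_reads):
    """Read ratio: #reads / expected."""
    return n_reads / expected_reads


# Made-up numbers for illustration only.
exp_strains = expected_count(50, 1000, 20, 1000, 100_000)  # about 100 strains
z = z_strains(60, adjust_expected(exp_strains, 1.0))       # about -4
ratio = read_ratio(30, 200)                                 # 0.15
```

With these toy values, the pair has about 100 expected strains but only 60 observed, giving a z score near -4 and a read ratio of 0.15.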
We also generated test sets of the same kind, using a collection of 96 double mutants. After inducing Cre-lox and extracting DNA, we conducted PCRs with varying levels of template (300 - 1000 ng). These tests confirmed that rare strains were chimeras.
