Streptococcus pneumoniae is a common nasopharyngeal commensal bacterium and an important human pathogen. Vaccines against a subset of pneumococcal antigenic diversity have reduced rates of disease, without changing the frequency of asymptomatic carriage, by altering the bacterial population structure. These changes can be studied in detail by using genome sequencing to characterise systematically sampled collections of carried S. pneumoniae. This dataset consists of 616 annotated draft genomes of isolates collected from children during routine visits to primary care physicians in Massachusetts between 2001, shortly after the seven-valent polysaccharide conjugate vaccine was introduced, and 2007. Also made available are a core genome alignment and phylogeny describing the overall population structure, clusters of orthologous protein sequences, software for inferring serotype from Illumina reads, and whole genome alignments for the analysis of closely related sets of pneumococci. These data can be used to study both bacterial evolution and the epidemiology of a pathogen population under selection from vaccine-induced immunity.
Maximum likelihood phylogeny based on the core genome alignment of 616 Streptococcus pneumoniae isolates
Newick-format maximum likelihood phylogeny generated using the 106,196 polymorphic sites in a core genome alignment of 616 Streptococcus pneumoniae isolates. The tree was produced by RAxML using the general time-reversible substitution model with a four-category gamma distribution to correct for rate heterogeneity.
SPARC.core_genes.tree
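As a quick sanity check on the Newick file, the taxon labels can be pulled out without a phylogenetics library. The following is a minimal sketch, not part of the dataset: it assumes the labels are unquoted and contain no whitespace, parentheses, commas, colons, or semicolons, and the function names are hypothetical.

```python
import re

# Assumption: leaf labels are unquoted. A leaf label is any token that
# directly follows "(" or ","; labels or support values that follow ")"
# belong to internal nodes and are therefore skipped.
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_names(newick):
    """Return the leaf labels of a Newick string in order of appearance."""
    return LEAF_PATTERN.findall(newick)


def read_leaf_names(path):
    """Read a Newick file and return its leaf labels."""
    with open(path) as handle:
        return leaf_names(handle.read())
```

If the file holds a single tree of all isolates, `len(read_leaf_names("SPARC.core_genes.tree"))` would be expected to match the number of isolates in the collection.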
Core genome codon alignment of 616 Streptococcus pneumoniae isolates
FASTA-format 1.14 Mb codon alignment generated through concatenation of individual alignments of the 1,194 coding sequences found to be present in a single copy in each of 616 Streptococcus pneumoniae isolates sampled from Massachusetts between 2001 and 2007. The file is compressed using tar and gzip.
SPARC.core_genes.aln.tar.gz
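The phylogeny listed above was built from the polymorphic sites of this alignment. One way to identify such sites in an extracted FASTA alignment is sketched below. The definition used here (a column with more than one distinct unambiguous base, ignoring gaps and ambiguity codes) is an assumption and may differ from the one used to obtain the published count; the function names are hypothetical.

```python
def parse_fasta(lines):
    """Parse FASTA lines into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def count_polymorphic_sites(sequences):
    """Count columns holding more than one distinct base among A, C, G, T.

    Assumption: gaps and ambiguity codes are ignored rather than counted
    as an extra state.
    """
    bases = set("ACGT")
    count = 0
    # zip(*...) walks the alignment column by column.
    for column in zip(*sequences.values()):
        observed = {base.upper() for base in column} & bases
        if len(observed) > 1:
            count += 1
    return count
```

For an alignment of this size, holding every sequence in memory as a Python string is workable but slow; a column-wise array representation would be a reasonable alternative.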
Whole genome alignments of 15 sequence clusters of similar isolates
Fifteen FASTA-format whole genome alignments, each corresponding to one of the monophyletic sequence clusters identified through population clustering and phylogenetics as described in Croucher et al. (2013) Nat. Genet. 45:656-663. Alignments were generated by mapping paired Illumina reads to a reference sequence, itself omitted from the alignment, using SMALT. The files are compressed as a single archive using tar and gzip.
SPARC.sequenceClusters.aln.tar.gz
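Because the alignments are bundled in one tar and gzip archive, they can be inspected without unpacking it to disk. The sketch below reports the length of every sequence in every archive member, which makes it easy to confirm that each alignment is rectangular. It assumes each regular file in the archive is an ASCII FASTA file; the function name is hypothetical.

```python
import io
import tarfile


def sequence_lengths_in_archive(archive_path):
    """Return {member name: {sequence name: length}} for a tar.gz archive.

    Assumption: every regular file in the archive is an ASCII FASTA file.
    """
    summary = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            raw = archive.extractfile(member)
            lengths = {}
            name = None
            for line in io.TextIOWrapper(raw, encoding="ascii"):
                line = line.strip()
                if line.startswith(">"):
                    tokens = line[1:].split()
                    name = tokens[0] if tokens else ""
                    lengths[name] = 0
                elif name is not None:
                    lengths[name] += len(line)
            summary[member.name] = lengths
    return summary
```

Within any one member, all reported lengths should be identical if the file is a valid alignment.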
Software and reference sequence for inferring serotype from Illumina sequence data
This software is a simple script that uses BWA mapping to identify the likely serogroup of a pneumococcal isolate from paired-end FASTQ data.
pneumococcalSerotyper.tar.gz
Predicted protein coding sequences from 616 S. pneumoniae isolates
This compressed archive comprises a FASTA file containing the DNA sequences of all predicted protein coding sequences from 616 S. pneumoniae isolates collected from Massachusetts between 2001 and 2007. Each sequence is labelled with a unique identifier (of the form “ERSX_Y”, where “ERSX” is the sample accession code in the European Nucleotide Archive and Y is an incrementing index) and, where applicable, the COG of the translated protein (of the form “SPARC1_CLSZ” or “SPARC1_CLSTZ”, where Z is a number).
SPARC_CDS_dna_sequences.fasta.tar.gz
Predicted protein sequences from 616 S. pneumoniae isolates
This compressed archive comprises a FASTA file containing the amino acid sequences translated from all predicted protein coding sequences from 616 S. pneumoniae isolates collected from Massachusetts between 2001 and 2007. Each sequence is labelled with a unique identifier (of the form “ERSX_Y”, where “ERSX” is the sample accession code in the European Nucleotide Archive and Y is an incrementing index) and, where applicable, the COG to which the protein belongs (of the form “SPARC1_CLSZ” or “SPARC1_CLSTZ”, where Z is a number).
SPARC_CDS_protein_sequences.fasta.tar.gz
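The DNA and protein sequence archives share the labelling scheme described above, so one helper can split a label into its sample accession, index, and COG. This is a sketch under assumptions: the exact layout of the FASTA header line is not specified here, so the patterns simply search the whole header for the two documented forms, and the sample headers used to exercise it are invented for illustration.

```python
import re

# "ERSX_Y": ENA sample accession followed by an incrementing index.
IDENTIFIER_PATTERN = re.compile(r"(ERS\d+)_(\d+)")
# "SPARC1_CLSZ" or "SPARC1_CLSTZ", where Z is a number.
COG_PATTERN = re.compile(r"SPARC1_CLST?\d+")


def parse_label(header):
    """Split a sequence label into (sample accession, index, COG or None).

    Assumption: the identifier and optional COG both appear somewhere in
    the header line; the separator between them is not relied upon.
    """
    identifier = IDENTIFIER_PATTERN.search(header)
    if identifier is None:
        raise ValueError("no ERSX_Y identifier in header: %r" % header)
    cog = COG_PATTERN.search(header)
    return (
        identifier.group(1),
        int(identifier.group(2)),
        cog.group(0) if cog else None,
    )
```

Grouping the parsed labels by their COG value gives, for each cluster of orthologous sequences, the set of samples in which it was found.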
Draft reference genome sequences for each of the 15 sequence clusters
This compressed archive contains 15 draft de novo assemblies in FASTA format, which were used to generate the whole genome alignments within each sequence cluster. The files are named according to the sequence cluster and taxon identifier of the isolate to which the contigs relate.
SPARC_reference_sequences.tar.gz