Data from: Calibrating and documenting host-switching and evolution of incompatibility loci for two closely related Wolbachia clades

Data files

Apr 06, 2026 version files (194.61 MB total)

README.md (14.71 KB)

wMel_like_dryad_package.tar.gz (194.59 MB)

Abstract

Maternally inherited Wolbachia alphaproteobacteria are the most common arthropod endosymbionts. Wolbachia often spread to high frequencies through cytoplasmic incompatibility, in which cif loci act through sperm to kill embryos lacking Wolbachia. Closely related Wolbachia with diverse cif loci often associate with anciently diverged hosts, but the timescale of these associations remains uncertain. We produce new calibrations based on filarial nematodes with vertically inherited Wolbachia that codiverge with their hosts. Applying these calibrations to Wolbachia variants closely related to pathogen-blocking wMel from Drosophila melanogaster, we demonstrate that over a timescale of 1-2 million years, a core set of single-copy Wolbachia loci evolve largely through bifurcation rather than by gene exchange with distant Wolbachia. Dating bifurcating core genomes, we show that "wMel-like" Wolbachia that diverged 2.1 × 10^5 to 2.4 × 10^6 years ago inhabit dipteran and hymenopteran hosts that diverged more than 10^8 years ago. A previously published analysis of variants related to wRi from D. simulans, the first Wolbachia found in a drosophilid, concluded that "wRi-like" Wolbachia spread among different Drosophila in tens of thousands of years. However, our new calibrations suggest that these estimates, which relied on a mutation-based calibration, underestimated the timescale of wRi-like spread by about a factor of seven. In addition, cif exchanges between wMel-like and wRi-like Wolbachia genomes have occurred over ∼10^4 to 10^6 years. Comparing intact cif loci found in various Wolbachia, we find function-preserving selection in their evolution. We discuss these results in light of theoretical predictions concerning selection on cytoplasmic incompatibility phenotypes within and among host lineages. The wMel variants analyzed may offer new options for Wolbachia-based biocontrol efforts.

Dataset DOI: 10.5061/dryad.g1jwstr6j

Description of the data and file structure

This data archive contains the scripts, phylogenetic gene sets, cif and serine recombinase sequences, and AlphaFold results used for the analyses in Shropshire et al. (2026, in press).

Files and variables

File: wMel_like_dryad_package.tar.gz

This repository contains the data related to Shropshire et al. (2026, in press), also available as a preprint: https://www.biorxiv.org/content/10.64898/2026.02.13.705778v1. The associated phylogenetic and gene extraction scripts and pipelines are included here and are also available in the GitHub repository: https://github.com/brandonscooper/calibrating-wolbachia. A description of the included directories and files follows.

phylogenetics/ #Contains the genes and scripts used for phylogenetic reconstruction.
├─README.txt
├─scripts/ #Contains the RevBayes scripts for phylogram and relative chronogram inference, and a script to convert those relative chronograms to absolute given a calibration.
├──README.md
├──snp_count.py #Python script that counts the number of pairwise differences between all possible pairs in an aligned FASTA file.
├──relative_to_absolute_conversion_script/
├───relative_to_absolute_chronogram.py #Python script that converts relative chronograms to absolute given a calibration
├───wMel_like_example.log
├───wMel_like_example.trees #These two files are a test case for the above python script.
├──revbayes_scripts/
├───phylogram.Rev #RevBayes script used to generate the phylograms in the manuscript.
├───relaxed_clock_7_7_relative_chronogram.Rev #RevBayes script used to generate the chronograms in the manuscript.
├─D_incompta_genome/
├──D_incompta.fa #FASTA file containing the low-quality Drosophila incompta nuclear genome. As sequencing depth was insufficient for de novo assembly, we aligned to the D. virilis reference.
├─wolbachia/
├──README.txt
├──linked_cif_serine_recombinase/ #Contains the sequence data used to generate the phylogenetic trees of cifA-T1 and serine recombinase genes associated with each other
├───cif/
├────cifA_T1.fa #FASTA file containing the sequence data for linked cifA-T1 genes. Phylograms and chronograms can be generated from this data by running the command 'revbayes phylogram.Rev' or 'revbayes relaxed_clock_7_7_relative_chronogram.Rev' in this directory. The referenced scripts are present in the phylogenetics/scripts/revbayes_scripts folder. We used RevBayes 1.1.1 for this manuscript.
├────partitions.txt #Read by the RevBayes scripts when generating trees with this data.
├───serine/
├────sr3WO_-_WOMelB.fa #FASTA file containing the sequence data for linked serine recombinase genes. Phylograms and chronograms can be generated from this data the same way as the above cifA-T1 genes.
├────partitions.txt #Read by the RevBayes scripts when generating trees with this data.
├──wmel_like/ #Contains the sequence data used to generate the phylogenetic trees for the 20 wMel-like Wolbachia.
├───wmel_like.fa #FASTA file containing the concatenated single-copy genes.
├───wmel_like_1.fasta #As the above file, but split by codon position. This is position 1.
├───wmel_like_2.fasta #This is position 2.
├───wmel_like_3.fasta #This is position 3. Phylograms and chronograms can be generated from this data by running the command 'revbayes phylogram.Rev' or 'revbayes relaxed_clock_7_7_relative_chronogram.Rev' in this directory. The referenced scripts are present in the phylogenetics/scripts/revbayes_scripts folder. We used RevBayes 1.1.1 for this manuscript.
├───individual_genes/ #Folder containing the above genes, but not concatenated. Each file inside contains the sequence data for the named gene in FASTA format.
├──wmel_plus_wri_like/ #Contains the sequence data used to generate the phylogenetic trees for the 20 wMel-like Wolbachia plus 8 wRi-like Wolbachia. The contents are arranged exactly as the above wmel_like folder.
├──wri_like/ #Contains the sequence data used to generate the phylogenetic trees for the 8 wRi-like Wolbachia. The contents are arranged exactly as the above wmel_like folder.
├─mitochondria/ #Contains the 13 mitochondrial protein-coding genes concatenated from the species we used for pairwise comparisons. Each folder inside contains one species pair (or 4 species, in the Nomada folder). Pairwise differences can be calculated with snp_count.py, which is in the phylogenetics/scripts folder.
├──bmalayi_bpahangi/ #Brugia malayi and B. pahangi
├──bocqueti_affchauv/ #Drosophila bocqueti and D. aff. chauvacae
├──borealis_incompta/ #Drosophila borealis and D. incompta
├──erecta_yakuba/ #Drosophila erecta and D. yakuba
├──grimshawi_willistoni/ #Drosophila grimshawi and D. willistoni
├──mel_sim/ #Drosophila melanogaster and D. simulans
├──mojavensis_virilis/ #Drosophila mojavensis and D. virilis
├──nomada/ #Nomada flava, N. ferruginata, N. leucophthalma, and N. panzeri.
├──OOChengi_OVolvulus/ #Onchocerca ochengi and O. volvulus
├──pseudoobscura_suzukii/ #Drosophila pseudoobscura and D. suzukii
├──seguyi_malagassya/ #Drosophila seguyi and D. malagassya
├──ztaronus_ztscasi/ #Zaprionus taronus and Z. tsacasi
├─nuclear/ #Contains the 20 nuclear genes from the indicated species used to make chronograms in the manuscript. Each folder contains the 20 genes in FASTA format as [gene_name].fa, and the genes partitioned by codon position as [gene_name]_1.fasta, [gene_name]_2.fasta and [gene_name]_3.fasta. The chronograms can be run with the command 'revbayes relaxed_clock_7_7_relative_chronogram.Rev' in these directories. That script is in the phylogenetics/scripts/revbayes_scripts folder. The partitions.txt file in each directory is read by this command. We used RevBayes 1.1.1 for this manuscript.
├──borealis_incompta_chronogram/ #Drosophila borealis, D. hydei, D. incompta, and D. virilis
├──brugia_chronogram/ #Brugia malayi, B. pahangi, and Wuchereria bancrofti
├──montium_chronogram/ #Drosophila malagassya, D. bocqueti (as DMontium_STLow), D. aff. chauvacae (as DMontium_STUp), and D. jambulina
├──nomada_chronogram/ #Nomada flava, N. ferruginata, N. leucophthalma, and N. panzeri.
├──onchocerca_chronogram/ #Onchocerca ochengi, O. volvulus, and Dirofilaria immitis.
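
The codon-position partitions listed above ([gene_name]_1.fasta, [gene_name]_2.fasta, and [gene_name]_3.fasta) can be illustrated with a minimal Python sketch. It assumes each sequence is an in-frame coding alignment that starts at codon position 1. The function names `split_by_codon_position` and `split_alignment` and the toy sequences are hypothetical and are not taken from the scripts in this archive.

```python
from typing import Dict, Tuple


def split_by_codon_position(seq: str) -> Tuple[str, str, str]:
    """Return the sites at codon positions 1, 2, and 3 of an in-frame sequence."""
    return seq[0::3], seq[1::3], seq[2::3]


def split_alignment(alignment: Dict[str, str]) -> Tuple[Dict[str, str], ...]:
    """Split every sequence in {name: sequence} into three per-position alignments."""
    parts = ({}, {}, {})
    for name, seq in alignment.items():
        for target, sites in zip(parts, split_by_codon_position(seq)):
            target[name] = sites
    return parts


if __name__ == "__main__":
    # Toy in-frame sequences (assumption: not real data from the archive).
    toy = {"taxonA": "ATGGCCTAA", "taxonB": "ATGGCTTAA"}
    for position, part in enumerate(split_alignment(toy), start=1):
        print(position, part)
```

Each of the three returned dictionaries corresponds to one codon position and could be written out as its own FASTA file.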

codeml_control_files/ #Contains the CodeML control files used for the cif selective pressures analyses. The sequence data used is in the top-level cif_selective_pressure directory.
├─README.md
├─pairwise.control #CodeML control file for the pairwise model.
├─branch_model.control #CodeML control file for the branch model.
├─site_model.control #CodeML control file for the site models.

alphafold/ #Contains the data and scripts used for the AlphaFold analyses on cif sequences.
├─data/
├──README.md #Describes the contents and structure of the following 3 zip files in detail:
├──alphafold_best.zip
├──alphafold_fasta.zip
├──alphafold_tm.zip
├─scripts/
├──README.md #Describes the following scripts:
├──cifB_tm_score_calcs.py
├──TM_analysis.R
├──plDDT_analysis.R
├──PyMol_TM.py
├──alphafold.sh
├──cifA-TM-score-calcs.py

cif-structure-evolution/
├─data/
├──README.md #Describes the contents and structure of the following 4 zip files in detail:
├──Cif_identity.zip
├──hhpred.zip
├──omega.zip
├──SWAKK.zip
├─scripts/
├──README.md #Describes the following scripts:
├──omega_2d_structure.R
├──omega_3d_structure.R
├──cif_identity_analysis.Rmd
├──hhpred.sh

wolbachia_single_copy_gene_extraction/ #Contains the pipeline for identification and extraction of single copy genes shared between Wolbachia genomes.
├─README.md #Contains usage instructions
├─gene_extraction_pipeline.bash #Master script to run the pipeline. Expects a list of arguments where each argument corresponds to a folder in the current directory containing a fasta file of that name. For example, a folder called wMel containing wMel.fa. More usage instructions included in the README and the script itself. Depends on Prokka (https://github.com/tseemann/prokka).
├─prokka_genus/ #Contains the custom Wolbachia genus database we used. For exact replication, the contents should be put into your Prokka install's genus database folder.
├──Wolbachia.phr
├──Wolbachia.pin
├──Wolbachia.psq
├─scripts/ #Contains Python scripts called by the pipeline.
├──check_gaps.py
├──fasta_concatenate.py
├──fasta_extractor.py
├──fastamutate.py
├─test/ #Contains Wolbachia genomes in the file structure expected by the pipeline for testing it. The Prokka annotations of these genomes have already been done, but can be redone by removing the prokka folder in each of these subdirectories. To run the test, use the command 'gene_extraction_pipeline.bash wAna wAu wHa wMel wRi' in this directory.
├──wAna/
├──wAu/
├──wHa/
├──wMel/
├──wRi/
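
The input layout the pipeline expects (one folder per argument, each holding a FASTA file of the same name) can be checked before a run with a short Python sketch. The helper `missing_inputs` is hypothetical and is not part of the pipeline. It only tests for the `<name>/<name>.fa` files described above.

```python
from pathlib import Path
from typing import List


def missing_inputs(names: List[str], root: Path) -> List[Path]:
    """Return the expected <name>/<name>.fa paths that do not exist under root."""
    expected = [root / name / f"{name}.fa" for name in names]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    import sys

    # Genome names are taken from the command line, mirroring the pipeline's arguments.
    for path in missing_inputs(sys.argv[1:], Path.cwd()):
        print(f"missing: {path}")
```

Run from the test directory with the arguments `wAna wAu wHa wMel wRi`, the sketch should print nothing if every genome folder contains its FASTA file.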

recombination_analyses/ #Contains the single copy genes used to test for intragenic recombination in the wMel-likes, wRi-likes, the seven supergroup A genomes, and nematode Wolbachia.
├─README.txt #Contains details on the GARD (https://github.com/veg/hyphy/) commands used to run the analyses.
├─GARD_artifact_example/ #Contains input sequence and GARD output for an artifact GARD produces when examining very closely related sequences.
├──RefSeq_WP_010962405.1.fa #FASTA format input sequence
├──RefSeq_WP_010962405.1_gard_results.txt #GARD output containing the artifact
├─seven_supergroup_A/ #Contains the individual genes used for GARD analyses from the seven supergroup A Wolbachia genomes.
├──individual_genes/ #Contains the sequence data in the format [gene_name].fa
├──outlier_genes/ #Contains the sequence data for the sole outlier gene. It shows evidence of horizontal transmission from distantly related Wolbachia.
├───RefSeq_WP_010962975.1.fa
├─supergroup_D_nematode/ #Contains the individual genes used for GARD analyses from the supergroup D nematode Wolbachia wLs, wBm, wBp, and wWb.
├──individual_genes/ #Contains the sequence data in the format [gene_name].fa
├─wmel_like/ #Contains the individual genes used for GARD analyses from the 20 wMel-like Wolbachia genomes.
├──all_genes/ #Contains the sequence data in the format [gene_name].fa
├──outlier_genes/ #Contains the sequence data for the four outlier genes. They do not show evidence of horizontal transmission from distantly related Wolbachia.
├─wri_like/ #Contains the individual genes used for GARD analyses from the eight wRi-like Wolbachia.
├──individual_genes/ #Contains the sequence data in the format [gene_name].fa

cif_selective_pressure/ #Contains the gene sequences and trees used as input for the codeML cif selection analyses. The codeML control files are in the top level codeml_control_files directory. The sequence data is in the PHYLIP format accepted by codeML.
├─README.txt #Contains detailed information on and commands used to generate the trees and partitions in this folder.
├─cifA_truncated/ #Contains input sequence data and trees for analyses on truncated cifA genes.
├──partition1_tree.newick #Newick format tree for the first partition
├──partition2_tree.newick #Newick format tree for the second partition
├──cifA_truncated.phy #PHYLIP format sequence data, unpartitioned
├──cifA_truncated_partition1.phy #PHYLIP format sequence data for partition 1
├──cifA_truncated_partition2.phy #PHYLIP format sequence data for partition 2
├─cifA_intact/ #Contains sequence data and trees in the same format as the above directory
├──partition1_tree.newick
├──partition2_tree.newick
├──cifA_truncated.phy
├──cifA_truncated_partition1.phy
├──cifA_truncated_partition2.phy
├─cifB_intact/ #Contains sequence data and the tree for analyses on intact cifB genes.
├──cifB-intact.newick #The Newick format tree for this gene
├──cifB-intact.phy #PHYLIP format sequence data, unpartitioned
├──dub.phy #The deubiquitinase domain only
├──nuc1.phy #Nuclease 1 domain only
├──nuc2.phy #Nuclease 2 domain only
├──nondomain.phy #Nondomain regions only

SI_movies/ #Contains supplementary movies 1, 2 and 3.
├─README.txt #Contains legends for each movie.
├─Movie S1.mp4
├─Movie S2.mp4
├─Movie S3.mp4

serine_recombinase_sequences/ #Contains the serine recombinase sequences used in the manuscript in FASTA format.
├─README.txt
├─sr1WO.fa
├─sr2WO.fa
├─sr3WO.fa

cif_sequences/ #Contains the cif sequences used in the manuscript in FASTA format.
├─README.txt
├─genes/ #Contains the data. Files are named [Wolbachia name].fa and contain all cifs in that genome.

Code/software

Gene sequence FASTA and PHYLIP files are plain text. Phylogenetic trees can be viewed with FigTree (https://tree.bio.ed.ac.uk/software/figtree/).
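
Because the FASTA files are plain text, pairwise differences between aligned sequences can be tallied with a few lines of Python. The sketch below is not the archive's snp_count.py. The function names and the choice to skip sites where either sequence has a gap or an N are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {name: sequence}."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def pairwise_differences(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Count differing sites for every pair of aligned sequences.

    Assumption: sites where either sequence has '-' or 'N' are skipped.
    """
    skip = {"-", "N"}
    counts: Dict[Tuple[str, str], int] = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        if len(seq_a) != len(seq_b):
            raise ValueError(f"{name_a} and {name_b} differ in length; input must be aligned")
        counts[(name_a, name_b)] = sum(
            1
            for a, b in zip(seq_a.upper(), seq_b.upper())
            if a != b and a not in skip and b not in skip
        )
    return counts


if __name__ == "__main__":
    # Toy alignment (assumption: not real data from the archive).
    toy = ">a\nACGT-A\n>b\nACGTTA\n>c\nTCGANA\n"
    for pair, n in pairwise_differences(read_fasta(toy)).items():
        print(pair, n)
```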

Phylogenetic reconstruction was done with RevBayes 1.1.1 (https://github.com/revbayes/revbayes).

Selection analyses were done with CodeML 4.10.7 (https://github.com/abacus-gene/paml). Trees used as input for CodeML were generated with raxml-ng 1.2.2 (https://github.com/amkozlov/raxml-ng).
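
Converting a relative chronogram to absolute time amounts to rescaling its branch lengths by a constant. The following Python sketch assumes the calibration is supplied as an absolute age for a node whose relative age is known, and that the tree is a plain Newick string without annotations. It illustrates the idea only and is not the archive's relative_to_absolute_chronogram.py, which is distributed with .log and .trees test files and may work differently. The function names are hypothetical.

```python
import re

# Matches a branch length such as ":0.25" or ":1e-3" in a plain Newick string.
_BRANCH_LENGTH = re.compile(r":(\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)")


def rescale_newick(newick: str, factor: float) -> str:
    """Multiply every branch length in a Newick string by factor."""

    def scale(match: "re.Match[str]") -> str:
        return ":" + format(float(match.group(1)) * factor, ".10g")

    return _BRANCH_LENGTH.sub(scale, newick)


def relative_to_absolute(newick: str, relative_age: float, absolute_age: float) -> str:
    """Rescale a relative chronogram so a node of relative_age has absolute_age."""
    if relative_age <= 0:
        raise ValueError("relative_age must be positive")
    return rescale_newick(newick, absolute_age / relative_age)


if __name__ == "__main__":
    # Toy ultrametric tree with root age 1 (assumption: not a tree from the archive).
    toy = "((A:0.25,B:0.25):0.75,C:1.0);"
    print(relative_to_absolute(toy, relative_age=1.0, absolute_age=2.0e6))
```

Because every branch is multiplied by the same factor, an ultrametric tree stays ultrametric and all node ages scale together.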

Intragenic recombination analyses were done with GARD 2.5.52 (https://github.com/veg/hyphy/).

The Wolbachia single copy gene extraction pipeline depends on Prokka (https://github.com/tseemann/prokka) for annotation and MAFFT (https://github.com/GSLBiotech/mafft) for alignment. We used Prokka 1.14.5 and MAFFT 7.453.

Scripts in this repository are written in R, Python, Bash, or Rev (the RevBayes scripting language).

Access information

Other publicly accessible locations of the data:

Phylogenetic inference scripts and the Wolbachia single copy gene identification and extraction pipeline can also be found here: https://github.com/brandonscooper/calibrating-wolbachia