Supplementary information from: Dating the origin of a viral domestication event in parasitoid wasps attacking Diptera

Data files

Nov 20, 2024 version files (10.05 MB)

Cynipoidea_viral_domestication-Dryad.zip: 10.03 MB
README.md: 14.13 KB

Jan 07, 2025 version files (9.95 MB)

Cynipoidea_viral_domestication-Dryad.zip: 9.94 MB
README.md: 14.13 KB

Abstract

Over the course of evolution, hymenopteran parasitoids have developed a close relationship with heritable viruses, sometimes integrating viral genes into their chromosomes. For example, in Drosophila parasitoids belonging to the Leptopilina genus, 13 viral genes from the Filamentoviridae family have been integrated and domesticated to deliver immunosuppressive factors to host immune cells, thereby protecting parasitoid offspring from host immune response. The present study aims to comprehensively characterize this domestication event in terms of the viral genes involved, the wasp diversity affected by this event, and its chronology. Our genomic analysis of 41 Cynipoidea wasps from six subfamilies revealed 18 viral genes that were endogenized during the early radiation of the Eucoilini+Trichoplastini clade around 75 million years ago. Wasps from this highly diverse clade develop not only from Drosophila but also from a variety of Schizophora. This event coincides with the radiation of Schizophora, a highly speciose Diptera clade, suggesting that viral domestication facilitated wasp diversification in response to host diversification. Additionally, in one of the species, at least one viral gene was replaced by another gene deriving from a related Filamentovirus. This study highlights the impact of viral domestication on the diversification of parasitoid wasps.

https://doi.org/10.5061/dryad.n8pk0p35c

Description of the data and file structure

This repository contains the supplementary figures and tables from the paper "Dating the origin of a viral domestication event in parasitoid wasps attacking Diptera".

Supplementary figures :

Figures S15. Alignment of Eucoilini+Trichoplastini EVEs along with filamentous virus genes. All plots were obtained using the msa R package. Abbreviations: TryEFV: Trybliographa Endogenous Filamentous Elements (EFV); TrEFV: Trichoplasta EFV; RhEFV: Rhoptromeris EFV; LhEFV: Leptopilina heterotoma EFV; LcEFV: Leptopilina clavipes EFV; LbEFV: Leptopilina boulardi EFV; LhFV: Leptopilina heterotoma Filamentous Virus (FV); LbFV: Leptopilina boulardi FV; EfFV: Encarsia formosa FV; PcFV: Psyttalia concolor FV; PoFV: Platygaster orseoliae FV; CcFV1 and CcFV2: Cotesia congregata FV 1 and 2.

Figures S16. Individual gene phylogenies of EVEs found in Eucoilini+Trichoplastini wasp species. Phylogenies include sequences from free-living viruses in red and sequences from parasitic wasps in blue. The phylogeny type is indicated in the left corner of each figure (see Fig. S5 for type information). EVEs belonging to species in the Eucoilini+Trichoplastini clade are shown in light blue, while the other known EVEs are shown in pale blue. Sequences found in eukaryote and bacterial genomes are shown in green and black, respectively. Confidence scores (aLRT % / ultrafast bootstrap support %) are shown at each node.

Supplementary tables :

TableS1. Genomic assemblies information.

Detail of the table :

Each sub-table provides detailed information on sequencing and assembly metrics, including BUSCO metrics and scaffold size distributions. The metrics are calculated at each step using various software tools, such as MEGAHIT, SOAPdenovo-fusion, and Redundans. For this study, only the assembly with the fewest missing BUSCO genes was retained for further analysis. The notation "n/a" indicates that a specific metric was not computed for a given assembly. Additionally, the "Available raw data and assemblies" sub-table includes all information about the submission of the assemblies to NCBI.

TableS2. Sample information.

Detail of the table :

This table lists all the Cynipoidea specimens screened by PCR for the Filamentovirus ORF96. It includes the following columns :

Providers: The researcher who supplied the sample data.
sample#: The unique identifier assigned to each sample for tracking and reference purposes.
taxon: The biological classification (genus and species when possible) of the organism from which the sample was taken.
Voucher location: The institution where the physical voucher specimen of the sample is stored for future reference.
barcode: The molecular sequence identifier used to uniquely identify the species or sample in genetic databases.
location: The geographic location where the sample was collected.
Method: The technique used to collect the sample.

TableS3. Homology search information for the 25 filamentous proteins endogenized in Cynipid wasps.

Detail of the table :

This table contains the homology search information for the 25 filamentous proteins endogenized in Cynipid wasps. HHpred (on the PDB_mmCIF70_12_Aug database) and HMMER (on the alphafold_uniprot_Aug_2022 database) were run on all the cluster amino-acid alignments; the corresponding result columns carry HHpred and HMMER in their names. These analyses were done using the MPI Bioinformatics Toolkit (Zimmermann et al., 2018). NA means the information could not be computed for this specific column. It includes the following columns :

Function/activity by similarity: The predicted biological function or activity of the protein/sequence, inferred from similarity to known sequences.
Custer_name: The name of the cluster.
ORF_name: The name or identifier of the Open Reading Frame (ORF), representing the EVE in the genome that could potentially code for a protein.
Prot_name: The name or description of the protein encoded by the ORF.
Filamentous core: Indicates whether the EVE is part of a filamentous structural core, possibly related to specific cellular or viral structures.
Eucoilini core: Indicates whether the EVE is part of the core set of proteins specific to the Eucoilini group.
ID-HHpred: The identifier of the match found using HHpred, a bioinformatics tool for protein homology detection and structure prediction.
Name-HHpred: The name or description of the match found using HHpred.
E-value-HHpred: The E-value from HHpred, i.e. the expected number of chance matches scoring at least this well; lower values indicate stronger confidence in the similarity.
Score_HHpred: The score assigned by HHpred, quantifying the quality of the match; higher scores indicate better alignment.
ID-HMMER: The identifier of the match found using HMMER, a tool for sequence alignment and domain detection.
Name-HMMER: The name or description of the match found using HMMER.
E-value-HMMER: The E-value from HMMER, i.e. the expected number of chance matches scoring at least this well; lower values indicate higher confidence.
Bitscore-HMMER: The bit score from HMMER, measuring the significance of the sequence match; higher bit scores reflect more significant matches.

TableS4. Full table containing all EVE metrics, including BLAST, clusters, dN/dS analysis and phylogenetic clade numbers. This includes all the free-living viral sequences as well as all the putative Endogenous Viral Elements (EVEs) associated with them. Sequences that contain n/a in the start and end positions are sequences from free-living viruses. All other cells containing n/a correspond to cells where the information was not available.

Detail of the table :

General EVE and virus loci information :

Cluster_hmmer: Identifier for the cluster based on HMMER analysis. The suffix "_redefined" means this sequence was added to the cluster after the HMMER process step.
Full_names: Full-length names or annotations of sequences or genes.
Short_names: Abbreviated or shorthand names for easier reference.
start: Start position of the region on the scaffold.
end: End position of the region on the scaffold.
strand: Direction of the sequence (+ for forward, - for reverse).
Cluster_blast: Cluster identifier based on BLAST results.
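
As a minimal sketch of how the n/a convention on start and end can be used to separate free-living virus sequences from EVEs, assuming TableS4 has been exported to CSV (the export format, the reduced column set and the three sample rows below are hypothetical, not taken from the dataset):

```python
import csv
import io

# Hypothetical three-row excerpt; the real TableS4 has many more columns.
SAMPLE = """Short_names,start,end,strand
virus_example,n/a,n/a,n/a
EVE_example_1,1200,2450,+
EVE_example_2,300,1540,-
"""


def split_viruses_and_eves(rows):
    """Separate free-living virus rows (start and end both 'n/a') from EVE rows."""
    viruses, eves = [], []
    for row in rows:
        if row["start"] == "n/a" and row["end"] == "n/a":
            viruses.append(row)
        else:
            eves.append(row)
    return viruses, eves


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
viruses, eves = split_viruses_and_eves(rows)
print(len(viruses), "free-living virus sequences;", len(eves), "EVEs")
```

With the real table, `io.StringIO(SAMPLE)` would be replaced by an open file handle on the exported CSV.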

BLAST information :

evalue: E-value from the BLAST analysis, indicating alignment significance.
bits: Bit score from the BLAST analysis, representing alignment quality.
Length: Length of the BLAST hit (HSP).
query: Query sequence name.
target: Target sequence name.
pident: Percentage identity between query and target sequences.
alnlen: Alignment length between the query and target sequences.
mismatch: Number of mismatches in the alignment.
gapopen: Number of gap openings in the alignment.
tstart: Start position of the target sequence in the alignment.
tend: End position of the target sequence in the alignment.
qstart: Start position of the query sequence in the alignment.
qend: End position of the query sequence in the alignment.
qlen: Length of the query sequence.
tlen: Length of the target sequence.
tcov: Coverage percentage of the target sequence.
qcov: Coverage percentage of the query sequence.
Species_name: Name of the species associated with the scaffold or feature.
Species_scaffold_name: Combined identifier for species and scaffold.

Open Reading Frame (ORF) information :

ORF_query: ORF identifier for the query sequence.
ORF_target: ORF identifier for the target sequence.
ORF_start: Start position of the ORF.
ORF_end: End position of the ORF.
ORF_strand: Strand orientation of the ORF.
Overlapp_ORF_EVEs: Overlap of ORFs with Endogenous Viral Elements (EVEs).

Domestication analysis information (dN/dS) :

Mean_dNdS: Average dN/dS ratio for assessing selection pressure.
Pvalue_dNdS: P-value for the dN/dS ratio comparison.
SE_dNdS: Standard error for the dN/dS ratio.
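
The dN/dS columns can be used to shortlist loci that look constrained. The sketch below is only an illustration under stated assumptions: it assumes that Pvalue_dNdS tests the ratio against neutrality (dN/dS = 1), which this README does not specify, and the 0.05 threshold, the helper names and the sample rows are all hypothetical.

```python
def parse_float(cell):
    """Return the cell as a float, or None for the table's 'n/a' marker."""
    return None if cell.strip().lower() in ("n/a", "na", "") else float(cell)


def looks_constrained(row, alpha=0.05):
    """True when Mean_dNdS is below 1 and Pvalue_dNdS is below alpha.

    Assumption: the p-value compares dN/dS with 1; alpha is an arbitrary choice.
    """
    mean = parse_float(row["Mean_dNdS"])
    pvalue = parse_float(row["Pvalue_dNdS"])
    if mean is None or pvalue is None:
        return False  # metric not available for this row
    return mean < 1.0 and pvalue < alpha


# Hypothetical rows, not taken from TableS4.
rows = [
    {"Short_names": "EVE_a", "Mean_dNdS": "0.15", "Pvalue_dNdS": "0.001"},
    {"Short_names": "EVE_b", "Mean_dNdS": "0.92", "Pvalue_dNdS": "0.40"},
    {"Short_names": "virus_c", "Mean_dNdS": "n/a", "Pvalue_dNdS": "n/a"},
]
selected = [r["Short_names"] for r in rows if looks_constrained(r)]
print(selected)
```

Rows where either metric is n/a are skipped rather than treated as zero, so missing values cannot pass the filter by accident.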

Scaffold information :

Scaffold_name: Name or identifier of the scaffold.
Virus_count: Count of viral sequences within the scaffold.
euk_count: Count of eukaryotic sequences within the scaffold.
repeat_count: Count of repetitive sequences within the scaffold.
Scaffold_length: Total length of the scaffold.

Endogenisation event information :

Event: Number of the endogenisation event (forming a monophyletic clade in the phylogeny).
Nsp_MRCA: Number of Hymenoptera species in the clade defined by the most recent common ancestor of the Hymenoptera species containing the EVEs within the Event.
Nloc: Number of different loci present within the event.
Nsp: Number of different species present within the event.
Pseudogenized: Indicates whether the sequence has become a pseudogene (presence of a premature stop codon).
Boot: Bootstrap value of the monophyletic clade supporting the Event.
Nsp_losses: Number of Hymenoptera species where the EVE is not present within the clade.

Endogenisation arguments by comparison to other scaffolds containing BUSCO genes :

Mean_cov_depth_candidat: Average coverage depth of the mapped reads on the scaffold containing the EVE.
Median_cov_depth_candidat: Median coverage depth of the mapped reads on the scaffold containing the EVE.
Mean_GC_candidat: Average GC content percentage on the scaffold containing the EVE.
pvalue_cov_mean: Statistical p-value for the mean coverage depth comparison.
pvalue_cov_median: Statistical p-value for the median coverage depth comparison.
Median_cov_depth_BUSCO: Median coverage depth for BUSCO genes among all scaffolds of the assembly.
Mean_cov_depth_BUSCO: Average coverage depth for BUSCO genes among all scaffolds of the assembly.
Mean_GC_BUSCO: Average GC content percentage for BUSCO genes among all scaffolds of the assembly.
FDR_cov_global: False Discovery Rate for global coverage comparisons between the BUSCO coverage distribution and the observed coverage on the scaffold containing the EVE.
FDR_pvalue_cov_mean: Adjusted p-value (FDR) for mean coverage depth.
FDR_pvalue_cov_median: Adjusted p-value (FDR) for median coverage depth.
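
For readers unfamiliar with FDR-adjusted p-values such as FDR_pvalue_cov_mean, the sketch below shows one common adjustment, the Benjamini-Hochberg procedure. This README does not state which FDR method was used, so the choice of Benjamini-Hochberg is an assumption, and the input p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: the FDR method behind the table is not documented here;
    this is shown only as a typical example of such an adjustment.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four scaffolds.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print([round(p, 4) for p in adjusted])
```

An adjusted value is never smaller than the raw p-value it comes from, which is why a scaffold can be significant before correction and not after.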

Code and software

All Snakemake workflows, along with detailed software configurations and scripts, are located in the "script" directory. This directory contains the following:

Workflow files: Snakemake pipelines that define the sequence of tasks and dependencies.
Scripts: Custom Python, R, or shell scripts used within the workflows.
Software versions: Documentation or configuration files specifying the required software and their versions for reproducibility.

The workflows include the following files :

Python Scripts (used within the snakemake files)

Add_Monophyletic_table.py
Adds a table summarizing monophyletic groups identified in a dataset, potentially integrating data from phylogenetic analyses.

Add_cov_metaeuk_TE_score.py
Processes coverage data, MetaEuk outputs, and transposable element (TE) scores to integrate additional information into analysis pipelines.

Add_dNdS_informations.py
Integrates dN/dS (nonsynonymous to synonymous substitution rate) information into existing datasets.

Create_BUSCO_files_for_dNdS.py
Prepares BUSCO (Benchmarking Universal Single-Copy Orthologs) files for dN/dS analysis, extracting and formatting orthologous sequences.

Mmseqs_clustering.py
Uses MMseqs2 for sequence clustering, enabling efficient similarity-based grouping.

Create_clusters.py
Clusters data (e.g., sequences or genes) based on a specified BLAST criterion.

Create_clusters_for_phylogeny_plot.py
Generates multiple sequence alignment cluster files for phylogenetic analysis.

Filter_Extract_and_translated_blast_loci.py
Filters, extracts, and translates loci based on BLAST results, possibly to identify coding regions or alignments.

Filter_NR_results.py
Filters results from a non-redundant (NR) database search.

Hmmer_clustering.py
Performs clustering of sequences using HMMER.

Gather_hmmer_clusters_and_add_informations.py
Collects and annotates clusters generated by HMMER (Hidden Markov Model analysis).

Remove_dup_clusters.py
Identifies and removes duplicate clusters to ensure non-redundant grouping in datasets.

save_syntenic_plot_script.py
Saves a script or configuration for generating syntenic (genomic region comparison) plots.

Synteny.py
Conducts synteny analysis, identifying conserved genomic arrangements between scaffolds containing EVEs.

Snakemake Scripts
These files manage the workflows using Snakemake.

Snakemake_Alignment_main_Datation: Aligns sequences and prepares data for molecular dating analysis.
Snakemake_Alignment_main_Phylogeny: Aligns sequences for phylogenetic tree construction.
Snakemake_Alignment_plot_Phylogeny: Generates plots for visualizing phylogenies from aligned data.
Snakemake_Blast_Clustering2_step1: Executes the first step of a BLAST and clustering-based workflow.
Snakemake_Clustering_hmmer_step2: Manages the second step, focused on HMMER clustering.
Snakemake_add_informations_step3: Adds metadata or integrates additional information into the workflow.
Snakemake_dNdS_alignment_and_Events_step4: Prepares alignments for dN/dS analysis and event detection.
Snakemake_dNdS_analysis_BUSCOs_part5: Analyzes dN/dS ratios in BUSCO-defined orthologs.
Snakemake_dNdS_analysis_EVES_part6: Examines dN/dS ratios in EVEs (endogenous viral elements).
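
The numbered suffixes (step1 to part6) suggest an execution order for the main pipeline. The sketch below builds one Snakemake dry-run command per step in that order. It is an illustration only: the order is inferred from the file names, the "script/" prefix assumes the snakefiles sit directly in the "script" directory, and any configuration the workflows may need is not covered. It uses the standard Snakemake options `--snakefile`, `--cores` and `-n` (dry run).

```python
import shlex

# Order inferred from the step/part suffixes in the file names (an assumption).
STEPS = [
    "Snakemake_Blast_Clustering2_step1",
    "Snakemake_Clustering_hmmer_step2",
    "Snakemake_add_informations_step3",
    "Snakemake_dNdS_alignment_and_Events_step4",
    "Snakemake_dNdS_analysis_BUSCOs_part5",
    "Snakemake_dNdS_analysis_EVES_part6",
]


def dry_run_command(snakefile, cores=1):
    """Build a Snakemake dry-run command (-n) for one workflow file."""
    return ["snakemake", "--snakefile", f"script/{snakefile}",
            "--cores", str(cores), "-n"]


commands = [dry_run_command(step) for step in STEPS]
for command in commands:
    # Print only; nothing is executed by this sketch.
    print(shlex.join(command))
```

A dry run lists the jobs Snakemake would schedule without executing them, which is a cheap way to check inputs before launching a step for real.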
