Phylogenomic discordance is driven by wide-spread introgression and incomplete lineage sorting during rapid species diversification within rattlesnakes (Viperidae: Crotalus and Sistrurus)

Myers, Edward 1 ; Rautsaw, Rhett2; Borja, Miguel3; Jones, Jason4; Grünwald, Chris4; Holding, Matthew5; Grazziotin, Felipe6; Parkinson, Christopher2

Published Jun 20, 2024 on Dryad. https://doi.org/10.5061/dryad.crjdfn396

Data files

Jun 20, 2024 version files 87.14 MB

All_gene_trees.zip

22.31 MB
Final_Consensus_Dated_Tree.tre

73.90 KB
Rattlesnake_final_fastas.zip

64.71 MB
README.md

5.44 KB
Supplemental_Species_Groups_Table_S1.xlsx

12.46 KB
Supplemental_Table_S2.xlsx

33.48 KB

Abstract

Phylogenomics allows us to uncover the historical signal of evolutionary processes through time and estimate phylogenetic networks accounting for these signals. Insight from genome-wide data further allows us to pinpoint the contributions to phylogenetic signal from hybridization, introgression, and ancestral polymorphism across the genome. Here, we focus on how these processes have contributed to phylogenetic discordance among rattlesnakes (genera Crotalus and Sistrurus), a group for which there are numerous conflicting phylogenetic hypotheses based on a diverse array of molecular datasets and analytical methods. We address the instability of the rattlesnake phylogeny using genomic data generated from transcriptomes sampled from nearly all known species. These genomic data, analyzed with coalescent and network-based approaches, reveal numerous instances of rapid speciation where individual gene trees conflict with the species tree. Moreover, the evolutionary history of rattlesnakes is dominated by incomplete speciation and frequent hybridization, both of which have likely influenced past interpretations of phylogeny. We present a new framework in which the evolutionary relationships of this group can only be understood in light of genome-wide data and network-based analytical methods. Our data suggest that network radiations, like those seen within the rattlesnakes, can only be understood in a phylogenomic context, necessitating similar approaches in our attempts to understand evolutionary history in other rapidly radiating species.

La filogenómica nos permite descubrir la señal histórica de los procesos evolutivos a través del tiempo y estimar redes filogenéticas tomando en cuenta estas señales. El conocimiento de datos genómicos incluso permiten distinguir la contribución de la señal filogenética de la hibridación, introgresión, y de polimorfismos ancestrales a lo largo del genoma. En este trabajo nos enfocamos en como estos procesos han contribuido a la discordancia filogenética entre las serpientes de cascabel, un grupo en el que hay numerosos conflictos en las hipótesis filogenéticas obtenidas de un grupo variado de datos moleculares y métodos analíticos. Nosotros abordamos la inestabilidad de la filogenia de las serpientes de cascabel (generos Crotalus y Sistrurus) usando datos genómicos generados de transcriptomas muestreados en la mayoría de las especies conocidas. Estos datos genómicos, analizados con métodos basados en coalescencia y redes filogenéticas, revelaron numerosos casos de especiación rápida donde los arboles de genes individuales conflictúan con el árbol de especies. Además, la historia evolutiva de las serpientes de cascabel esta dominada por una especiación incompleta y una frecuente hibridación, las cuales probablemente han influenciado interpretaciones pasadas de las filogenias. Nosotros presentamos un nuevo marco en el que las relaciones evolutivas de este grupo solo pueden ser entendidas en base a datos de genomicos y métodos analíticos basados en redes filogenéticas. Nuestros datos sugieren que la radiación en redes filogenéticas, como se ha visto dentro de las serpientes de cascabel, solo puede ser entendida en un contexto filogenómico, necesitando aproximaciones similares en nuestro intento de entender la historia evolutiva en otras especies con radiaciones rápidas.

Sequence data, individual ML gene trees, and dated species tree

Description of the data and file structure

Rattlesnake_final_fastas.zip - This directory contains 3,707 fasta files used for data analysis in this paper; all fasta files end in .fasta. Loci in this directory include benchmarking universal single-copy othologs ("BUSCOs"; these are named like '301078at32523.fasta'), ultra-conserved loci (named for example like 'uce-7817.fasta'), anchored hybrid enrichment loci ('AHE-L223.fasta'), traditional Sanger loci (e.g. 'HLCS.fasta'), and novel orthologous transcript sequences ('OG0005327.fasta').

All_gene_trees.zip - this directory contains 3,707 ML gene trees from IQtree; all gene tree files end in .treefile. The gene trees are named similarly to the above locus file names and each name matches directly to the names in the fasta directory.

Final_Consensus_Dated_Tree.tre - the dated species tree with all outgroup taxa.

Description of supplemental files

Supplementary_Text_S1.docx (Zenodo)- Additional methods regarding the sequence data assembly.

Supplemental_Figures.pdf (Zenodo) - Supplemental figures, including (1) ML mtDNA gene tree with all samples, (2) comparison of coalescent based species tree, concatenated species tree, and mtDNA gene tree, (3) support for varying H max values from SNaQ analyses, and (4) resulting network phylogenies with different H max values.

Supplemental_Species_Groups_Table_S1.xlsx - Overview of previously hypothesized species groups within rattlesnakes and our updated interpretation of these groups based on our phylogenomic analyses.

Supplemental_Table_S2.xlsx - Table with information regarding the specimens sequenced in this study ('Specimen ID' and 'field series' columns), their species assignment ('Species' and 'Astral Spp Assign' headings), SRA accession numbers ('GenBank SAMN' heading), frequency of missing data ('Frequency of loci recover for sample' column), and museum accession numbers where available ('Museum Accession' column).

Additional details on data analysis:

To assembly the genomic sequence data we used the PhyProbe pipeline:

https://github.com/RhettRautsaw/PhyProbe

To assembly mtDNA loci we followed the MitoSIS pipeline:

https://github.com/RhettRautsaw/MitoSIS

IQTREE was run for each loci independently using the following:

iqtree2 -s xx -pre xx -B 1000 -T AUTO

IQTREE was also used to get confidence intervals around branch lengths for divergence dating analyses:

iqtree2 -s conc_Crotalus_genome_one_spec_taxon.fa -p Crotalus_partitions.txt.best_scheme.nex -b 1000 -g Astral_Crot_NoBoots.tre --prefix Astral_Crot_Constr_08032021 -T AUTO

ASTRAL analyses were run with the following (note that additional analyses in ASTRAL, e.g., the polytomy test, require additional flags):

java -D"java.library.path=lib/" -jar astral.5.15.5.jar -i FINAL_CAT_TREE_FILE.tre -a Crot_spp_assigns_phased.txt -o Crotalus_astral_tree.tre -T 80 2>out.log

To create a vcf file for introgression analyses in Dsuite we used STAR, samtools, bcftools, and vcftools as follows (note this is for a single sample labeled as 'Cscut-CLP1823'):

STAR --genomeDir /zfs/venom/Myers/Crotalus_fastqs --runThreadN 4
--readFilesIn fastqs/Cscut-CLP1823_RNA_VG_R1_trim.fastq.gz,fastqs/Cscut-CLP1823_RNA_VG_R2_trim.fastq.gz
--readFilesCommand zcat
--outFileNamePrefix results/Cscut-CLP1823
--outSAMattrRGline ID:Cscut-CLP1823
--outSAMtype BAM SortedByCoordinate
--outSAMattributes Standard

samtools markdup Cscut-CLP1823_positionsort.bam Cscut-CLP1823_markdup.bam

gatk CreateSequenceDictionary -R GCA_003400415.2_UTA_CroVir_3.0_genomic.fna

samtools faidx GCA_003400415.2_UTA_CroVir_3.0_genomic.fna

samtools index Cscut-CLP1823_markdup.bam

bcftools mpileup -f ../GCA_003400415.2_UTA_CroVir_3.0_genomic.fna Cscut-CLP1823_markdup.bam > Cscut-CLP1823_raw.bcf

bcftools mpileup -Ou -f ../GCA_003400415.2_UTA_CroVir_3.0_genomic.fna Cscut-CLP1823_markdup.bam | bcftools call -mv -Ob -o Cscut-CLP1823_calls.bcf

bcftools mpileup -Ou -f GCA_003400415.2_UTA_CroVir_3.0_genomic.fna -r CM012306.1 bam/Cscut-CLP1823.markdup.bam | bcftools call -mv -Ob -o Cscut-CLP1823_calls.bcf

Run bcftools mpileup across multiple samples (only two are listed here)

bcftools mpileup -Ou -f ../GCA_003400415.2_UTA_CroVir_3.0_genomic.fna Cscut-CLP1823_markdup.bam Cadam-DRR0044_markdup.bam | bcftools call -vmO z --format-fields GQ,GP -o output_04262022.vcf.gz

vcftools to reduce missing data

vcftools --vcf total_chroms_renamed.vcf --keep Dsuite2Keep.txt --max-missing 0.5 --recode --out Dsuite_05102022_50p

Lastly we ran Dsuite on the filtered vcf file:

Build/Dsuite Dtrios -c -n Dsuite_05112022 -t Dsuite_phylo.tre Dsuite_05102022_50p.recode.vcf Dsuite_spp_assigns.txt

To run Snaq in julia we used the following commands after creating the concordance factor table from all individual gene trees:

tree = readTopology("Astral_all_taxa_08102022.tre");
CF = readTableCF("tableCF_all_taxa_reduced_08292022.csv");
network0_all_taxa_08292022 = snaq!(tree, CF, hmax=0, filename="network0_all_taxa_08292022", seed=8765)