Ecological diversification preceded geographical expansion during the evolutionary radiation of Cataglyphis desert ants
Data files
May 06, 2024 version files 958.63 MB
Abstract
Biological diversity often arises as organisms adapt to new ecological conditions (i.e. ecological opportunities) or colonise suitable areas (i.e. spatial opportunities). Cases of geographical expansion followed by local ecological divergence are well described; they result in clades comprising ecologically heterogeneous subclades. In contrast, nothing is known about evolutionary radiation events in which ecological opportunities preceded spatial spread. Here, we show that the desert ant genus Cataglyphis likely originated in open grassland habitats in the Middle East ~18 million years ago and became a taxon of diverse species specialising in prey of different masses. Around 9 million years ago, southern Europe and northern Africa experienced aridification and were colonised by Cataglyphis, which was preadapted to the harsh environmental conditions. The result was the rapid accumulation of species, and the appearance of local assemblages containing species from different lineages that still displayed ancestral foraging specialties. These findings highlight that, in Cataglyphis, ecological diversification happened before the genus geographically spread into newly arisen suitable habitats, resulting in a clade composed of ecologically homogeneous subclades.
README: UCE sequencing Cataglyphis ants
This dataset contains files resulting from the analysis of ultra-conserved element (UCE) sequencing in 36 Cataglyphis ants, together with 11 outgroups that were sequenced in previous studies: Formica fuscocinerea, Iberoformica subrufa, Polyergus vinosus, Proformica mongolica, Rossomyrmex minuchae, Nylanderia terricola, Paratrechina longicornis, Lasius arizonicus, Myrmecocystus pyramicus, Myrmelachista joycei, and Gnamptogenys simulans.
We have submitted SPAdes assembly contigs for 47 species (genus_species_contigs.fasta), a concatenated UCE matrix (UCE_conca_75p.charsets and UCE_conca_75p.phylip), tree files (75p_method.treefile), and a UCE bait sequence file (hymenoptera-v2-PRINCIPAL-bait-set.fasta.txt).
All datafiles can be viewed in a standard text editor.
Description of the data and file structure
SPAdes assembly contigs
The files 'genus_species_contigs.fasta' contain SPAdes assembly contigs. First, 36 Cataglyphis samples were sequenced using a UCE phylogenomics approach. The resulting reads were cleaned and processed using PHYLUCE software v. 1.7.1. Raw reads were trimmed for adapter contamination using Illumiprocessor v. 2.0, which incorporates Trimmomatic. Trimmed reads were then assembled de novo into contigs using SPAdes. Finally, we added assemblies for 11 outgroup species that were obtained from previous studies (https://doi.org/10.5281/zenodo.4341310; https://doi.org/10.5281/zenodo.4061988).
UCE matrix
The file 'UCE_conca_75p.phylip' contains the concatenated UCE sequence matrix. For each species, a single row contains all DNA information.
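As an illustration of the one-row-per-species layout, the following is a minimal Python sketch of a reader for a sequential, relaxed PHYLIP matrix. It assumes a header line giving the number of taxa and sites, followed by one "name sequence" line per species; the exact layout of the deposited file has not been checked here, and the taxon names and sequences in the example are invented.

```python
import io

def read_relaxed_phylip(handle):
    """Read a sequential, relaxed PHYLIP matrix into a {taxon: sequence} dict.

    Assumed layout (not verified against the deposited file):
    first line "n_taxa n_sites", then one "name sequence" line per taxon.
    """
    header = handle.readline().split()
    n_taxa, n_sites = int(header[0]), int(header[1])
    matrix = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, seq = line.split(None, 1)
        matrix[name] = seq.replace(" ", "")
    if len(matrix) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, found {len(matrix)}")
    for name, seq in matrix.items():
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(seq)}")
    return matrix

# Toy matrix with invented names and sequences.
example = io.StringIO("3 8\ntaxon_A ACGTACGT\ntaxon_B ACG-ACGT\ntaxon_C AC??ACGT\n")
matrix = read_relaxed_phylip(example)
```

To read the real matrix, the StringIO object would be replaced by an open file handle on 'UCE_conca_75p.phylip'.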
The file 'UCE_conca_75p.charsets' contains the boundaries of each UCE locus within 'UCE_conca_75p.phylip'. For example, the locus uce-10.nexus spans from base 1 to base 1067.
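The locus boundaries can be used to cut a single locus out of a concatenated row. The sketch below assumes NEXUS-style lines of the form "charset 'uce-10.nexus' = 1-1067;", which is an assumption about the file layout rather than something stated above. The second locus in the example is invented for illustration; only the uce-10.nexus range (1 to 1067) comes from this README.

```python
import re

# Assumed line format: charset 'name' = start-end;  (quotes optional)
CHARSET_RE = re.compile(
    r"charset\s+'?([^'=\s]+)'?\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE
)

def parse_charsets(text):
    """Return {locus_name: (start, end)} with 1-based inclusive coordinates."""
    bounds = {}
    for match in CHARSET_RE.finditer(text):
        bounds[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return bounds

def slice_locus(sequence, start, end):
    """Extract a locus from a concatenated sequence.

    Converts 1-based inclusive coordinates to Python's 0-based half-open slice.
    """
    return sequence[start - 1:end]

example = """begin sets;
charset 'uce-10.nexus' = 1-1067;
charset 'uce-1000.nexus' = 1068-1500;
end;"""

bounds = parse_charsets(example)
start, end = bounds["uce-10.nexus"]
locus_length = end - start + 1  # inclusive range, so add 1
```

Because the coordinates are inclusive, a locus spanning bases 1 to 1067 is 1067 bp long.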
To identify and extract UCE sequences from the bulk set of contigs for all 47 species, we used the PHYLUCE program match_contigs_to_probes, which uses Lastz v. 1.063 to match probe sequences to contig sequences and to create a database of hits. The min-identity and min-coverage settings were both 80. Each UCE locus was aligned using MAFFT v. 7.130b and the default algorithm setting in PHYLUCE (FFT-NS-i). Different trimming strategies were tested: (i) no trimming, (ii) edge trimming, and (iii) trimming internal and external alignment regions using GBLOCKS, implemented with reduced stringency parameters (b1:0.5, b2:0.5, b3:12, b4:7). Based on the descriptive statistics generated by PHYLUCE and AMAS, the GBLOCKS strategy was used in further analyses because it achieved a good compromise between the number of informative sites and the percentage of missing data. All the loci were then concatenated to form a supermatrix, and the alignments were filtered for taxon occupancy, requiring each locus to be found in at least 0, 50, 75, 85, 90, 95, or 100% of the taxa. For each of the seven supermatrices, descriptive statistics were generated using PHYLUCE and AMAS. For further analyses, the locus set filtered for 75% taxon occupancy was used because it achieved a good compromise between the number of informative sites and the percentage of missing data.
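The taxon occupancy filter can be sketched in a few lines of Python. This is an illustration of the idea only, not the PHYLUCE implementation: the function name, the toy loci, and the rule "keep a locus when its fraction of taxa is at least the threshold" are assumptions, and PHYLUCE may round the threshold differently.

```python
def filter_by_occupancy(loci, total_taxa, min_fraction):
    """Keep loci present in at least min_fraction of the taxa.

    loci maps a locus name to the set of taxa in which it was recovered.
    """
    return {
        name: taxa
        for name, taxa in loci.items()
        if len(taxa) / total_taxa >= min_fraction
    }

# Toy data: four hypothetical loci recovered in 4, 3, 2, and 1 of 4 taxa.
loci = {
    "locus_1": {"A", "B", "C", "D"},
    "locus_2": {"A", "B", "C"},
    "locus_3": {"A", "B"},
    "locus_4": {"A"},
}

# The seven occupancy thresholds tested in the text.
thresholds = [0.0, 0.5, 0.75, 0.85, 0.9, 0.95, 1.0]
kept_counts = {t: len(filter_by_occupancy(loci, 4, t)) for t in thresholds}
```

Raising the threshold reduces missing data but also discards loci, which is the trade-off behind choosing the 75% matrix.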
Tree files
The files '75p_method.treefile' contain phylogenomic trees constructed based on the UCE matrix.
The file '75p_conca_iqtree.treefile' contains a phylogenomic tree based on an unpartitioned analysis.
An unpartitioned phylogenetic analysis was performed with IQ-TREE v. 2.0. We selected GTR+F+G4 as the model of sequence evolution and performed 1,000 ultrafast bootstrap replicates (UFB) and 1,000 replicates of the SH-like approximate likelihood-ratio test.
The files '75p_by_locus_iqtree.treefile' and '75p_SWSC-EN_iqtree.treefile' contain phylogenomic trees based on, respectively, an analysis partitioned by locus and an analysis partitioned with the sliding-window site characteristics method based on site entropy (SWSC-EN). For both partitioning strategies, we inferred a maximum-likelihood tree using IQ-TREE and the same parameters as for the unpartitioned concatenated analysis.
The file '75p_astral.treefile' contains a coalescent-based phylogenomic tree.
Individual gene trees (one per UCE locus) were estimated with IQ-TREE; for each gene tree, model testing was performed and 1,000 UFB replicates were generated. Across all the gene trees, we collapsed nodes whose UFB support was less than or equal to 10% using Newick utilities. We then fed the collapsed gene trees into ASTRAL-III for species tree inference.
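To make the collapsing step concrete, here is a minimal Python sketch that turns weakly supported internal nodes into polytomies. It is a stand-in for illustration and does not reproduce the Newick utilities commands used for the deposited tree: the Node class and function names are hypothetical, branch lengths are ignored, and the toy tree is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str = ""
    support: Optional[float] = None  # UFB support (%) for internal nodes
    children: List["Node"] = field(default_factory=list)

def collapse_low_support(node, threshold=10.0):
    """Remove internal nodes with support <= threshold.

    The children of a removed node are attached to its parent,
    which creates a polytomy. Works bottom-up (post-order).
    """
    new_children = []
    for child in node.children:
        collapse_low_support(child, threshold)
        is_weak = (
            bool(child.children)
            and child.support is not None
            and child.support <= threshold
        )
        if is_weak:
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node

def to_newick(node):
    """Serialise the topology and support values (no branch lengths)."""
    if not node.children:
        return node.name
    inner = ",".join(to_newick(c) for c in node.children)
    label = "" if node.support is None else f"{node.support:g}"
    return f"({inner}){label}"

# Toy gene tree: the (A,B) clade has 8% support, the (C,D) clade 95%.
root = Node(children=[
    Node(support=8, children=[Node("A"), Node("B")]),
    Node(support=95, children=[Node("C"), Node("D")]),
])
collapsed = to_newick(collapse_low_support(root)) + ";"
```

In this toy case the weakly supported (A,B) clade is dissolved and the well supported (C,D) clade is kept.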
UCE bait sequence file
The file 'hymenoptera-v2-PRINCIPAL-bait-set.fasta.txt' is the UCE bait sequence file.
It contains the bait sequences for all the UCE loci that were targeted during the sequencing of the samples (2,590 loci).
Sharing/Access information
We used data from two Zenodo repositories: https://doi.org/10.5281/zenodo.4341310 and https://doi.org/10.5281/zenodo.4061988.
Code/Software
All software used to produce these files is referenced in the main article.
Methods
We employed ultra-conserved elements (UCE) phylogenomics. We sent tissue samples to Rapid Genomics, Florida, USA, where the DNA was extracted. The genomic libraries were then prepared and enriched using 31,829 baits targeting 2,590 UCE loci conserved across ants (Hymenoptera 2.5Kv2). Sequencing was performed utilising an Illumina NovaSeq 2x150 system. We used one to five individuals per species.
We cleaned and processed the resulting reads using PHYLUCE software (v. 1.7.1). We trimmed raw reads for adapter contamination using Illumiprocessor v. 2.0, which incorporates Trimmomatic. Trimmed reads were assembled de novo into contigs using SPAdes.
Next, we added assemblies for 11 outgroup species that were obtained from previous studies. To identify and extract UCE contigs from the bulk set of contigs for all 47 species, we employed a PHYLUCE program (match_contigs_to_probes) that uses Lastz v. 1.0 to match probe sequences to contig sequences and to create a database of hits. Our min-identity and min-coverage settings were both 80. We aligned each UCE locus using MAFFT v. 7.130b and the default algorithm setting in PHYLUCE (FFT-NS-i). We tested different trimming strategies: (i) no trimming, (ii) edge trimming, and (iii) trimming internal and external alignment regions using GBLOCKS, implemented with reduced stringency parameters (b1:0.5, b2:0.5, b3:12, b4:7). Based on the descriptive statistics generated by PHYLUCE and AMAS, we decided to use the GBLOCKS strategy in further analyses because it achieved a good compromise between the number of informative sites and the percentage of missing data. We then concatenated all the loci to form a supermatrix and filtered the alignments for taxon occupancy, requiring each locus to be found in at least 0, 50, 75, 85, 90, 95, or 100% of the taxa. For each of the seven supermatrices, we generated descriptive statistics using PHYLUCE and AMAS. For further analyses, we decided to use the locus set filtered for 75% taxon occupancy because it achieved a good compromise between the number of informative sites and the percentage of missing data. The resulting matrix contained 2,294 loci (mean length = 1,159 bp), 560,008 informative sites, and a gaps/missing data level of 18.5%.
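The two statistics used to compare matrices can be illustrated with a short Python sketch. It is not the AMAS or PHYLUCE code: it assumes that "informative" means parsimony-informative (at least two character states, each present in at least two taxa) and that only '-' and '?' count as gaps or missing data; ambiguity codes are treated as ordinary states. The toy alignment is invented.

```python
from collections import Counter

MISSING = set("-?")  # assumption: only gaps and '?' count as missing

def missing_percent(matrix):
    """Percentage of cells in the alignment that are gaps or missing."""
    total = sum(len(seq) for seq in matrix.values())
    missing = sum(ch in MISSING for seq in matrix.values() for ch in seq)
    return 100.0 * missing / total

def count_informative_sites(matrix):
    """Count parsimony-informative columns.

    A column qualifies when at least two different states each occur
    in at least two taxa (missing cells are ignored).
    """
    informative = 0
    for column in zip(*matrix.values()):
        counts = Counter(ch.upper() for ch in column if ch not in MISSING)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative

# Toy alignment: 4 taxa, 6 sites.
toy = {
    "taxon_A": "ACGTA-",
    "taxon_B": "ACGTAC",
    "taxon_C": "ATGAAC",
    "taxon_D": "AT?AGC",
}
n_informative = count_informative_sites(toy)
pct_missing = missing_percent(toy)
```

In the toy alignment only the second and fourth columns are informative, and 2 of the 24 cells are missing.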