3RAD datasets used for phylogenomic, species delimitation, biogeography, and introgression analyses on Dugesia from Corsica and Sardinia
Data files
Mar 04, 2025 version files 38.56 GB
-
Dols-Serrate_et_al_MPE_2025.tar.gz
38.56 GB
-
README.md
10.58 KB
Abstract
Speciation is a complex process where many evolutionary forces interplay. The Mediterranean is acknowledged as one of the most relevant biodiverse areas in the Palearctic region and researchers have long studied the species inhabiting it to pursue the goals of evolutionary biology. Here, we study a complex of freshwater flatworm species of the genus Dugesia from Corsica and Sardinia using restriction site-associated DNA sequencing (specifically, 3RAD) data to unravel their evolutionary history and tackle the processes driving it. We assess the phylogenetic relationships and population structure within the group and evaluate new species boundaries using multispecies coalescent approaches. Furthermore, we offer insights into the environmental niche model of the group and use said model to guide our sampling efforts and to collect and present, for the first time, molecular evidence from specimens of Dugesia leporii, a species endemic to Sardinia that was last spotted in 1999. Our results indicate that paleoclimatic conditions rather than microplate tectonic dynamics were likely an important driver of diversification for the Corso-Sardinian group. Furthermore, our results warrant the taxonomic re-evaluation of the group as eight primary species candidates are established based on molecular data. Our study also reveals the first case of interspecific natural hybridization reported in Dugesiidae and, to our knowledge, in Tricladida. Finally, we discuss how this hybridization might constitute a new form of hybrid speciation.
https://doi.org/10.5061/dryad.v15dv4274
Description of the data and file structure
Demultiplexed data and datasets used for phylogenomic, biogeography, and introgression analyses on Dugesia species from Corsica and Sardinia.
Files and variables
File: Dols-Serrate_et_al_MPE_2025.tar.gz
Description: These data were generated using the 3RAD sequencing protocol (details in Supplementary Materials and Methods) and processed with the Stacks v2.52 de novo pipeline (available at: https://catchenlab.life.illinois.edu/stacks/manual/). The dataset includes demultiplexed sequencing data for all 90 samples, representing multiple Dugesia species from Corsica and Sardinia.
Files contained within Dols-Serrate_et_al_MPE_2025.tar.gz:
1) Demultiplexed_data_90sample.tar.gz
A compressed file containing the demultiplexed RAD paired-end reads of all 90 samples sequenced for this study. Samples were processed with process_radtags, a program provided within the modular pipeline of Stacks. The following files are present for each processed sample:
- SAMPLE_ID.1.fq.gz - file containing all retained first reads of paired-end sequencing.
- SAMPLE_ID.2.fq.gz - file containing all retained second reads of paired-end sequencing.
- SAMPLE_ID.rem.1.fq.gz - file containing the remainder first reads, i.e. first reads whose paired second read was discarded due to low quality or a missing restriction enzyme cut site.
- SAMPLE_ID.rem.2.fq.gz - file containing the remainder second reads, i.e. second reads whose paired first read was discarded due to low quality or a missing restriction enzyme cut site.
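As a quick integrity check after unpacking, the number of reads in each FASTQ file can be counted; the .1 and .2 files of a sample would be expected to report the same number. The following is only a minimal sketch and not part of the original pipeline: the function name count_reads and the demo file are hypothetical, and it assumes standard four-line FASTQ records.

```shell
# count_reads FILE.fq.gz: print the number of reads,
# assuming every FASTQ record spans exactly 4 lines.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Tiny two-read demo file (made-up reads) so the sketch runs on its own.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > DEMO.1.fq.gz
count_reads DEMO.1.fq.gz
```

On the real data, the function would be called on SAMPLE_ID.1.fq.gz and SAMPLE_ID.2.fq.gz, with SAMPLE_ID replaced by an actual sample name.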
2) popmap_10x_63.tsv
Popmap file containing the 63 samples that were used for downstream analyses and their assignment to a population. Sample IDs and population assignments are written in two tab-separated columns as follows:
SAMPLE_ID1<tab>POPULATION_ACRONYM1
SAMPLE_ID2<tab>POPULATION_ACRONYM1
SAMPLE_ID3<tab>POPULATION_ACRONYM2
[...] <tab> [...]
This is an example of the first five lines of the popmap_10x_63.tsv document:
MR1286_20 UC
MR1286_25 UC
MR1286_28 UC
MR1286_29 UC
MR1287_16 AI
The POPULATION_ACRONYM variable within the popmap file identifies the locality from which each sample was collected. Downstream analyses often require data filtering based, for the most part, on two key parameters of the populations program (integrated within the modular pipeline of Stacks), namely -r and -p.
- -r - controls the minimum percentage of individuals in a population required to process a locus for that population.
- -p - controls the minimum number of populations a locus must be present in to process a locus.
Because both parameters depend on how samples are grouped, it is important to ensure that the popmap file has no typos or misspecified populations. In total, there are 17 population acronyms plus the acronym for the outgroup (Dugesia hepta, HEP), whose samples were treated as a unified population. The following list showcases the established acronyms and the population they refer to:
[Locality name, Island] [Population acronym] [Coordinates]
Bunnari, Sardinia BU 40°42'05"N, 8°35'30"E
Mascari, Sardinia MA 40°41'60"N, 8°35'23"E
Mulinu, Sardinia MU 40°24'25"N, 8°37'35"E
Monte Cidro, Sardinia MC 39°27'03.5"N, 8°28'27.3"E
Fluminimaggiore, Sardinia FL 39°26'11.4"N, 8°30'35.1"E
Sa Carcaredda, Sardinia SC 40°00'04.4"N, 9°25'16.0"E
Monte Albo, Sardinia MT 40°34'30"N, 9°40'31"E
Silis, Sardinia SI 40°45'24.0"N, 8°43'28.0"E
U Crocolli, Corsica UC 42°03'31.9"N, 8°57'54.5"E
Aiaccio, Corsica AI 41°58'25.0"N, 8°49'53.4"E
Corte, Corsica CT 42°18'18.3"N, 9°09'55.9"E
Camping Campita, Corsica CC 42°23'34.2"N, 9°10'34.8"E
Setti Polli, Corsica SP 41°55'43.9"N, 8°52'15.6"E
Calvi, Corsica CA 42°29'10.8"N, 8°48'09.2"E
Montegrosso, Corsica MG 42°32'17.4"N, 8°50'38.5"E
Barchetta, Corsica BA 42°30'24.2"N, 9°22'27.3"E
Turghja, Corsica TG 41°51'09.1"N, 8°58'00.1"E
#############################[Outgroup]##################################
Dugesia_hepta HEP Collected across BU, MA & SI
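Since a typo in the popmap silently changes how -r and -p behave, it may be worth checking the file before running populations. The sketch below is one possible check, not a procedure described in the manuscript: the function names check_popmap and count_pops and the demo file name are hypothetical, and the demo file simply reuses the five example lines shown above.

```shell
# Demo popmap built from the five example lines of popmap_10x_63.tsv.
printf 'MR1286_20\tUC\nMR1286_25\tUC\nMR1286_28\tUC\nMR1286_29\tUC\nMR1287_16\tAI\n' > popmap_demo.tsv

# check_popmap FILE: report lines without exactly two non-empty
# tab-separated fields; exit status is non-zero if any are found.
check_popmap() {
    awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" {
        printf "line %d malformed: %s\n", NR, $0; bad = 1
    } END { exit bad }' "$1"
}

# count_pops FILE: print each population acronym and its number of samples.
count_pops() {
    cut -f 2 "$1" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

check_popmap popmap_demo.tsv && count_pops popmap_demo.tsv
```

Run on the full popmap_10x_63.tsv, the second function should list exactly the 17 locality acronyms plus HEP; any additional acronym would point to a typo.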
As an important note, the readership should be aware that Stacks requires a modified popmap file for phylogenetic tree inference. To obtain a tree in which every sample is represented, each sample must constitute its very own 'population', as showcased in the example below:
MR1286_20 UC1286_20
MR1286_25 UC1286_25
MR1286_28 UC1286_28
MR1286_29 UC1286_29
MR1287_16 AI1287_16
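A per-sample popmap of this kind can be derived from the regular popmap rather than typed by hand. The sketch below is an illustration under assumptions: it infers the naming convention from the example above (population acronym followed by the sample ID without its leading "MR"), and the demo file names are hypothetical.

```shell
# Demo popmap built from the five example lines of popmap_10x_63.tsv.
printf 'MR1286_20\tUC\nMR1286_25\tUC\nMR1286_28\tUC\nMR1286_29\tUC\nMR1287_16\tAI\n' > popmap_demo.tsv

# Append the sample ID (minus a leading "MR", if present) to the population
# acronym, so that every sample ends up in a 'population' of its own.
awk -F'\t' 'BEGIN { OFS = "\t" } { id = $1; sub(/^MR/, "", id); print $1, $2 id }' \
    popmap_demo.tsv > popmap_demo_per_sample.tsv

cat popmap_demo_per_sample.tsv
```

The only property that matters here is that each label in the second column is unique; the exact label format is a matter of convenience for reading the resulting tree.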
But then, what about -r and -p if each sample constitutes a population? Fear not, the solution is simple. The readership should first filter the loci using the regular popmap_10x_63.tsv file. After that, using the following UNIX command, they will build a whitelist by processing one of the output files from populations:
grep -v "^#" populations.sumstats.tsv | cut -f 1 | sort | uniq > whitelist.tsv
The IDs of all loci that passed filtering will be stored in the whitelist.tsv file. The readership will then have to run populations again, this time with the modified popmap file, instructing the program to use the newly created whitelist with the flag -W (path to a file containing whitelisted loci to include in the export) and specifying the desired output format. For a phylogenetic tree, this is usually the flag --phylip-var-all. All available output formats of populations are listed at https://catchenlab.life.illinois.edu/stacks/comp/populations.php
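Put together, the two-pass procedure could look roughly like the sketch below. It is not the exact command line used in the manuscript: the function name two_pass_populations, the output directories pass1 and pass2, and all argument values are hypothetical, and the sketch assumes the standard populations options -P (input directory), -M (popmap) and -O (output directory) alongside the -r, -p, -W and --phylip-var-all flags discussed above.

```shell
# two_pass_populations STACKS_DIR POPMAP PER_SAMPLE_POPMAP R P
# STACKS_DIR is a folder with Stacks output (e.g. one of the dataset folders);
# R and P are the values to pass to -r and -p. All are supplied by the caller.
two_pass_populations() {
    stacks_dir=$1; popmap=$2; sample_popmap=$3; r=$4; p=$5

    mkdir -p pass1 pass2

    # Pass 1: filter loci with the locality-based popmap.
    populations -P "$stacks_dir" -M "$popmap" -O pass1 -r "$r" -p "$p" || return 1

    # Collect the IDs of the loci that survived filtering (first column).
    grep -v "^#" pass1/populations.sumstats.tsv | cut -f 1 | sort -u > whitelist.tsv

    # Pass 2: one 'population' per sample, restricted to the whitelisted loci.
    populations -P "$stacks_dir" -M "$sample_popmap" -O pass2 \
        -W whitelist.tsv --phylip-var-all
}
```

Keeping the two runs in separate output directories avoids the second pass overwriting populations.sumstats.tsv from the first. The filtering values and any additional options (such as a blacklist) should be taken from the manuscript's Materials and Methods rather than from this sketch.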
3) Folders /Dryad_OP, /Dryad_DF, and /Dryad_OVM
Each folder contains all intermediate datasets and loci catalogs produced from running the denovo_map.pl wrapper provided with the modular pipeline of Stacks, along with their corresponding blacklisted loci, which are used to filter loci and SNPs for downstream analyses. Assembly parameters need to be specified when using the de novo pipeline. There are three basic assembly parameters that the readership will care about:
- m - minimum number of identical reads to establish a stack (an equivalent to an allele).
- M - number of mismatches allowed between stacks within individuals to identify individual loci.
- n - number of mismatches allowed between stacks between individuals to identify population or metapopulation loci.
Each of the dataset folders provided here comprises:
- The /Dryad_OP folder includes loci assembled using the de novo RAD loci assembly parameters m2 M1 n2 and all associated output files.
- The /Dryad_DF folder contains loci assembled with the parameters m3 M2 n1 and all their associated output files.
- The /Dryad_OVM folder contains loci assembled with the parameters m3 M5 n6 and all their associated output files.
Blacklists were generated by filtering the catalog.fa.gz file from each RAD loci dataset (OP, DF, and OVM) to remove contaminants (see Materials and Methods and Supplementary Materials and Methods). These blacklists are labeled as blacklist_*.tsv. To reproduce any desired analysis, it is important to ensure that all Stacks output files provided within a given dataset folder are not modified or stored in different subfolders.
Each sample within each data folder presents the following datasets:
- SAMPLE_ID.tags.tsv.gz - assembled loci for a given sample.
- SAMPLE_ID.snps.tsv.gz - model calls from each locus within a given sample.
- SAMPLE_ID.alleles.tsv.gz - haplotypes/alleles recorded from each locus within a given sample.
- SAMPLE_ID.matches.tsv.gz - matches of loci from a given sample to the loci catalog created from all processed samples.
- SAMPLE_ID.matches.bam - sorted and collated matches to the catalog.
In addition, each folder contains the following catalog-level datasets:
- catalog.fa.gz - consensus catalog loci containing the consensus sequence for each locus produced by each processed sample.
- catalog.calls - file containing the output of the SNP (single nucleotide polymorphism) calling model for each nucleotide in the analysis.
- catalog.snps.tsv.gz - model calls from each locus in the analysis.
- catalog.tags.tsv.gz - assembled loci in the analysis.
- catalog.alleles.tsv.gz - haplotypes/alleles recorded in the analysis.
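Because catalog.fa.gz is a gzip-compressed FASTA file, the number of catalog loci in a dataset folder can be read off by counting header lines, without modifying the file. The sketch below is a minimal illustration: the function name count_loci and the demo catalog are hypothetical, and it assumes one FASTA record per locus.

```shell
# count_loci CATALOG.fa.gz: number of FASTA records (header lines start with ">").
count_loci() {
    gzip -dc "$1" | grep -c '^>'
}

# Made-up two-locus catalog so the sketch runs on its own.
printf '>1\nACGT\n>2\nTTGA\n' | gzip -c > demo_catalog.fa.gz
count_loci demo_catalog.fa.gz
```

Calling the function on the catalog.fa.gz of each of the three folders (OP, DF, and OVM) gives a quick comparison of how the assembly parameters affect the number of loci.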
Detailed instructions on how these datasets were obtained can be found in the Materials and Methods and Supplementary Materials and Methods sections of the manuscript "Mixed, not stirred: Genomic data confirm the first case of interspecific hybridization in planarian triclads (Platyhelminthes: Tricladida) and raise questions about a possibly novel form of hybrid speciation" (Dols-Serrate et al., 2025). Therein, the readership will find all indications and filtering steps necessary to reproduce the analyses performed with these data. A more thorough explanation of all Stacks output files contained within each data folder (OP, DF, and OVM) can be found at https://catchenlab.life.illinois.edu/stacks/manual/#sfiles as well.
Usage notes
The following is a list of the data types included in this repository, sorted by file suffix, along with brief descriptions of how to work with them. However, it is advisable not to modify any file, especially if its name starts with catalog, to avoid problems with file interpretation in downstream analyses.
Descriptions of the FASTA and FASTQ file formats can be found in the SAMtools file-format specifications page.
.bam: Compressed binary sequence alignment/mapping format. Can be visualized with SAMtools.
.tsv or .tsv.gz: Tab-delimited plain text files, uncompressed or gzip-compressed. These files can be viewed with a pager or plain text editor (e.g. the less or zless UNIX commands, Nano, Vim, etc.).
.fa.gz: Compressed plain text DNA sequence information in FASTA format.
.fq.gz: Compressed plain text DNA sequence information in FASTQ format.
.tar.gz: Compressed archive file format obtained using the command tar -czvf. Can be decompressed using the UNIX command tar -xzvf.
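Given the size of the main archive, it can be useful to list its contents before extracting anything. The sketch below illustrates listing and extraction on a tiny stand-in archive; the names demo_dataset and extracted are hypothetical and only serve to make the example runnable without the full download.

```shell
# Build a tiny stand-in archive so the sketch runs without the real data.
mkdir -p demo_dataset
printf 'MR1286_20\tUC\n' > demo_dataset/popmap_demo.tsv
tar -czf demo_dataset.tar.gz demo_dataset

# List the contents without extracting (-t list, -z gzip, -f archive file).
tar -tzf demo_dataset.tar.gz

# Extract into a chosen directory (-x extract, -C target directory).
mkdir -p extracted
tar -xzf demo_dataset.tar.gz -C extracted
```

The same two commands apply to Dols-Serrate_et_al_MPE_2025.tar.gz and to the nested Demultiplexed_data_90sample.tar.gz; in every case the archive name must directly follow the f option.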
