3RAD datasets used for phylogenomic, species delimitation, biogeography, and introgression analyses on Dugesia from Corsica and Sardinia

Dols-Serrate, Daniel1; Guo, Longhua2; Kruglyak, Leonid3; Riutort, Marta1

Published Mar 04, 2025 on Dryad. https://doi.org/10.5061/dryad.v15dv4274

Data files

Mar 04, 2025 version files 38.56 GB

Dols-Serrate_et_al_MPE_2025.tar.gz (38.56 GB)
README.md (10.58 KB)

Abstract

Speciation is a complex process where many evolutionary forces interplay. The Mediterranean is acknowledged as one of the most relevant biodiverse areas in the Palearctic region and researchers have long studied the species inhabiting it to pursue the goals of evolutionary biology. Here, we study a complex of freshwater flatworm species of the genus Dugesia from Corsica and Sardinia using restriction site-associated DNA sequencing (specifically, 3RAD) data to unravel their evolutionary history and tackle the processes driving it. We assess the phylogenetic relationships and population structure within the group and evaluate new species boundaries using multispecies coalescent approaches. Furthermore, we offer insights into the environmental niche model of the group and use said model to guide our sampling efforts, which allowed us to collect and present, for the first time, molecular evidence from specimens of Dugesia leporii, a species endemic to Sardinia last spotted in 1999. Our results indicate that paleoclimatic conditions rather than microplate tectonic dynamics were likely an important driver of diversification for the Corso-Sardinian group. Furthermore, our results warrant the taxonomic re-evaluation of the group as eight primary species candidates are established based on molecular data. Our study also reveals the first case of interspecific natural hybridization reported in Dugesiidae and, to our knowledge, in Tricladida. Finally, we discuss how this hybridization might constitute a new form of hybrid speciation.

Description of the data and file structure

Demultiplexed data and datasets used for phylogenomic, biogeography, and introgression analyses on Dugesia species from Corsica and Sardinia.

Files and variables

File: `Dols-Serrate_et_al_MPE_2025.tar.gz`

Description: These data were generated using the 3RAD sequencing protocol (details in Supplementary Materials and Methods) and processed with the Stacks v2.52 de novo pipeline (available at: https://catchenlab.life.illinois.edu/stacks/manual/). The dataset includes demultiplexed sequencing data for all 90 samples, representing multiple Dugesia species from Corsica and Sardinia.

Files contained within Dols-Serrate_et_al_MPE_2025.tar.gz:

1) Demultiplexed_data_90sample.tar.gz

A compressed file containing the demultiplexed paired-end RAD reads of all 90 samples sequenced for this study. Samples were processed with process_radtags, a program provided within the modular pipeline of Stacks. The following files are present for each processed sample:

SAMPLE_ID.1.fq.gz - file containing all first reads of paired-end sequencing.
SAMPLE_ID.2.fq.gz - file containing all second reads of paired-end sequencing.
SAMPLE_ID.rem.1.fq.gz - file containing first reads discarded due to low quality or a missing restriction enzyme cut site.
SAMPLE_ID.rem.2.fq.gz - file containing second reads discarded due to low quality or a missing restriction enzyme cut site.
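
As a quick integrity check after extraction, the retained first-read and second-read files of a sample can be compared by read count. The following is a minimal sketch, assuming standard four-line FASTQ records; count_reads and check_pairs are hypothetical helper names and are not part of Stacks.

```shell
# count_reads FILE.fq.gz: print the number of reads in a gzipped FASTQ file.
# Assumption: four lines per record (no wrapped sequence lines).
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# check_pairs SAMPLE_ID [DIR]: compare the read counts of SAMPLE_ID.1.fq.gz
# and SAMPLE_ID.2.fq.gz in DIR (default: current directory).
check_pairs() {
    dir=${2:-.}
    n1=$(count_reads "$dir/$1.1.fq.gz")
    n2=$(count_reads "$dir/$1.2.fq.gz")
    if [ "$n1" = "$n2" ]; then
        echo "$1: $n1 read pairs"
    else
        echo "$1: mismatch ($n1 first reads, $n2 second reads)" >&2
        return 1
    fi
}
```

For example, check_pairs MR1286_20 run in the directory holding the extracted files reports the number of read pairs for that sample, or returns a non-zero status if the two files disagree.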

2) popmap_10x_63.tsv

Popmap file containing the 63 samples used for downstream analyses and the population each one belongs to. Sample IDs and population affiliations are listed in two tab-separated columns as follows:

SAMPLE_ID1<tab>POPULATION_ACRONYM1
SAMPLE_ID2<tab>POPULATION_ACRONYM1
SAMPLE_ID3<tab>POPULATION_ACRONYM2
   [...]  <tab>  [...]

This is an example of the first five lines of the popmap_10x_63.tsv document:

MR1286_20       UC
MR1286_25       UC
MR1286_28       UC
MR1286_29       UC
MR1287_16       AI

The POPULATION_ACRONYM variable in the popmap file identifies the locality where each sample was collected. Downstream analyses often require filtering the data, mostly through two key parameters of the populations program (part of the modular Stacks pipeline), namely -r and -p.

-r - controls the minimum percentage of individuals in a population required to process a locus for that population.
-p - controls the minimum number of populations a locus must be present in to process a locus.

Thus, it is important to ensure that the popmap file contains no typos or misspecified populations. In total, there are 17 population acronyms plus the acronym for the outgroup (Dugesia hepta, HEP), whose samples were treated as a single population. The following list shows the acronyms and the populations they refer to:

[Locality name, Island]     [Population acronym]  [Coordinates]
Bunnari, Sardinia           BU                    40°42'05"N, 8°35'30"E
Mascari, Sardinia           MA                    40°41'60"N, 8°35'23"E
Mulinu, Sardinia            MU                    40°24'25"N, 8°37'35"E
Monte Cidro, Sardinia       MC                    39°27'03.5"N, 8°28'27.3"E
Fluminimaggiore, Sardinia   FL                    39°26'11.4"N, 8°30'35.1"E
Sa Carcaredda, Sardinia     SC                    40°00'04.4"N, 9°25'16.0"E
Monte Albo, Sardinia        MT                    40°34'30"N, 9°40'31"E
Silis, Sardinia             SI                    40°45'24.0"N, 8°43'28.0"E
U Crocolli, Corsica         UC                    42°03'31.9"N, 8°57'54.5"E
Aiaccio, Corsica            AI                    41°58'25.0"N, 8°49'53.4"E
Corte, Corsica              CT                    42°18'18.3"N, 9°09'55.9"E
Camping Campita, Corsica    CC                    42°23'34.2"N, 9°10'34.8"E
Setti Polli, Corsica        SP                    41°55'43.9"N, 8°52'15.6"E
Calvi, Corsica              CA                    42°29'10.8"N, 8°48'09.2"E
Montegrosso, Corsica        MG                    42°32'17.4"N, 8°50'38.5"E
Barchetta, Corsica          BA                    42°30'24.2"N, 9°22'27.3"E
Turghja, Corsica            TG                    41°51'09.1"N, 8°58'00.1"E
[Outgroup]
Dugesia hepta               HEP                   Collected across BU, MA & SI
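
One way to check the popmap for typos is to tally the samples per acronym and compare the result with the list above. The following is a minimal sketch, assuming the file is strictly tab-separated; popmap_summary is a hypothetical helper name.

```shell
# popmap_summary FILE: print "ACRONYM<tab>N_SAMPLES" for each population,
# sorted by acronym. Lines that do not have exactly two non-empty
# tab-separated fields are reported on stderr and skipped.
popmap_summary() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf("line %d: expected two tab-separated fields\n", NR) > "/dev/stderr"
            next
        }
        { n[$2]++ }
        END { for (pop in n) printf("%s\t%d\n", pop, n[pop]) }
    ' "$1" | sort
}
```

Running popmap_summary popmap_10x_63.tsv should list 18 acronyms (the 17 localities plus HEP) whose counts add up to 63. An unexpected acronym or a warning on stderr points to a typo.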

Note that phylogenetic tree inference requires a small change to the popmap file. The populations program exports phylogenetic alignments per population, so to obtain a tree in which every sample is represented, each sample must be assigned to its own 'population', as in the example below:

MR1286_20       UC1286_20
MR1286_25       UC1286_25
MR1286_28       UC1286_28
MR1286_29       UC1286_29
MR1287_16       AI1287_16
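
The modified popmap can be derived from the regular one rather than edited by hand. The following is a minimal sketch; popmap_per_sample is a hypothetical helper name, and the naming pattern (acronym followed by the sample ID without its leading "MR") is inferred from the example above. Any label that is unique per sample would serve the same purpose.

```shell
# popmap_per_sample FILE: print a popmap in which each sample forms its own
# population, named ACRONYM + sample ID with a leading "MR" removed.
# Assumption: FILE has two tab-separated columns (sample ID, acronym);
# sample IDs that do not start with "MR" are appended unchanged.
popmap_per_sample() {
    awk -F '\t' -v OFS='\t' 'NF == 2 {
        id = $1
        sub(/^MR/, "", id)
        print $1, $2 id
    }' "$1"
}
```

For example, popmap_per_sample popmap_10x_63.tsv > popmap_per_sample.tsv writes the modified popmap to a new file (the output file name is only a placeholder) and leaves the original untouched.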

How, then, should -r and -p be applied if each sample constitutes a population? First, filter the loci using the regular popmap_10x_63.tsv file. Then build a whitelist from one of the populations output files with the following UNIX command:

grep -v "^#" populations.sumstats.tsv | cut -f 1 | sort | uniq > whitelist.tsv

The IDs of all loci that passed the filters will be stored in the whitelist.tsv file. Then run populations again, this time with the modified popmap file, passing the newly created whitelist with the -W flag (path to a file containing whitelisted loci to include in the export) and specifying the desired output format. For a phylogenetic tree, this is usually the --phylip-var-all flag. All output formats supported by populations are listed at https://catchenlab.life.illinois.edu/stacks/comp/populations.php
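
The whole procedure can be sketched as a single shell function. This is an illustration under stated assumptions, not the exact commands used in the study: tree_export, the filter/ and tree/ subfolder names, and the argument order are hypothetical, and the -r and -p values must be taken from the manuscript. The -P, -M, -O, -r, -p, -W and --phylip-var-all options belong to the populations program.

```shell
# tree_export STACKS_DIR POPMAP PER_SAMPLE_POPMAP OUT_DIR R P
# Hypothetical wrapper; requires the Stacks populations program on PATH.
#   STACKS_DIR         folder with the Stacks output files (e.g. a dataset folder)
#   POPMAP             regular popmap (samples grouped by locality)
#   PER_SAMPLE_POPMAP  modified popmap (each sample is its own population)
#   R, P               values passed to -r and -p in the filtering run
tree_export() {
    stacks_dir=$1; popmap=$2; sample_popmap=$3; out_dir=$4; r=$5; p=$6
    mkdir -p "$out_dir/filter" "$out_dir/tree" || return 1

    # Run 1: filter loci using the locality-based popmap.
    populations -P "$stacks_dir" -M "$popmap" -O "$out_dir/filter" \
        -r "$r" -p "$p" || return 1

    # Whitelist: IDs (first column) of all loci that passed the filters.
    grep -v "^#" "$out_dir/filter/populations.sumstats.tsv" \
        | cut -f 1 | sort | uniq > "$out_dir/whitelist.tsv"

    # Run 2: export only the whitelisted loci, one 'population' per sample.
    populations -P "$stacks_dir" -M "$sample_popmap" -O "$out_dir/tree" \
        -W "$out_dir/whitelist.tsv" --phylip-var-all
}
```

Keeping the two runs in separate output folders prevents the second run from overwriting the populations.sumstats.tsv file that the whitelist was built from.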

3) Folders /Dryad_OP, /Dryad_DF, and /Dryad_OVM

Each folder contains all intermediate datasets and loci catalogs produced by running the denovo_map.pl wrapper provided with the modular pipeline of Stacks, along with the corresponding blacklisted loci, which are used to filter loci and SNPs for downstream analyses. Assembly parameters must be specified when using the de novo pipeline. The three basic assembly parameters are:

m - minimum number of identical reads to establish a stack (an equivalent to an allele).
M - number of mismatches allowed between stacks within individuals to identify individual loci.
n - number of mismatches allowed between stacks between individuals to identify population or metapopulation loci.

The dataset folders provided here are:

The /Dryad_OP folder includes loci assembled using the de novo RAD loci assembly parameters m2 M1 n2 and all associated output files.
The /Dryad_DF folder contains loci assembled with the parameters m3 M2 n1 and all their associated output files.
The /Dryad_OVM folder contains loci assembled with the parameters m3 M5 n6 and all their associated output files.

Blacklists were generated by filtering the catalog.fa.gz file from each RAD loci dataset (OP, DF, and OVM) to remove contaminants (see Materials and Methods and Supplementary Materials and Methods). These blacklists are labeled as blacklist_*.tsv. To reproduce any analysis, make sure that the Stacks output files provided within a given dataset folder are neither modified nor moved to different subfolders.

Each dataset folder contains the following files for each sample:

SAMPLE_ID.tags.tsv.gz - assembled loci for a given sample.
SAMPLE_ID.snps.tsv.gz - model calls from each locus within a given sample.
SAMPLE_ID.alleles.tsv.gz - haplotypes/alleles recorded from each locus within a given sample.
SAMPLE_ID.matches.tsv.gz - matches of loci from a given sample to the loci catalog created from all processed samples.
SAMPLE_ID.matches.bam - sorted and collated matches to the catalog.

In addition, each folder contains the following catalog files:

catalog.fa.gz - catalog loci in FASTA format, with the consensus sequence of each locus built from all processed samples.
catalog.calls - file containing the output of the SNP (single nucleotide polymorphism) calling model for each nucleotide in the analysis.
catalog.snps.tsv.gz - model calls from each locus in the analysis.
catalog.tags.tsv.gz - assembled loci in the analysis.
catalog.alleles.tsv.gz - haplotypes/alleles recorded in the analysis.
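
Since catalog.fa.gz holds one FASTA record per catalog locus, counting its header lines gives the number of loci in a dataset without modifying any file. The following is a minimal sketch; count_catalog_loci is a hypothetical helper name.

```shell
# count_catalog_loci DIR: print the number of FASTA records (catalog loci)
# in DIR/catalog.fa.gz. The file is only read, never modified.
count_catalog_loci() {
    gzip -dc "$1/catalog.fa.gz" | grep -c '^>'
}
```

For example, count_catalog_loci Dryad_OP prints the catalog size of the OP dataset. Repeating the call for Dryad_DF and Dryad_OVM allows the three assemblies to be compared.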

Detailed instructions on how these datasets were obtained can be found in the Materials and Methods and Supplementary Materials and Methods sections of the manuscript "Mixed, not stirred: Genomic data confirm the first case of interspecific hybridization in planarian triclads (Platyhelminthes: Tricladida) and raise questions about a possibly novel form of hybrid speciation" (Dols-Serrate et al., 2025). The manuscript describes all the filtering steps needed to reproduce the analyses performed with these data. A more thorough explanation of the Stacks output files contained within each data folder (OP, DF, and OVM) can be found at https://catchenlab.life.illinois.edu/stacks/manual/#sfiles as well.

Usage notes

The following list describes the data types included in this repository, sorted by file suffix, with brief notes on how to work with them. It is advisable not to modify any file, especially those whose names start with catalog, to avoid problems with file interpretation in downstream analyses.

Descriptions of the FASTA and FASTQ file formats can be found on the SAMtools file-format specifications page.

.bam: Compressed binary sequence alignment/mapping format. Can be visualized with SAMtools.

.tsv or .tsv.gz: tab-delimited plain text files. These files can be viewed with a pager or plain text editor (e.g. the less or zless UNIX commands, Nano, Vim, etc.).

.fa.gz: Compressed plain text DNA sequence information in FASTA format.

.fq.gz: Compressed plain text DNA sequence information in FASTQ format.

.tar.gz: Compressed archive file format created with the command tar -czvf. Can be decompressed using the UNIX command tar -xzvf.
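
The archives and compressed tables can be inspected without extracting or altering them. The following is a minimal sketch using standard tar, gzip and head; list_archive and peek_tsv are hypothetical helper names.

```shell
# list_archive FILE.tar.gz: list the archive contents without extracting.
# Note that tar still has to read through the whole compressed archive.
list_archive() {
    tar -tzf "$1"
}

# peek_tsv FILE.tsv.gz [N]: print the first N lines (default 10) of a
# gzipped tab-delimited file without decompressing it on disk.
peek_tsv() {
    gzip -dc "$1" | head -n "${2:-10}"
}
```

For example, list_archive Dols-Serrate_et_al_MPE_2025.tar.gz shows what the 38.56 GB archive contains before committing disk space to it, and peek_tsv catalog.snps.tsv.gz 5 prints the first five lines of that file.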