Dryad

Files for phylogenetic analyses and molecular diagnosis between Diplomystidae catfish species

Cite this dataset

Muñoz-Ramírez, Carlos Patricio (2023). Files for phylogenetic analyses and molecular diagnosis between Diplomystidae catfish species [Dataset]. Dryad. https://doi.org/10.5061/dryad.hqbzkh1nb

Abstract

Diplomystidae is an early-diverged family of freshwater catfish endemic to southern South America. We have recently collected five juvenile specimens belonging to this family from the Bueno River Basin, a basin for which the only previous record was a single juvenile specimen collected in 1996. This finding confirms the distribution of the family farther south in northern Patagonia, but poses new questions about the origin of this population in an area with a strong glacial history. We used phylogenetic analyses to evaluate three different hypotheses that could explain the origin of this population in the basin. First, the population could have originated in Atlantic basins (east of the Andes) and dispersed to the Bueno Basin after the Last Glacial Maximum (LGM) via river reversals, as has been proposed for other populations of Diplomystes as well as for other freshwater species from Patagonia. Second, the population could have originated in the geographically close Valdivia Basin (west of the Andes) and dispersed south to its current location in the Bueno Basin. Third, regardless of its geographic origin (west or east of the Andes), the Bueno Basin population could have a longer history in the basin, surviving in situ through the LGM. In addition, we conducted species delimitation analyses using a recently developed method that uses a protracted model of speciation. Our goal was to test the species status of the Bueno Basin population along with another controversial population in Central Chile (Biobío Basin), which appeared highly divergent in previous studies with mtDNA. The phylogenetic analyses showed that the population from the Bueno Basin is more closely related to Atlantic than to Pacific lineages, although with a deep divergence that predated the LGM, supporting in situ survival rather than postglacial dispersal. In addition, these analyses showed that the species D. nahuelbutaensis is polyphyletic, supporting the need for a taxonomic reevaluation.
The species delimitation analyses supported two new species, which are described using molecular diagnostic characters: Diplomystes arratiae sp. nov. from the Biobío, Carampangue, and Laraquete basins, maintaining D. nahuelbutaensis valid only for the Imperial Basin, and Diplomystes habitae sp. nov. from the Bueno Basin. This study greatly increases the number of species known both within the family Diplomystidae and from Patagonia, and contributes substantially to the knowledge of the evolution of southern South American freshwater biodiversity during its glacial history. Given the important contribution to the phylogenetic diversity of the family, we recommend a high conservation priority for both new species. Finally, this study highlights an exemplary scenario where species descriptions based only on DNA data are particularly valuable, bringing additional elements to the ongoing debate on DNA-based taxonomy.

README: Files for phylogenetic analyses and molecular diagnosis between Diplomystidae catfish species

This dataset contains the files needed to conduct species tree analyses with starBEAST (a prerequisite for the DELINEATE analyses) and to conduct the molecular diagnostic analyses.
The results of these analyses are presented in the figures and tables of a manuscript.

Description of the data and file structure

These are several files used to conduct different types of analyses.

cr_cb_gh_s7_monophyly_prior.xml:
Input file to run BEAST2, using the starBEAST3 package to estimate a dated species tree. The file is set to force the monophyly of a pair of populations (Baker and Chubut) (monophyly prior).

cr_cb_gh_s7_no_monophyly_prior.xml:
Input file to run BEAST2, using the starBEAST3 package to estimate a dated species tree. The monophyly of a pair of populations (Baker and Chubut) is not constrained (no monophyly prior).

211_Diplos_RC.fas:
Alignment file with 211 sequences of a portion of the mitochondrial control region representing individuals of the catfish family Diplomystidae from 15 river basins. This file is used for the molecular diagnosis analysis, called from the R script file Diagnostic_nucleotides.R.

212_Diplos_CytB.fas:
Alignment file with 212 sequences of a portion of the mitochondrial cytochrome b gene representing individuals of the catfish family Diplomystidae from 15 river basins. This file is used in the molecular diagnosis analysis, called from the R script file Diagnostic_nucleotides.R.

S723.fas:
Alignment file with 23 sequences of a portion of the nuclear S7 gene (intron 1) representing individuals of the catfish family Diplomystidae from 15 river basins. This file is used in the molecular diagnosis analysis, called from the R script file Diagnostic_nucleotides.R.

GH23.fas:
Alignment file with 23 sequences of portions of the nuclear growth hormone (GH) gene representing individuals of the catfish family Diplomystidae from 15 river basins. This file is used in the molecular diagnosis analysis, called from the R script file Diagnostic_nucleotides.R.

Diagnostic_nucleotides.R:
R script file with commands used to run the molecular diagnostic analyses.

vouchers_for_DiagNuc_CR.csv:
Data frame containing the grouping factors for the control region sequence alignment, needed to run the DNA diagnostic analyses in the R script Diagnostic_nucleotides.R.

vouchers_for_DiagNuc_cytb.csv:
Data frame containing the grouping factors for the cytochrome b sequence alignment, needed to run the DNA diagnostic analyses in the R script Diagnostic_nucleotides.R.

nucleares_vouchers.csv:
Data frame containing the grouping factors for the nuclear sequence alignments, needed to run the DNA diagnostic analyses in the R script Diagnostic_nucleotides.R.
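Before running any analysis, it can be useful to confirm that each alignment holds the number of sequences stated above (211, 212, or 23) and that all sequences have the same length. The following is a minimal Python sketch of such a check. It is not part of the dataset, the function names read_fasta and check_alignment are hypothetical, and the toy sequences are made up for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text from a file-like object into {name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may be wrapped over several lines; collect the pieces.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def check_alignment(records, expected_count):
    """Check sequence count and equal lengths; return the alignment length."""
    if len(records) != expected_count:
        raise ValueError(
            f"expected {expected_count} sequences, found {len(records)}"
        )
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return lengths.pop()


# Toy in-memory example (made-up names and bases, not real data).
toy = StringIO(">ind1\nACGT\nAC\n>ind2\nACGTAA\n")
toy_records = read_fasta(toy)
toy_length = check_alignment(toy_records, expected_count=2)
print(len(toy_records), toy_length)
```

For a real file, the same functions could be applied to an opened file, for example by passing the handle of 211_Diplos_RC.fas to read_fasta and then calling check_alignment with expected_count=211.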

Sharing/Access information

Sequence data can be accessed from GenBank; the accession numbers are listed in Supplementary Material Table S1 of the manuscript.

Code/Software

Commands, functions, and packages used in the R environment are described in detail in the file Diagnostic_nucleotides.R.

Methods

2. MATERIAL AND METHODS

2.1 Sampling

A total of eight individuals from the Bueno Basin were collected during the summers of 2019 (n = 3) and 2021 (n = 5) from two different sections of the Bueno River (-40.2528, -72.6178 and -40.2561, -72.5942, respectively) near the outlet of Ranco Lake (figure 1). Specimens from 2019 were released alive after a small piece of the adipose fin was collected by clipping. Specimens from 2021 were kept as vouchers. All the vouchers show a long and continuous lateral line typical of the genus Diplomystes, so they were assigned to this genus following Arratia & Quezada-Romegialli (2017). Additionally, three individuals from the coastal basins Laraquete and Carampangue (Muñoz-Ramírez et al., 2020), genetically close to populations from the Biobío Basin, were included to increase the data from the Biobío lineage for the species delimitation analysis. Given their geographical and genetic proximity, they were treated for all purposes as part of the Biobío Basin.

2.2 Lab protocols and molecular data

A small tissue sample from the adipose fin or muscle of each specimen was used for DNA extraction, which was conducted using the DNeasy Tissue Kit (QIAGEN Inc., Chatsworth CA) following the manufacturer’s protocol. Two mitochondrial DNA regions, cytochrome b (cytB) and the control region (CR), and portions of two nuclear genes, GH (growth hormone, exons three through five and introns three and four) and S7 (intron one), were amplified for all individuals following Muñoz-Ramírez et al. (2014). These newly generated data were complemented with published data from Muñoz-Ramírez et al. (2014, 2020) to conduct all analyses (see Supplementary Material Table S1 for additional details and GenBank accession numbers).

2.3 Phylogenetic estimation

The phylogenetic position of the Bueno population was estimated in two ways. First, a Maximum Likelihood genealogy was estimated using RAxML-NG v. 1.1.0 (Kozlov, 2019) with the mtDNA regions cytB and CR concatenated. Each mtDNA region was treated as a separate partition with its own model of molecular evolution. Best-fit models of molecular evolution were chosen by Modeltest-NG (Darriba et al. 2020) using the Bayesian information criterion (BIC). The RAxML-NG analysis was run with 1000 nonparametric bootstrap replicates, followed by a search for the best-scoring ML tree.

Second, we estimated a population tree using the coalescent method StarBeast3 (Douglas, Jiménez-Silva, & Bouckaert, 2022), implemented as part of the software BEAST2 v2.6.6.0 (Bouckaert et al., 2014), using both the mtDNA and nuclear data (three independent loci). The cytB and CR gene regions were linked as one single locus (mtDNA), but analyzed as two separate partitions for the site model estimations. Each nuclear locus, GH and S7, was analyzed as a separate locus. A total of 22 individuals were assigned to populations represented by Andean basins, with the sole exception of the Rapel and Mataquito basins, which were merged as one single population given recent evidence of panmixia (Muñoz-Ramírez, Victoriano, & Habit, 2015). The analysis was run specifying models of molecular evolution based on results from Modeltest-NG (Darriba et al., 2020). The strict clock model was selected over the relaxed clock model via the Bayes Factor method (Kass & Raftery, 1995) after obtaining the marginal likelihoods by the Path Sampling method (Baele, Li, Drummond, Suchard, & Lemey, 2012). A Yule prior was used for the species tree. To estimate dates for each node, we used a calibration point for the divergence between the Chubut and the Baker populations based on geological evidence for the reversal of the Chubut Basin after ice-dam collapse between 12.6–11.7 ka (Benito & Thorndycraft, 2020). A normal distribution with mean 0.01215 Ma (12.15 ka, the midpoint of that interval) and a standard deviation of 2.5E-4 Ma (0.25 ka), with fixed monophyly for the Chubut-Baker clade, was used as prior. The MCMC chains were run for 500 million generations, sampling every 50,000 generations to produce 10,000 sampled values. Parameter estimates were checked for convergence in Tracer v1.4 (Rambaut and Drummond, 2007), discarding the first 20% of the samples as burn-in. The species tree was finally obtained by summarizing the sampled trees in the TreeAnnotator application, removing the first 20% of trees as burn-in.
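The numbers in the calibration and MCMC settings can be cross-checked with a few lines of arithmetic. The sketch below, in Python, assumes that the prior mean of 0.01215 and the standard deviation of 2.5E-4 are expressed in millions of years, which is the reading under which the mean equals the midpoint of the 12.6–11.7 ka window. The variable names are illustrative and do not come from the XML files.

```python
# Geological window for the Chubut Basin reversal, in thousands of years (ka).
older_ka, younger_ka = 12.6, 11.7

# Prior parameters as given in the text, read here as millions of years (Ma).
# This unit is an assumption of the sketch.
prior_mean_ma = 0.01215
prior_sd_ma = 2.5e-4

KA_PER_MA = 1000.0
midpoint_ka = (older_ka + younger_ka) / 2
prior_mean_ka = prior_mean_ma * KA_PER_MA
prior_sd_ka = prior_sd_ma * KA_PER_MA

# Central 95% interval of a normal prior: mean +/- 1.96 standard deviations.
lower_ka = prior_mean_ka - 1.96 * prior_sd_ka
upper_ka = prior_mean_ka + 1.96 * prior_sd_ka

# MCMC bookkeeping: chain length, sampling interval, and 20% burn-in.
chain_length = 500_000_000
sample_every = 50_000
n_samples = chain_length // sample_every
n_burnin = n_samples * 20 // 100
n_kept = n_samples - n_burnin

print(f"window midpoint: {midpoint_ka:.2f} ka; prior mean: {prior_mean_ka:.2f} ka")
print(f"95% prior interval: {lower_ka:.2f} to {upper_ka:.2f} ka")
print(f"samples: {n_samples}, burn-in: {n_burnin}, kept: {n_kept}")
```

Under these assumptions the central 95% of the prior runs from about 11.66 to 12.64 ka, which closely brackets the geological window, and 8,000 of the 10,000 samples remain after burn-in.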

2.4 Species Delimitation and DNA diagnosis

To test whether the Bueno Basin and Biobío Basin populations belong to known species or represent new species, we used a recently developed speciation-based approach called DELINEATE (Sukumaran, Holder, & Knowles, 2021). Classical approaches to species delimitation based on the Multispecies Coalescent (MSC) identify disruptions in Wright-Fisher panmixia due to gene flow barriers, and assume that these disruptions are explained by species boundaries (Yang & Rannala 2010). As has been noted (Sukumaran & Knowles 2017), however, if there is within-species lineage structuring and the sampled data provide sufficient power, these approaches conflate population units with species units, producing spurious species boundaries as distinct populations are mischaracterized as distinct species. In contrast, although DELINEATE also relies on the MSC, it uses it only to first diagnose population units. It then infers the actual species boundaries using an explicit probabilistic speciation process model that organizes these population units into higher-level species units. The DELINEATE approach accounts for the possibility that some structure does not involve speciation events by modeling the formation of population lineages and their subsequent development into independent species as separate processes. DELINEATE uses information about the current taxonomy of the taxa, as understood by the investigator, to calibrate the model. Specifically, the investigator provides species identities for a subset of the population lineages being considered. These “known” species identities for this subset of populations are used to infer a tempo of speciation, i.e., the rate at which an independent population lineage evolves into a distinct species, for the entire system.
With this, the probabilities of different species delimitation configurations can be calculated and ranked, with the configuration of highest probability constituting the maximum likelihood estimate of the species delimitation. This estimate includes the species assignments provided by the investigator for the subset of population lineages, but it also includes species assignments for the remaining lineages, i.e., the population lineages of uncertain, unknown, or undetermined species affinities. Each of these lineages is assigned a species identity: it may be categorized as a population of a species previously identified by the investigator or, alternatively, as a population of a “new” unnamed species not previously declared by the investigator.

We used the tree inferred by StarBeast3 as the input tree of population lineages required by DELINEATE. For the taxonomic information required, we considered two different schemes. One scheme considered the currently accepted taxonomic view of the family, with three extant species for the Pacific basins and three species for the Atlantic basins (hereafter the Azpelicueta scheme; Azpelicueta, 1994). The second scheme reflects a more conservative taxonomic view that considers all species from Atlantic basins as a single one (hereafter the Arratia scheme; Arratia, 1987). Although the latter is not the prevailing view of Atlantic taxonomic diversity, we used it to test the robustness of the results under a more conservative scenario that would make splitting more difficult (i.e., by decreasing the speciation rate parameter).

Given the uncertainty about the taxonomic status of the Biobío population (Muñoz-Ramírez et al., 2014), only the population from the Imperial Basin was assigned to Diplomystes nahuelbutaensis, because the holotype of this species is from the Imperial Basin, whereas the status of the Biobío population was left to be tested by the analysis. The Baker Basin population was included as part of the Chubut Basin population, and therefore given the same taxonomic status, following Muñoz-Ramírez et al. (2014).

Once the results of the species delimitation analyses were available, molecular diagnostic characters were obtained for DNA diagnosis using the nucDiag function from the R package spider (Brown et al. 2012). All gene regions were used, including the cytB and CR sequence alignments and the nuclear gene alignments, although for the CR, a portion of 90 bp (positions 33-122) was not considered because it contained gaps in most individuals.
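The idea behind diagnostic nucleotides can be illustrated with a small self-contained Python sketch. This is not the spider implementation and not the code in Diagnostic_nucleotides.R. It assumes a simple definition: a site is diagnostic for a group when all members of the group share one base and no sequence outside the group carries that base. The function name diagnostic_sites, the toy alignment, and the handling of gaps and ambiguous bases are all assumptions made for illustration.

```python
def diagnostic_sites(alignment, groups, exclude=()):
    """Return {group: [(position, base), ...]} using 1-based positions.

    alignment: {sequence name: aligned sequence}, all of equal length.
    groups:    {sequence name: group label} (the grouping factor).
    exclude:   1-based positions to skip (e.g. a gap-rich region).
    """
    excluded = set(exclude)
    names = list(alignment)
    length = len(alignment[names[0]])
    result = {}
    for group in sorted(set(groups.values())):
        inside = [n for n in names if groups[n] == group]
        outside = [n for n in names if groups[n] != group]
        hits = []
        for i in range(length):
            if i + 1 in excluded:
                continue
            bases_in = {alignment[n][i] for n in inside}
            if len(bases_in) != 1:
                continue  # not fixed within the group
            base = next(iter(bases_in))
            if base in "-N?":
                continue  # assumption: gaps/ambiguities are never diagnostic
            if all(alignment[n][i] != base for n in outside):
                hits.append((i + 1, base))
        result[group] = hits
    return result


# Toy alignment with made-up sequences and group labels.
toy_alignment = {
    "a1": "ACGTA", "a2": "ACGTC",
    "b1": "ATGTA", "b2": "ATGCA",
    "c1": "GCGTA", "c2": "GCGTA",
}
toy_groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}

all_sites = diagnostic_sites(toy_alignment, toy_groups)
without_first = diagnostic_sites(toy_alignment, toy_groups, exclude=range(1, 2))
print(all_sites)      # {'A': [], 'B': [(2, 'T')], 'C': [(1, 'G')]}
print(without_first)  # {'A': [], 'B': [(2, 'T')], 'C': []}
```

With a real control region alignment, the excluded block described above would correspond to exclude=range(33, 123), which covers the 90 positions from 33 to 122.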

2.5 Estimating Conservation value using phylogenetic diversity

The conservation value of each basin was measured as the contribution of the lineage or species present in that basin to the overall phylogenetic diversity (PD) of the entire family. This was obtained by calculating the proportion of PD that is lost when a given basin population is removed, repeating this procedure for each basin. PD and complementary calculations were conducted using the caper R package (Orme et al. 2013) on the population tree obtained with StarBeast3. Coastal basins were not included given their comparatively small size and because the two coastal basins harboring diplomystids are historically part of the larger Andean Biobío Basin.
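The leave-one-out calculation described above can be sketched in Python as follows. This is an illustration only, not the caper code used for the analysis. It assumes that PD is the sum of the lengths of all branches on the paths from the retained tips to the root of the full tree; other conventions (for example, measuring only the subtree below the most recent common ancestor of the retained tips) give different values for the deepest lineages. The tree, the basin labels, and the function names are hypothetical.

```python
def phylogenetic_diversity(parent, tips):
    """Sum the lengths of all branches on the paths from `tips` to the root.

    parent maps each non-root node to (parent node, branch length).
    """
    used = set()
    for tip in tips:
        node = tip
        # Walk towards the root; stop early once a visited branch is reached,
        # because everything above it has already been counted.
        while node in parent and node not in used:
            used.add(node)
            node = parent[node][0]
    return sum(parent[node][1] for node in used)


def pd_loss(parent, tips):
    """Proportion of total PD lost when each tip is removed in turn."""
    total = phylogenetic_diversity(parent, tips)
    losses = {}
    for tip in tips:
        remaining = [t for t in tips if t != tip]
        losses[tip] = (total - phylogenetic_diversity(parent, remaining)) / total
    return losses


# Hypothetical ultrametric tree with four basin populations (made-up lengths).
toy_tree = {
    "basin_A": ("n1", 1.0),
    "basin_B": ("n1", 1.0),
    "n1": ("n2", 1.0),
    "basin_C": ("n2", 2.0),
    "n2": ("root", 1.0),
    "basin_D": ("root", 3.0),
}
toy_tips = ["basin_A", "basin_B", "basin_C", "basin_D"]

total_pd = phylogenetic_diversity(toy_tree, toy_tips)
losses = pd_loss(toy_tree, toy_tips)
for tip in toy_tips:
    print(tip, round(losses[tip], 3))
```

In this toy tree the total PD is 9 branch-length units, and the long, isolated lineage basin_D accounts for one third of it, which is the kind of pattern that would translate into a high conservation value.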

Usage notes

These data contain different types of files:

  1. Sequence alignments for molecular diagnosis
  2. Script with functions to conduct phylogenetic diversity analysis
  3. XML files to run the starBEAST3 analyses

Funding

Universidad Metropolitana de Ciencias de la Educacion, Award: DIUMCE 11-2021-PGI