Tracing the origins and evolution of nymphalid butterflies (Lepidoptera) in the Atlantic Forest
Data files
Mar 04, 2026 version files 15.03 GB
- BSM_output.zip (14.69 GB)
- CLADS_output.zip (305.32 MB)
- IQTREE2.2_output.zip (78.04 KB)
- Mafft_Alignments_Partitions_files.zip (30.50 MB)
- MCMCTree_input_trees.zip (36.65 KB)
- MCMCTree_output.zip (236.15 KB)
- README.md (12.14 KB)
- Rscripts.zip (19.90 KB)
Abstract
This dataset contains the sequence alignments used to reconstruct time-calibrated phylogenies of Nymphalidae butterflies from the Neotropical region, with a focus on lineages occurring in the Atlantic Forest. The data include alignments for 68 newly sequenced species (43 of which had not been previously included in global phylogenies), combined with existing datasets from Chazot et al. (2021) and Kawahara et al. (2023). Alignments were generated using the BUTTERFLY1.0 probe set targeting 391 loci, including 11 legacy genes commonly used in Lepidoptera systematics. Alignments were performed separately for nine major Neotropical clades of Nymphalidae, using MAFFT and curated to minimize missing data. These sequence data formed the basis for subsequent phylogenetic, divergence time, and biogeographic analyses, and provide a resource for future studies of butterfly evolution and Neotropical biodiversity. In addition, it contains the MCMCtree dated phylogenies inferred in this study and used for the biogeographical analysis.
Dataset DOI: 10.5061/dryad.wm37pvn11
Description of the data and file structure
This dataset contains the data used and generated to study the evolutionary history of Atlantic Forest nymphalids. The data were generated with bash- and R-based software, and the R scripts needed to reproduce the outputs are provided. Missing data are coded as 'N'. Clade numbers follow the order of Table 1 in the manuscript.
Files and variables
File: Mafft_Alignments_Partitions_files.zip
Description: Molecular alignment and partition files used for phylogenetic inference of the 9 clades reported in Table 1 of the manuscript.
- Alignment files (.phy): Molecular sequence alignments in PHYLIP format used for IQ-TREE2 topology analyses and MCMCTree calibration analyses.
- Partition files (.partitions): Partition scheme files corresponding to each alignment, used as input for IQ-TREE2 analyses.
Each of the 9 clades (see Table 1 in the manuscript) has one alignment file and one corresponding partition file. Alignment files can be opened with phylogenetic software such as IQ-TREE2, RAxML or Geneious, or viewed in any plain-text editor. Partition files are plain text files. Raw molecular data were derived from short-read whole genome sequencing (NCBI BioProject: PRJNA1297423), Chazot et al. (2021) and Kawahara et al. (2023).
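Because missing data are coded as 'N', a quick per-taxon check of missing data can be useful before reusing an alignment. The following is a minimal Python sketch, not part of the deposited scripts. It assumes a relaxed sequential PHYLIP layout (a header line with the number of taxa and sites, then one taxon name and its full sequence per line); the taxon names and sequences in the demo are invented for illustration.

```python
from io import StringIO


def read_phylip(handle):
    """Read a relaxed sequential PHYLIP alignment into a {taxon: sequence} dict.

    Assumes one taxon per line: name, whitespace, then the full sequence.
    """
    header = handle.readline().split()
    n_taxa, n_sites = int(header[0]), int(header[1])
    seqs = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        name, seq = line.split(None, 1)
        seqs[name] = seq.replace(" ", "").strip()
    if len(seqs) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, found {len(seqs)}")
    for name, seq in seqs.items():
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(seq)}")
    return seqs


def missing_fraction(seq, missing="N"):
    """Fraction of sites in a sequence coded as missing."""
    return seq.upper().count(missing) / len(seq)


# Invented demo alignment; replace with open("<alignment>.phy") for a real file.
demo = StringIO("3 8\nsp_a  ACGTNNAC\nsp_b  ACGTACGT\nsp_c  NNNNACGT\n")
alignment = read_phylip(demo)
report = {name: missing_fraction(seq) for name, seq in alignment.items()}
for name, frac in report.items():
    print(f"{name}\t{frac:.2f}")
```

If the deposited files use an interleaved layout instead, a dedicated parser (for example in Biopython or R's ape) would be a safer choice than this sketch.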
File: IQTREE2.2_output.zip
Description: Phylogenetic tree files generated from maximum likelihood topological analyses performed using IQ-TREE v2.2.0. The analyses were conducted on the 9 nymphalid clades analyzed in this study (see Table 1 in the manuscript), using the species list provided in Table S1.
- Tree files (.treefile): One unrooted maximum likelihood tree per clade (9 files in total).
Trees were inferred using concatenated alignments (.phy) and corresponding partition files (.partitions) with the following command:
iqtree2-mpi -s ${FILE}.phy -p ${FILE}.partitions -B 1000 --boot-trees --wbtl --alrt 1000 --abayes --bnni -m MFP --merge -T 10 --prefix ${FILE}
Analysis parameters used:
- Model selection using ModelFinder (-m MFP)
- 1000 ultrafast bootstrap replicates (-B 1000)
- 1000 SH-aLRT replicates (--alrt 1000)
- Approximate Bayes support (--abayes)
- Further optimization of ultrafast bootstrap trees by nearest neighbor interchange (--bnni)
- Partition merging enabled (--merge)
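To rerun the same analysis on all nine clades, the command above can be assembled once per alignment. The sketch below only builds and prints the argument lists; it does not launch IQ-TREE. The file prefixes (clade01 to clade09) are hypothetical and should be replaced with the actual alignment names in Mafft_Alignments_Partitions_files.zip.

```python
import shlex


def iqtree_command(prefix, threads=10):
    """Return the IQ-TREE 2 argument list for one clade, mirroring the command above."""
    return [
        "iqtree2-mpi",
        "-s", f"{prefix}.phy",          # concatenated alignment
        "-p", f"{prefix}.partitions",   # partition scheme
        "-B", "1000",                   # ultrafast bootstrap replicates
        "--boot-trees", "--wbtl",
        "--alrt", "1000",               # SH-aLRT replicates
        "--abayes", "--bnni",
        "-m", "MFP", "--merge",
        "-T", str(threads),
        "--prefix", prefix,
    ]


# Hypothetical prefixes; the real file names may differ.
clades = [f"clade{i:02d}" for i in range(1, 10)]
commands = [iqtree_command(clade) for clade in clades]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `iqtree2-mpi` is installed.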
Tree files can be opened directly with tree visualization software (e.g., FigTree, iTOL).
File: MCMCTree_input_trees.zip
Description: Phylogenetic tree files used as input for divergence time estimation analyses performed with MCMCTree (part of the PAML package). Topology inference for each tree was first conducted using IQ-TREE v2.2.0. The resulting topologies were then manually edited to incorporate secondary calibration points prior to running the MCMCTree calibration analysis (see the Methods section and Table S3 of the manuscript for calibration details).
- Tree files (.nwk): 9 phylogenetic trees in Newick format, formatted for use in MCMCTree.
These files can be used directly for MCMCTree analysis or opened with tree visualization software (e.g., FigTree, iTOL).
File: MCMCTree_output.zip
Description: Tree files (.tre) generated with MCMCTree for each subclade (see Table 1 in the manuscript), together with posterior mean node ages (.txt files) for each clade.
- Tree files (.tre): 18 time-calibrated phylogenetic trees corresponding to the first and second runs of the MCMCTree analysis for each clade.
- Posterior means files (.txt): Posterior mean estimates from MCMCTree output in text format.
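One common use of paired MCMCTree runs is to check that both runs give similar posterior mean node ages. The sketch below illustrates that comparison in Python; it is not taken from the deposited scripts. The node labels and ages are invented, and reading the values from the .txt files is left out because their exact layout is not described here.

```python
def compare_runs(run1, run2):
    """Relative difference in posterior mean age for nodes present in both runs."""
    shared = sorted(set(run1) & set(run2))
    diffs = {}
    for node in shared:
        mean_age = (run1[node] + run2[node]) / 2
        diffs[node] = abs(run1[node] - run2[node]) / mean_age
    return diffs


# Invented posterior means for two runs of the same clade (same time units in both).
run1 = {"t_n10": 40.0, "t_n11": 20.0}
run2 = {"t_n10": 41.0, "t_n11": 20.0}

diffs = compare_runs(run1, run2)
worst_node = max(diffs, key=diffs.get)
print(f"largest relative difference: {worst_node} ({diffs[worst_node]:.3%})")
```

Nodes with a large relative difference would suggest that the chains need to run longer before the two runs are treated as equivalent.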
File: BSM_output.zip
Description: Input files, script dependencies, and output objects generated for historical biogeographic analyses performed using the R package BioGeoBEARS (Matzke 2013). The analyses include DEC (Dispersal–Extinction–Cladogenesis) model inference and Biogeographical Stochastic Mapping (BSM). Both constrained and unconstrained dispersal models were implemented. All .RData objects were generated using the scripts provided in Rscripts.zip, included in this Dryad submission.
Folder Structure:
BSM_output.zip
│
├── ConstrainedAnalysis/
│   └── BSM_out/
│       └── ${Subclade}/ (14 folders)
│
└── Unconstrained/
    └── BSM_out/
        └── ${Subclade}/ (14 folders)
There are 14 subclade folders in each analysis type (see Table 1 in the manuscript for subclade information).
Constrained vs Unconstrained Analyses: In the constrained analysis, dispersal was allowed only between geographically adjacent areas. In the unconstrained analysis, dispersal between all areas was permitted. Each folder (ConstrainedAnalysis/ and Unconstrained/) contains identical file structures but different model assumptions.
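The difference between the two settings can be pictured as a 0/1 matrix stating which pairs of areas may exchange lineages. The Python sketch below is illustrative only: the area codes are those used in distribution.txt, but the adjacency pairs are hypothetical placeholders and do not reproduce the adjacencies used in the study (see the manuscript and DECmodel&BSM.R for those).

```python
AREAS = ["O", "C", "N", "A", "Z", "F", "S", "D"]


def dispersal_matrix(areas, adjacent_pairs=None):
    """Build a symmetric 0/1 matrix of permitted dispersal between areas.

    With adjacent_pairs=None every pair is permitted (unconstrained model);
    otherwise only the listed pairs are (constrained model). The diagonal is 1.
    """
    n = len(areas)
    if adjacent_pairs is None:
        return [[1] * n for _ in range(n)]
    index = {area: i for i, area in enumerate(areas)}
    matrix = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in adjacent_pairs:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix


# HYPOTHETICAL adjacency list, for illustration only.
HYPOTHETICAL_ADJACENCY = [("C", "N"), ("N", "A"), ("F", "S")]

constrained = dispersal_matrix(AREAS, HYPOTHETICAL_ADJACENCY)
unconstrained = dispersal_matrix(AREAS)

for area, row in zip(AREAS, constrained):
    print(area, " ".join(str(v) for v in row))
```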
Contents of Each Subclade Folder (e.g., ConstrainedAnalysis/BSM_out/${Subclade}/):
- newick/ folder: contains 50 randomly sampled .newick tree files. These trees were used as input for the BSM analyses and represent the phylogenetic uncertainty incorporated into stochastic mapping. These files can be opened in standard phylogenetic tree visualization software or imported into R.
- ${subclade}_DEC_actual_phylo.RData: R workspace file containing the DEC model fit object, estimated parameters (dispersal, extinction), log-likelihood and AIC values, and ancestral range reconstruction results for the DEC model. This object was used as input for subsequent BSM analyses. Load in R using load("${subclade}_DEC_actual_phylo.RData").
- ${subclade}_mcmc.tre: Tree file derived from MCMCTree analyses used as the phylogenetic input for DEC modeling.
- distribution.txt: Geographic distribution matrix for all tip taxa included in the analysis. Each tip is coded according to presence in the following biogeographic areas: outside Neotropics (O); Central America (C); Northern Andes (N); Central Andes (A); Amazonia (Z); Northern Atlantic Forest (F); Southern Atlantic Forest (S); Diagonal of open formations (D). See Figure 1 and the Materials and Methods section of the manuscript for more details.
- BSM output .RData files: These files contain stochastic mapping results and event summaries. To open these files and visualize the results, use the R scripts in Rscripts.zip (Retrieve_BSM_results_plots.R, BSM_RetrieveResults_QGRAPHS_Circles_plots.R and combine_BSM_CLADS_diversification.rate_plots.R).
  - ana_events_tables.RData: tables summarizing anagenetic events inferred during BSM analyses (dispersal and extinction along branches).
  - clado_events_tables.RData: tables summarizing cladogenetic events.
  - counts_list.RData: aggregated counts of biogeographic events across stochastic maps.
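To inspect the tip ranges outside R, the distribution matrix can be decoded into lists of area codes. The Python sketch below assumes that distribution.txt follows the usual BioGeoBEARS geography layout: a header with the number of taxa, the number of areas and the area letters in parentheses, then one taxon per line with a presence/absence string. Both that layout and the area order shown are assumptions, and the taxon names are invented.

```python
def parse_geography(text):
    """Decode a BioGeoBEARS-style geography table into {taxon: [area codes]}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    n_taxa = int(header.split()[0])
    # Area letters are listed between parentheses in the header line.
    areas = header[header.index("(") + 1 : header.index(")")].split()
    ranges = {}
    for ln in lines[1:]:
        taxon, bits = ln.split()
        if len(bits) != len(areas):
            raise ValueError(f"{taxon}: expected {len(areas)} areas, found {len(bits)}")
        ranges[taxon] = [area for area, bit in zip(areas, bits) if bit == "1"]
    if len(ranges) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, found {len(ranges)}")
    return ranges


# Invented example; the area order in the real file may differ.
demo = "2\t8 (O C N A Z F S D)\nTaxon_one\t00000110\nTaxon_two\t00101000\n"
ranges = parse_geography(demo)
for taxon, present in ranges.items():
    print(taxon, "".join(present))
```

In this toy table, Taxon_one occurs in both Atlantic Forest areas (F and S) and Taxon_two in the Northern Andes and Amazonia (N and Z).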
File: CLADS_output.zip
Description: Outputs of within-region species diversification analyses performed for each subclade using ClaDS (Maliet et al. 2019; Maliet and Morlot 2022). Analyses were performed for 14 subclades (see Table 1 in the manuscript), estimating branch-specific diversification rates while accounting for incomplete taxon sampling (see Supplementary Table S6 in the online version of the manuscript for tip-specific "f" values). These branch-specific rates were subsequently combined with BioGeoBEARS ancestral range estimates from BSM analyses to compute region-specific diversification rates over time (combine_BSM_CLADS_diversification.rate_plots.R script).
- output.${subclade}: ClaDS output file (binary). Can be opened and analysed using Julia.
- output.${subclade}.RData: ClaDS outputs converted for R. It can be opened in R with load("output.${subclade}.RData").
- lambda_tips/ folder: 14 files named lambda_tips_${subclade}.txt containing tip-specific speciation rates (λ) inferred from ClaDS for each terminal taxon. Format: plain text table with one tip per row and the corresponding λ value(s).
Files were generated in Julia with the code:
# Macroevolutionary analyses from: https://github.com/LPDagallier/Monodoreae_macroevolution
############# Install Julia on a Windows PC
# Run this in the Windows terminal
winget install julia -s msstore
##############
# Within a Julia session, you can install the package by typing
using Pkg
Pkg.add("PANDA")
# PANDA can handle missing species in the phylogeny
# The input "f" in Julia is a vector of sampling fractions: [1, 0.5, .., 0.8]
# Each value is calculated as: no. of species in the phylogeny / no. of extant species
# The order of the vector is the same as the tip order in R's ape
# Let's remove the outgroups and ladderize the phylogeny
# In Julia
cd("PATH/TO/DIRECTORY")
using PANDA
my_tree = load_tree("${subclade}\\${subclade}_actinote_mcmc.tre")
# f must already be defined as the sampling-fraction vector described above
output = infer_ClaDS(my_tree, print_state = 100, f = f)
using JLD2
@save "./output.${subclade}" output
save_ClaDS_in_R(output, "./output.${subclade}.Rdata")
# You can also retrieve a previously run result
using JLD2
@load "./output" output
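The sampling-fraction vector described in the comments above (number of species in the phylogeny divided by the number of extant species) can be worked through with a small example. The Python sketch below assumes that each tip receives the fraction of the group it belongs to, in tip order; the group names and species counts are invented, and the values actually used are those in Supplementary Table S6.

```python
def sampling_fractions(tip_groups, sampled, extant):
    """Return one sampling fraction per tip, in the same order as tip_groups.

    Each tip gets (species sampled in its group) / (extant species in its group).
    """
    fractions = []
    for group in tip_groups:
        frac = sampled[group] / extant[group]
        if not 0 < frac <= 1:
            raise ValueError(f"{group}: sampling fraction {frac} is outside (0, 1]")
        fractions.append(frac)
    return fractions


# Invented example: three tips, two from genusA and one from genusB.
tip_groups = ["genusA", "genusA", "genusB"]
sampled = {"genusA": 2, "genusB": 1}   # species present in the phylogeny
extant = {"genusA": 4, "genusB": 5}    # described extant species

f = sampling_fractions(tip_groups, sampled, extant)
print(f)
```

The resulting list would then be written out in tip order and passed as the `f` argument of `infer_ClaDS` in Julia.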
File: Rscripts.zip
Description: This file contains the R scripts used for:
- DEC and Biogeographical Stochastic Mapping (BSM) analyses (DECmodel&BSM.R)
- Retrieve the results of the BSM analyses and make the plots for Figures 2, S3 and S5 (Retrieve_BSM_results_plots.R)
- Retrieve the BSM results to make Figure 1 (BSM_RetrieveResults_QGRAPHS_Circles_plots.R)
- Combine the BSM results with the diversification analysis results and make Figures 3 and S6 (combine_BSM_CLADS_diversification.rate_plots.R).
Code/software
Partition schemes and alignments were used to infer tree topologies with IQ-TREE 2.2.0:
iqtree2-mpi -s ${file}.phy -p ${file}.partitions -B 1000 --boot-trees --wbtl --alrt 1000 --abayes --bnni -m MFP --merge -T 10 --prefix clade01
Tree files (.tre) were generated with MCMCTree software from the PAML package v4.10.6 for tree calibration analysis.
BSM and DEC models were run in R version 4.3.3.
To make the figures, we used ggplot2 in R version 4.3.3.
Access information
Other publicly accessible locations of the data:
- SRA raw data will be available in NCBI BioProject PRJNA1297423
- Links to the supplementary tables are provided in the main manuscript.
Data was derived from the following sources:
- Whole genome sequencing (Novogene) and two published phylogenies:
  1. Chazot, N., Condamine, F. L., Dudas, G., Peña, C., Kodandaramaiah, U., Matos-Maraví, P., … Wahlberg, N. (2021). Conserved ancestral tropical niche but different continental histories explain the latitudinal diversity gradient in brush-footed butterflies. Nature Communications, 12(1), 5717. https://doi.org/10.1038/s41467-021-25906-8
  2. Kawahara, A. Y., Storer, C., Carvalho, A. P. S., Plotkin, D. M., Condamine, F. L., Braga, M. P., … Lohman, D. J. (2023). A global phylogeny of butterflies reveals their evolutionary history, ancestral hosts and biogeographic origins. Nature Ecology & Evolution, 7(6), 903–913. https://doi.org/10.1038/s41559-023-02041-9
