Comparative phylogeography of phrynosomatid lizards in Baja California: Asynchronous divergences and expansion of Callisaurus draconoides across the North American deserts

Abstract

This dataset comprises eleven archives containing processed genomic data, analysis inputs/outputs, and supporting files from double-digest RAD sequencing (ddRAD) and target sequence capture (TSC). Data span four genera (Callisaurus, Petrosaurus, Sceloporus, and Urosaurus) and include pyRAD assemblies, Admixture clustering, RAxML phylogenies, BPP coalescent-with-migration models, phyluce TSC assemblies, BEAST2 and StarBEAST2 species trees, and ecoevolity divergence-time inferences across Baja California biogeographic breaks. Files are organized by analysis and genus, with configuration files, job scripts, summary statistics, and outputs in formats including VCF, structure/geno, Phylip/Nexus alignments, and phased allele sequences. Geographic coordinates are provided at reduced precision (0.01°) to limit exact locality disclosure. These data are suitable for reuse in comparative phylogeography, population genomics, and phylogenetic method development, enabling replication of published analyses or testing of alternative models and pipelines. Collection and use of specimens were conducted under relevant permits and institutional animal care protocols (see publication for details), and downstream use should continue to respect applicable legal and ethical standards.

Dataset DOI: 10.5061/dryad.tqjq2bwcb

Summary

This dataset is associated with the article titled "Comparative phylogeography of phrynosomatid lizards in Baja California: asynchronous divergences and expansion of Callisaurus draconoides across the North American deserts", accepted to Journal of Biogeography on September 30, 2025 (DOI: 10.1111/jbi.70075). A preprint is also available here.

We did our best to thoroughly document every analysis presented in the paper to fully enable reproducibility of the key results, but not every intermediate file, log file, or job script is included. If you have any questions or concerns, please contact the corresponding author, Andrew Gottscho (gottschoa@si.edu or andrew.gottscho@gmail.com).

Thank you for your interest in this article and dataset!

Frequently used acronyms

Throughout this Data Dryad package and the associated paper, you will frequently encounter the following acronyms.

BCP = Baja California Peninsula
ddRAD = double-digest restriction-site associated DNA sequencing
TSC = Target Sequence Capture

Raw data availability

Raw sequence data (FASTQ format) have been deposited in the NCBI SRA (https://www.ncbi.nlm.nih.gov/sra/PRJNA1242740). These FASTQ data are key inputs to the pyRAD (ddRAD) and phyluce (TSC) pipelines.

Description of the data and file structure

Eleven .zip files are available to download, numbered roughly in the order in which the corresponding analyses are presented in the manuscript.

01_BCP_ddRAD_pyrad.zip

This package contains input files, output files, and statistics associated with the pyrad v3.0.6 pipeline, which was used to process the raw FASTQ data (BCP ddRAD) into a variety of output formats.

There are four top-level directories, corresponding to the four genera included in the study: Callisaurus/, Petrosaurus/, Urosaurus/, and Sceloporus/.

Within each of these top-level directories, there are the following files:

[genus]_params_021017.txt: The input parameters file (the most important file for reproducibility)
step2.job: Example jobscript file for step 2 of the pipeline
step3_[genus].job: Example jobscript file for step 3 of the pipeline
steps5_7_[genus].job: Example jobscript file for steps 5-7 of the pipeline

Within each top-level directory, there are also the following sub-directories:

outfiles/: Please see the pyrad documentation for more details on file formats. Not all output files were used in the manuscript, but all are provided here for maximum transparency and utility to anyone interested in reanalyzing this dataset. [n] = the MinCov parameter (minimum samples in a final locus).
- output_[genus]_021017_[n]_h5_p75.alleles: alleles file
- output_[genus]_021017_[n]_h5_p75.excluded_loci: excluded loci
- output_[genus]_021017_[n]_h5_p75.gphocs: GPHOCS format
- output_[genus]_021017_[n]_h5_p75.loci: loci file
- output_[genus]_021017_[n]_h5_p75.nex: nexus format
- output_[genus]_021017_[n]_h5_p75.phy: phylip format
- output_[genus]_021017_[n]_h5_p75.phy.partitions: partitioned phylip format
- output_[genus]_021017_[n]_h5_p75.snps: single nucleotide polymorphisms (SNPs)
- output_[genus]_021017_[n]_h5_p75.snps.geno: SNPs in geno format
- output_[genus]_021017_[n]_h5_p75.str: structure format
- output_[genus]_021017_[n]_h5_p75.unlinked_snps: unlinked SNPs
- output_[genus]_021017_[n]_h5_p75.usnps.geno: unlinked SNPs in geno format
- output_[genus]_021017_[n]_h5_p75.vcf: variant call file
stats/: Refer to the pyrad documentation for more details.
- output_[genus]_021017_[n]_h5_p75.stats: stats summary file (source for Supplementary Table 4)
- s5.consens.txt: stats for step 5
- s3.clusters.txt: stats for step 3
- s2.rawedit.txt: stats for step 2
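
The naming scheme above can be unpacked programmatically when looping over the four genera. The following is a minimal sketch, assuming the names follow the pattern output_[genus]_021017_[n]_h5_p75.[extension] exactly as listed. The function name and the example value "n10" for [n] are hypothetical and not taken from the archive.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the file listing in this README (an assumption):
# output_[genus]_[6-digit date]_[n]_h5_p75.[extension]
PATTERN = re.compile(
    r"^output_(?P<genus>[A-Za-z]+)_(?P<date>\d{6})_(?P<mincov>[A-Za-z]?\d+)"
    r"_h5_p75\.(?P<ext>.+)$"
)


class PyradOutput(NamedTuple):
    genus: str
    date: str
    mincov: str      # the [n] token, kept as text (e.g. "n10")
    extension: str   # everything after the first dot, e.g. "usnps.geno"


def parse_output_name(filename: str) -> Optional[PyradOutput]:
    """Split a pyrad output filename into its parts, or return None."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return PyradOutput(match["genus"], match["date"], match["mincov"], match["ext"])


# Hypothetical filename used only for illustration.
print(parse_output_name("output_petrosaurus_021017_n10_h5_p75.usnps.geno"))
```

Returning None for names that do not match lets a caller skip the params and job files that sit alongside the outputs.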

02_BCP_ddRAD_admixture.zip

This package contains the inputs, outputs, and other relevant files for the Admixture analysis of the BCP ddRAD data, presented in Figures 2-5 in the article. For more information, consult the Admixture website.

There are four top-level directories, one for each genus: callisaurus/, petrosaurus/, sceloporus/, and urosaurus/. Each directory contains:
- admixture_plot_[genus].R: R script used to plot results
- CVE_values.txt: cross-validation errors, used to determine the optimal K value for each genus
- output_[genus]_021017_[n]_h5_p75.usnps.[k].P: nine allele-frequency files, one for each K value
- output_[genus]_021017_[n]_h5_p75.usnps.[k].Q: nine ancestry-proportion files, one for each K value
- output_[genus]_021017_[n]_h5_p75.usnps.geno: The input file in .geno format
CVE_table.csv: A summary of the cross-validation errors for all genera
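
As an illustration of how cross-validation errors are commonly used to choose K, here is a minimal sketch that picks the K with the lowest error. It assumes the errors are stored as the standard Admixture log lines ("CV error (K=2): 0.48"); the actual layout of CVE_values.txt may differ, and the numbers below are invented for the example.

```python
import re

# Standard Admixture log line, e.g. "CV error (K=3): 0.52"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_errors(text):
    """Return {K: cross-validation error} for every CV line found in text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(text)}


def best_k(cv_errors):
    """Return the K with the smallest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Invented values, for illustration only.
example = """CV error (K=1): 0.61
CV error (K=2): 0.48
CV error (K=3): 0.52
"""

errors = parse_cv_errors(example)
print(errors)
print("lowest CV error at K =", best_k(errors))
```

The lowest error is only one criterion; the K values reported in the article should be taken from the article itself.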

03_BCP_ddRAD_raxml.zip

This package contains the inputs, outputs, and other files necessary to reproduce the RAxML analyses for the BCP ddRAD data, presented in Figures 2-5 in the article. To learn more, see the RAxML website.

There are four top-level directories, one for each genus: callisaurus/, petrosaurus/, sceloporus/, and urosaurus/. Each directory contains:
- output_[genus]_021017_[n]_h5_p75.phy: The input file from pyrad
- RAxML_bipartitions.[genus].tre: The output .tre file presented in Figures 2-5
- raxml_[genus].job: Jobscript file used to run the analysis
- raxml_[genus].log: Log file from the analysis
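
RAxML_bipartitions files are Newick trees in which bootstrap support is written as a label on each internal node. The sketch below pulls those labels out with a regular expression, assuming taxon names contain no parentheses. The toy tree and the function name are illustrative and not taken from the archive.

```python
import re

# An internal-node label is a number directly after ")" and before ":", ",", ")" or ";".
SUPPORT = re.compile(r"\)(\d+(?:\.\d+)?)(?=[:,);])")


def support_values(newick):
    """Return the internal-node support values found in a Newick string."""
    return [float(value) for value in SUPPORT.findall(newick)]


# Toy tree for illustration only.
toy_tree = "((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.3)60:0.02,E:0.4);"

values = support_values(toy_tree)
well_supported = [v for v in values if v >= 70]
print(values)
print(len(well_supported), "of", len(values), "nodes have support >= 70")
```

For anything beyond a quick summary, a dedicated tree library is a safer choice than a regular expression.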

04_BCP_ddRAD_bpp.zip

This package contains input files for Bayesian Phylogenetics and Phylogeography (BPP) analyses conducted under the Multispecies Coalescent with Migration model (MSC-M). The data include phased genomic data, population/species mapping files, and multiple BPP control (.ctl) files for five species complexes. These results are presented in Tables 4-5 and Supplemental Table 6 in the article. For more details, see the BPP repository on GitHub.

Each folder corresponds to a focal species and contains:

A data file (data.txt) with phased allele sequences formatted for BPP
One or more imap files (imap.txt, imap2.txt, etc.) that map individuals to populations or species units used in the MSC-M analyses
One or more BPP control files (bpp1.ctl, bpp2.ctl, etc.), each specifying a different migration scenario

These files can be used to replicate the analyses in the associated publication or adapted for additional analyses of gene flow using the MSC-M framework.

The package contains folders for each species complex:

Petrosaurus/
Sceloporus_magister/
Sceloporus_orcutti/
Urosaurus/
Callisaurus/

Each species complex has the following files:

data.txt
- Phased ddRADseq data formatted for BPP input
- Each data file contains two alleles per individual, and is ready for direct use in MSC-M analyses
imap.txt, imap2.txt, etc.
- These files define the mapping of individuals (allele pairs) to species or population units
- Multiple imap files are provided where alternative grouping hypotheses were tested
bpp1.ctl, bpp2.ctl, etc.
- Control files for BPP, each specifying model parameters, file paths, and the migration scenario being analyzed
- Each .ctl file corresponds to a specific analysis or migration model (e.g., different gene flow model, different population assignments)
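
To check how individuals are grouped in a given imap file, the mapping can be tabulated. This is a minimal sketch, assuming the usual BPP Imap layout of two whitespace-separated columns (individual ID, then population or species label). The sample IDs and population names below are hypothetical.

```python
from collections import Counter


def read_imap(text):
    """Parse Imap text into {individual_id: population_label}."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mapping[fields[0]] = fields[1]
    return mapping


# Hypothetical content, for illustration only.
toy_imap = """ind1 north
ind2 north
ind3 south
"""

mapping = read_imap(toy_imap)
counts = Counter(mapping.values())
print(counts)
```

Running the same tabulation on imap.txt and imap2.txt side by side shows which individuals move between groups under the alternative hypotheses.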

Species folder details:

Petrosaurus/:
- data.txt: Phased alleles for all individuals
- imap.txt, imap2.txt: Two different population/species groupings tested
- bpp1.ctl, bpp2.ctl, bpp3.ctl: Three different migration scenarios
Sceloporus_magister/:
- data.txt: Input data for analysis
- imap.txt: Single grouping
- bpp1.ctl: One migration scenario
Sceloporus_orcutti/:
- data.txt: Input data for analysis
- imap.txt, imap2.txt, imap3.txt: Three different groupings tested
- bpp1.ctl, bpp2.ctl, bpp3.ctl: Three migration scenarios tested
Urosaurus/:
- data.txt: Input data for analysis
- imap.txt, imap2.txt: Two groupings tested
- bpp1.ctl, bpp2.ctl: Two migration scenarios
Callisaurus/:
- data.txt: Input data for analysis
- imap.txt: Single grouping
- bpp1.ctl through bpp9.ctl: Nine different migration models tested

05_BCP_TSC_phyluce.zip

This package contains input files, jobscripts, logs, and a complete set of final output files for the phyluce pipeline, which was used to process the TSC data. For more details, please see the phyluce documentation.

assembly.conf: configuration file used for assembly
illumiprocessor.conf: configuration file used for Illumiprocessor
illumiprocessor.log: log file from Illumiprocessor
lizard_probes_edit.fasta: probe file used in the TSC workflow
mafft-nexus-internal-trimmed-gblocks-clean-75p/: output files for 75% complete data in nexus format
- 549 nexus files provided, one for each locus, following the format [locus_name].nexus
mafft-nexus-internal-trimmed-gblocks-clean-75p-raxml/: output file for 75% complete data in phylip format
- a single concatenated file is provided
mafft-nexus-internal-trimmed-gblocks-clean-90p/: output files for 90% complete data in nexus format
- 310 nexus files provided, one for each locus, following the format [locus_name].nexus
mafft-nexus-internal-trimmed-gblocks-clean-90p-raxml/: output file for 90% complete data in phylip format
- a single concatenated file is provided
phyluce_assembly_assemblo_trinity.log: log file for phyluce assembly
phyluce_assembly_get_match_counts.log: log file for phyluce assembly
phyluce_assembly_match_contigs_to_probes.log: log file for phyluce assembly
step2_illumiprocessor.job/.log: jobscript/log files for step 2 (Illumiprocessor)
step3_trinity.job/.log: jobscript/log files for step 3 (Trinity)
step4_fasta_lengths.job/.log: jobscript/log files for step 4
step5_assembly_match_contigs_probes.job/.log: jobscript/log files for step 5
step6_get_match_counts_baja.job/.log: jobscript/log files for step 6
step6_get_match_counts.job/.log: jobscript/log files for step 6
step7_get_fastas_from_match_counts.job/.log: jobscript/log files for step 7
step8_explode_get_fastas_file.job/.log: jobscript/log files for step 8
step9_get_fasta_lengths.job/.log: jobscript/log files for step 9
step10_align_seqcap_align.job/.log: jobscript/log files for step 10
step11_get_align_summary_data.job/.log: jobscript/log files for step 11
step12_align_seqcap_align.job/.log: jobscript/log files for step 12
step13_get_gblocks_trimmed_alignments_from_untrimmed.job/.log: jobscript/log files for step 13
step14_get_align_summary_data.job/.log: jobscript/log files for step 14
step15_remove_locus_name_from_nexus_lines.job/.log: jobscript/log files for step 15
step16_get_only_loci_with_min_taxa.job/.log: jobscript/log files for step 16
step17_format_nexus_files_for_raxml.job/.log: jobscript/log files for step 17
taxon-set-baja.conf: taxon set file
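
The 75p and 90p directories hold loci that meet a minimum taxon-completeness threshold. The sketch below shows one common reading of that filter: a locus is kept when the fraction of taxa present is at least the threshold. phyluce's exact rounding rule may differ, and the locus names, counts, and total number of taxa here are invented.

```python
def filter_loci(taxon_counts, total_taxa, min_fraction):
    """Return the loci whose taxon coverage is at least min_fraction.

    taxon_counts maps a locus name to the number of taxa with data for it.
    """
    if total_taxa <= 0:
        raise ValueError("total_taxa must be positive")
    return sorted(
        locus
        for locus, present in taxon_counts.items()
        if present / total_taxa >= min_fraction
    )


# Invented counts for a hypothetical set of 20 taxa.
toy_counts = {"uce-1": 20, "uce-2": 15, "uce-3": 12}

print(filter_loci(toy_counts, total_taxa=20, min_fraction=0.75))
print(filter_loci(toy_counts, total_taxa=20, min_fraction=0.90))
```

Raising the threshold trades the number of loci for completeness, which matches the smaller number of files in the 90p directory (310) relative to the 75p directory (549).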

06_BCP_TSC_beast2.zip

This package contains input files, selected output files, and jobscripts used to generate a phylogeny of the concatenated BCP TSC data, presented in Supplemental Figure 1 in the article. For more details, see the BEAST2 web page.

mafft-nexus-internal-trimmed-gblocks-clean-75p.phylip: concatenated data used as input, directly from phyluce
baja_TSC_75p.xml: Input file for the analysis
baja_TSC_mafft-nexus-internal-trimmed-gblocks-clean-75p_run1.trees: Trees from the first run (run 1)
baja_TSC_mafft-nexus-internal-trimmed-gblocks-clean-75p_run3.trees: Trees from the second run (run 3)
BEAST_TSC.job: job file used to run BEAST
combined_trees2.trees: combined trees across two runs, after discarding burn-in
max_clade_cred.tre: The final maximum clade credibility tree used to generate Supplemental Figure 1
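
Combining runs after discarding burn-in means dropping the first portion of each run's samples and concatenating what remains. The sketch below shows that logic on plain Python lists. The 10% default is an assumption and not the value used in the article, and tools such as LogCombiner (distributed with BEAST2) are the usual way to do this on real .trees files.

```python
def discard_burnin(samples, fraction):
    """Drop the first `fraction` of samples from a single run."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    start = int(len(samples) * fraction)
    return samples[start:]


def combine_runs(runs, fraction=0.1):
    """Discard burn-in from each run, then concatenate the remainder."""
    combined = []
    for run in runs:
        combined.extend(discard_burnin(run, fraction))
    return combined


# Integers stand in for sampled trees in this toy example.
run_a = list(range(10))
run_b = list(range(100, 110))
print(combine_runs([run_a, run_b], fraction=0.2))
```

Burn-in is removed per run, before concatenation, so that each chain's early samples are excluded regardless of run length.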

07_BCP_TSC_starbeast2.zip

This package contains the input files, selected output files, and jobscript used to run the StarBEAST2 analysis, presented in Figure 6 in the article. For more details, please see the StarBEAST2 tutorial.

starbeast.job: jobscript file used to run the analysis
combined_species.trees: combined species trees resulting from three independent runs, after discarding burn-in
species_run1.trees: species trees resulting from the first run
species_run3.trees: species trees resulting from the second run
species_run4.trees: species trees resulting from the third run
species.tree: maximum clade credibility tree, presented in Figure 6
SpeciesTreeUCLN_26exons_HKY_500million_2.4.5.xml: input file used to run the analysis

08_BCP_ecoevolity.zip

This archive contains input files for ecoevolity analyses conducted on two types of genomic data: ddRAD and TSC. The data are organized to reflect two separate biogeographic tests across the Baja California peninsula: the La Paz and Vizcaíno biogeographic breaks. These results are presented in Figures 7 and 8 in the article. For more details, see the ecoevolity repository on GitHub.

The base directory contains two main subdirectories:

BCP_ddRAD/
TSC/

Each of these directories contains two subfolders, representing the two biogeographic regions tested:

lapaz/
vizcaino/

Within each biogeographic subfolder (lapaz and vizcaino), there are three key directories:

data/
- Contains the sequence data in NEXUS format. Each file corresponds to a population pair used in the ecoevolity analysis. The filenames include species identifiers and dataset parameters (e.g., filtering thresholds and region).
Independent_prior/
- Contains a single file: configuration.yml
- This YAML file specifies the ecoevolity run configuration using an independent prior for each divergence event across population pairs.
Shared_prior/
- Contains a single file: configuration.yml
- This YAML file specifies the ecoevolity run configuration using a shared prior across divergence events.

Contents Overview:

Baja_phryno_ecoevolity/
├── ddRAD/
│   ├── lapaz/
│   │   ├── data/
│   │   ├── Independent_prior/
│   │   └── Shared_prior/
│   └── vizcaino/
│       ├── data/
│       ├── Independent_prior/
│       └── Shared_prior/
└── uce/
    ├── lapaz/
    │   ├── uce/
    │   ├── Independent_prior/
    │   └── Shared_prior/
    └── vizcaino/
        ├── uce/
        ├── independent_prior/
        └── shared_prior/
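
Because folder names vary in capitalization across the listing, a case-insensitive search is a convenient way to locate every ecoevolity configuration file after unzipping. This is a minimal sketch: the function name is hypothetical, and the root path passed to it is whatever directory the archive was extracted into.

```python
from pathlib import Path


def find_configs(root):
    """Map lowercased folder paths to each configuration.yml found under root.

    Keys are tuples such as ("ddrad", "lapaz", "independent_prior").
    """
    root = Path(root)
    found = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name.lower() == "configuration.yml":
            relative = path.relative_to(root)
            key = tuple(part.lower() for part in relative.parts[:-1])
            found[key] = path
    return found


# Example call (the path is an assumption about where the archive was unzipped):
# for key, path in find_configs("Baja_phryno_ecoevolity").items():
#     print("/".join(key), "->", path)
```

Lowercasing only the dictionary keys leaves the real paths untouched, so the returned values can be passed directly to ecoevolity.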

09_callisaurus_rangewide_pyrad.zip

This package contains input files, output files, and statistics associated with the pyrad v3.0.66 pipeline, which was used to process the raw FASTQ data (range-wide Callisaurus, ddRAD) into a variety of output formats.

steps2-7.job: Example jobscript file for steps 2-7 of the pipeline
callisaurus_params_051917.txt: The input parameters file (the most important file for reproducibility)

This package also contains the following directories:

outfiles/: Please see the pyrad documentation for more details on file formats. Not all output files were used in the manuscript, but all are provided here for maximum transparency and utility to anyone interested in reanalyzing this dataset.
- output_callisaurus_051917_n102_h5_p75.alleles: alleles file
- output_callisaurus_051917_n102_h5_p75.excluded_loci: excluded loci
- output_callisaurus_051917_n102_h5_p75.gphocs: GPHOCS format
- output_callisaurus_051917_n102_h5_p75.loci: loci file
- output_callisaurus_051917_n102_h5_p75.nex: nexus format
- output_callisaurus_051917_n102_h5_p75.phy: phylip format
- output_callisaurus_051917_n102_h5_p75.phy.partitions: partitioned phylip format
- output_callisaurus_051917_n102_h5_p75.snps: single nucleotide polymorphisms (SNPs)
- output_callisaurus_051917_n102_h5_p75.snps.geno: SNPs in geno format
- output_callisaurus_051917_n102_h5_p75.str: structure format
- output_callisaurus_051917_n102_h5_p75.unlinked_snps: unlinked SNPs
- output_callisaurus_051917_n102_h5_p75.usnps.geno: unlinked SNPs in geno format
- output_callisaurus_051917_n102_h5_p75.vcf: variant call file
- A parallel set of files with the suffix _ex_outgr mirrors the files above but excludes the outgroup (Holbrookia).
stats/: Refer to the pyrad documentation for more details.
- output_callisaurus_051917_n102_h5_p75_ex_outgr.stats: stats summary file (outgroup excluded)
- output_callisaurus_051917_n102_h5_p75.stats: stats summary file (outgroup included)
- s2.rawedit.txt: stats for step 2
- s3.clusters.txt: stats for step 3
- s5.consens.txt: stats for step 5

10_callisaurus_rangewide_admixture.zip

This package contains the inputs, outputs, and other relevant files for the Admixture analysis of the range-wide Callisaurus ddRAD data, presented in Figure 9 in the article. For more information, consult the Admixture website.

data_conversion/: As described in the article, plink and PGDSpider were used to convert data from pyRAD into a format usable by the newer version of Admixture (v1.3.0).
- output_callisaurus_051917_n102_h5_p75_ex_outgr_plink.log: log file from plink
- output_callisaurus_051917_n102_h5_p75_ex_outgr_plink.nosex: a PLINK output that flags individuals with missing or ambiguous sex information
- output_callisaurus_051917_n102_h5_p75_ex_outgr.map: SNP map information (chromosome, SNP ID, position)
- output_callisaurus_051917_n102_h5_p75_ex_outgr.ped: genotype data in a large text table (individual IDs + genotypes)
- plink-calli_020625.log: log file from plink
- plink.job: jobscript file used to run plink
inputs/
- admix-calli_K5-12.log: log file from Admixture
- admixture.job: jobscript used to run Admixture
- output_callisaurus_051917_n102_h5_p75_ex_outgr_plink.bed: input file for Admixture; binary genotype data
- output_callisaurus_051917_n102_h5_p75_ex_outgr_plink.bim: input file for Admixture; extended SNP map
- output_callisaurus_051917_n102_h5_p75_ex_outgr_plink.fam: input file for Admixture; family/individual information (sample IDs, sex, phenotype)
results_plotting/
- admixture_on_map_callisaurus_020825.R: R script used to plot Admixture results on a map
- admixture_plot_callisaurus_rangewide.R: R script used to generate barplots
- callisaurus_K12_run2.data: output data for K=12
- callisaurus_rangewide_gps.data: GPS coordinates for range-wide Callisaurus, rounded to two decimal places (0.01°) for the purposes of this archive.
- CVE.csv: cross-validation errors, used to determine optimal K
- output_callisaurus_051917_n102_h5_p75_ex_outgr_plink.[k].P: allele frequencies output by Admixture
- output_callisaurus_051917_n102_h5_p75_ex_outgr_plink.[k].Q: ancestry proportions estimated by Admixture
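
For users plotting the GPS file, the effect of the coordinate rounding can be illustrated as follows. This is a minimal sketch: the example point is made up, and the conversion uses an approximate 111.32 km per degree of latitude.

```python
KM_PER_DEGREE_LATITUDE = 111.32  # approximate value


def round_coordinate(value, decimals=2):
    """Round a decimal-degree coordinate to the given number of decimals."""
    return round(value, decimals)


def grid_spacing_km(decimals=2):
    """Approximate north-south spacing of the rounding grid, in kilometres."""
    return (10 ** -decimals) * KM_PER_DEGREE_LATITUDE


# Made-up coordinates, for illustration only.
latitude, longitude = 24.142317, -110.312689
print(round_coordinate(latitude), round_coordinate(longitude))
print("grid spacing of about", round(grid_spacing_km(), 2), "km")
```

At this precision, plotted points can sit up to roughly half a grid step from the true locality, which is adequate for range-wide maps but not for relocating exact collection sites.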

11_callisaurus_rangewide_raxml.zip

This package contains the inputs, outputs, and other files necessary to reproduce the RAxML analyses for the range-wide Callisaurus ddRAD data, presented in Figure 9 in the article. To learn more, see the RAxML website.

output_callisaurus_051917_n102_h5_p75.phy: The input file from pyrad
RAxML_bipartitions.output_callisaurus_051917_n102_h5_p75_raxml.tre: The output .tre file presented in Figure 9
raxml_callisaurus.log: Log file from the analysis
raxml-jobscript.job: Jobscript file used to run the analysis