Data from: The genomic basis of adaptive leaf variation in the Galápagos giant daisies

Bieker, Vanessa; Li, Siyu; Cerca, José; Battlay, Paul; Falahati Anbaran, Mohsen; Sharma, Amit; Jaramillo Díaz, Patricia; Fernández-Mazuecos, Mario; Ramos-Madrigal, Jazmín; Martin, Sarah L. F.; Santos-Bay, Luisa; Petersen, Gitte; Seberg, Ole; Vargas, Pablo; Nielsen, Rasmus; Gilbert, M. Thomas P.; Rivas-Torres, Gonzalo; Leebens-Mack, James; Rieseberg, Loren H.; Nielsen, Lene R.; Sinha, Neelima; Martin, Michael D.

Published Mar 23, 2026 on Dryad. https://doi.org/10.5061/dryad.j9kd51cr0

Data files

Mar 23, 2026 version files (3.78 GB total)

AllSpecies_LD.summary.zip: 982.75 MB
dxy_4foldSites_SpeciesInfo.csv: 2.50 KB
dxy.zip: 6.95 KB
LD_perSpecies.zip: 2.09 GB
Network_final.cys: 16.60 MB
only_4foldSites.zip: 294.61 MB
README.md: 21.94 KB
SourceData_Figure.xlsx: 394.49 MB
TajimasD.zip: 430.59 KB

Abstract

Scalesia (Asteraceae) is the largest endemic plant genus of the Galápagos archipelago and an example of adaptive radiation. While Scalesia species are highly varied in habit and morphology, most remarkable is their variety of leaf shapes, especially in the differential presence of leaf lobing/serration, a derived trait that evolved multiple times as a likely adaptation to the islands’ hot and dry equatorial climate. Using population-level genomic data from 396 individuals representing all 15 recognized Scalesia species, we characterize this young radiation (around 1 million years ago), and reveal that their substantial morphological divergence and ecological specialization are primarily based on shared genetic variation. To further elucidate the repeated adaptive evolution of leaf lobing in Scalesia, we integrate genomic and leaf morphometric data, with transcriptomes from different developmental stages, and conclude that leaf lobing evolved through diversifying selection. Natural selection occurs independently on different regulators in the pathway controlling the development of adaxial-abaxial leaf polarity, highlighting the importance of the founder populations’ high genetic diversity maintained via allopolyploidy. Finally, our findings have implications for the conservation of Scalesia’s threatened biodiversity, as unexpectedly high intra-specific genetic structure and long-term isolation among populations indicate widespread nascent speciation. This dataset contains files associated with the article. Specifically, it contains code and scripts used to analyse the data, the source data files for the main text and supplementary figures, and the Cytoscape file used for the transcriptomics analysis in the article. It also contains example input and output files to calculate dxy and genome-wide Tajima's D.

Dataset DOI: 10.5061/dryad.j9kd51cr0

Description of the data and file structure

Files and variables

File: ScalesiaCode.sh

Description: This file contains bash code that calls external programs used for the analysis. All command-line arguments are included. The program version used can either be found in this file or in the methods section of the manuscript. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: angsd v0.935, angsd v0.941, Plink v1.9, java 1.8, GATK v3.7-0, bedtools v2.30, PCangsd v0.98, NGSadmix32, bedtools v2.26.0, samtools v1.10, GATK v4.2.3, vcftools v0.1.17, picardtools v2.25.5, Dsuite, bcftools v1.10

File: Network_final.cys

Description: Cytoscape file used for the transcriptome network analysis.

File: dxy.zip

Description: Example input and output data to calculate dxy between species/population pairs. The input data is a small subset of the original data in order to facilitate running/testing the script on a normal laptop computer within a few minutes. The R script Calculate_dxy_subset.R was used to generate the example output. The R script is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: R v4.3.1

Variables in .mafs files:

chromo: Chromosome ID (same as in the fasta file used for mapping)
position: Position on the chromosome in base pairs (bp)
major: Major allele (C, T, A, or G) for given position
minor: Minor allele (C, T, A, or G) for given position
ref: Reference allele (C, T, A, or G) for given position
knownEM: Allele frequency estimated using the -doMaf 1 option in angsd
unknownEM: Allele frequency using -doMaf 2 option in angsd
nInd: Number of individuals with data for given position

Variables in .csv:

pop.A: .mafs input file for population A
pop.B: .mafs input file for population B
Global.dxy: Sum of dxy for population pair across all sites analysed
Global.per.site.Dxy: Global.dxy divided by the total number of sites analysed
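
As a rough illustration of how the two output columns relate, the sketch below sums a per-site dxy over the sites present in both populations and then divides by the number of those sites. It assumes the common per-site definition dxy = p_A(1 - p_B) + p_B(1 - p_A), with both frequencies referring to the same allele at each site. The function name and toy frequencies are hypothetical; Calculate_dxy_subset.R remains the authoritative implementation.

```python
def global_dxy(freqs_a, freqs_b):
    """Return (Global.dxy, Global.per.site.Dxy) for two populations.

    freqs_a / freqs_b map (chromo, position) -> allele frequency.
    Assumption: both frequencies refer to the same allele at each site.
    """
    # Keep only sites with data in both populations.
    shared = sorted(set(freqs_a) & set(freqs_b))
    total = 0.0
    for site in shared:
        p_a, p_b = freqs_a[site], freqs_b[site]
        # Probability that one allele drawn from each population differs.
        total += p_a * (1.0 - p_b) + p_b * (1.0 - p_a)
    per_site = total / len(shared) if shared else float("nan")
    return total, per_site


# Toy frequencies (hypothetical values, not taken from the dataset).
pop_a = {("chr1", 100): 0.0, ("chr1", 250): 0.5, ("chr1", 900): 1.0}
pop_b = {("chr1", 100): 1.0, ("chr1", 250): 0.5, ("chr2", 40): 0.2}

dxy_sum, dxy_per_site = global_dxy(pop_a, pop_b)
print(dxy_sum, dxy_per_site)
```

Only the two sites shared by both toy populations contribute here, which is why the per-site value depends on the number of sites remaining after merging.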

File: TajimasD.zip

Description: Folder containing example input and output data to calculate Tajima's D, theta, and pi. The R script TajimasD_nucleotideDiversity_subset.R, available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669), was used to generate the example output. The input data is a small subset of the original data in order to facilitate running/testing the script on a normal laptop computer within a few minutes.

Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2

Variables in .4foldSItes_print.sub:

Chromo: Chromosome ID (same as in the fasta file used for mapping)
Pos: Position on the chromosome in base pairs (bp)
Watterson: Watterson theta
Pairwise: Pairwise theta (nucleotide diversity)
thetaSingleton: theta (singleton category)
thetaH: thetaH
thetaL: thetaL

Variables in TajimasD_popInfo.csv:

species: species/population
sampleSize: Number of samples included in the analysis for a given species
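
The sample size matters because the normalising constants in Tajima's D depend on the number of sampled chromosomes. The sketch below applies the standard Tajima (1989) formula to summed Watterson and pairwise theta values. It is an illustration only and makes several assumptions: the thetas are sums across sites on the natural scale (some angsd versions may print per-site thetas on a log scale, in which case they would need exponentiating first), samples are treated as diploid so that the number of chromosomes is 2 x sampleSize, and the function name and input values are hypothetical.

```python
import math


def tajimas_d(theta_w, pi, n_chrom):
    """Tajima's D from summed Watterson theta, summed pi and chromosome count."""
    a1 = sum(1.0 / i for i in range(1, n_chrom))
    a2 = sum(1.0 / i**2 for i in range(1, n_chrom))
    b1 = (n_chrom + 1) / (3.0 * (n_chrom - 1))
    b2 = 2.0 * (n_chrom**2 + n_chrom + 3) / (9.0 * n_chrom * (n_chrom - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n_chrom + 2) / (a1 * n_chrom) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Watterson's theta is S / a1, so the implied number of segregating sites is:
    s = theta_w * a1
    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1.0))


sample_size = 10              # hypothetical number of individuals
n_chrom = 2 * sample_size     # assumption: diploid treatment of each sample
d = tajimas_d(theta_w=12.0, pi=9.0, n_chrom=n_chrom)
print(d)
```

A negative value, as in this toy case where pi is below Watterson's theta, reflects an excess of rare variants relative to the neutral expectation.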

Variables in TajimasD_genomeWide_output.csv:

species: species/population
sampleSize: Number of samples included in the analysis for a given species
Tajima: Tajima's D for the given species
Tajima_CI_2.5: Lower bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
Tajima_CI_97.5: Upper bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
thetaW_perSite: Per-site Watterson's theta
thetaW_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
thetaW_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
pi_perSite: Per-site estimate of pi
pi_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
pi_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
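
For readers unfamiliar with quantile-based bootstrap intervals, the following minimal sketch shows one way the CI_2.5 and CI_97.5 columns could be obtained for a per-site mean. It assumes that individual sites are resampled with replacement; the resampling unit actually used by TajimasD_nucleotideDiversity_subset.R may differ (for example windows), and the function name, seed, and toy values are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=1000, seed=42):
    """Quantile-based 95% CI of the mean from n_boot bootstrap replicates."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the values with replacement and record the mean of each replicate.
    boot_means = [statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)]
    # Splitting into 40 quantile groups gives cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(boot_means, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Toy per-site pi values (hypothetical, not taken from the dataset).
per_site_pi = [0.001, 0.004, 0.0, 0.002, 0.003, 0.0, 0.005, 0.001]
ci_low, ci_high = bootstrap_mean_ci(per_site_pi)
print(ci_low, ci_high)
```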

File: TajimasD_nucleotideDiversity_subset.R

Description: Script used to calculate Tajima's D, theta, and pi with the example input data in the TajimasD.zip folder, which contains a subset of the original data. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: R v4.3.1

File: Calculate_dxy_subset.R

Description: Script used to generate example output in the dxy folder, which contains a subset of the original data. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: R v4.3.1

File: Scalesia_plotLDdecay.R

Description: R script to calculate LD decay from the whole dataset containing all species. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2

File: Scalesia_plotLDdecay_perSpecies.R

Description: R script to calculate LD decay for each species separately. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2

File: TajimasD_theta_perPop.R

Description: R script to calculate Tajima's D, theta, and pi per population. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2, R package optparse v1.7.5

File: Plot_thetaPerPop.Rmd

Description: R markdown file to plot Tajima's D, theta, and pi per population. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2, R package ggpubr v0.6.1

File: only_4foldSites.zip

Description: Input files to calculate dxy for neutral sites (4-fold degenerate sites).

Variables in .mafs files:

chromo: Chromosome ID (same as in the fasta file used for mapping)
position: Position on the chromosome in base pairs (bp)
major: Major allele (C, T, A, or G) for given position
minor: Minor allele (C, T, A, or G) for given position
ref: Reference allele (C, T, A, or G) for given position
knownEM: Allele frequency estimated using the -doMaf 1 option in angsd
unknownEM: Allele frequency using -doMaf 2 option in angsd
nInd: Number of individuals with data for given position

File: calculate_dxy_4foldSites.R

Description: R script to calculate dxy for all species pairs using only neutral sites (4-fold degenerate sites). This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).

Programs/tools used: R v4.3.1, R package ggplot2 v3.5.2

File: dxy_4foldSites_SpeciesInfo.csv

Description: Input file to be used with the R script calculate_dxy_4foldSites.R. The file contains empty columns that will be filled by the R script.

pop.A: .mafs input file for population A
pop.B: .mafs input file for population B
Global.dxy: Sum of dxy for population pair across all sites analysed (empty, to be estimated by calculate_dxy_4foldSites.R).
Global.per.site.Dxy..using.number.of.sites.after.merging.: Global.dxy divided by the total number of sites analysed (empty, to be estimated by calculate_dxy_4foldSites.R).
Global.per.site.Dxy..using.number.of.sites.in.the.sites.file.: Global.dxy divided by the total number of 4-fold sites in the genome (empty, to be estimated by calculate_dxy_4foldSites.R).

File: LD_perSpecies.zip

Description: Source data for the supplementary figures showing LD decay per species (Figures S31-S56). Each species is represented by a single txt file in this folder. The first column contains the distance in bp and the second column the r² value.

Variables in .ld.summary files:

Column 1: Distance between SNPs in base pairs [bp]
Column 2: r² for SNP pair
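
To give a sense of how the two columns can be turned into an LD decay curve, the sketch below groups SNP pairs into fixed-width distance bins and averages r² within each bin. The bin width, the use of a simple mean, and the function name are assumptions made for illustration; Scalesia_plotLDdecay_perSpecies.R may summarise or fit the decay differently.

```python
from collections import defaultdict


def ld_decay(pairs, bin_size=1000):
    """Mean r² per distance bin; `pairs` is an iterable of (distance_bp, r2)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance, r2 in pairs:
        # Label each bin by its lower bound in bp.
        bin_start = (distance // bin_size) * bin_size
        sums[bin_start] += r2
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy (distance, r²) pairs (hypothetical values, not taken from the dataset).
pairs = [(120, 0.9), (800, 0.7), (1500, 0.4), (1900, 0.2), (5200, 0.1)]
decay = ld_decay(pairs)
for bin_start, mean_r2 in decay.items():
    print(bin_start, round(mean_r2, 3))
```

Because the full files are large, streaming the rows through a function like this avoids holding every SNP pair in memory at once.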

File: AllSpecies_LD.summary.zip

Description: Source data for the supplementary figure showing LD decay for all samples combined (Figure S30).

Variables in .ld.summary files:

Column 1: Distance between SNPs in base pairs [bp]
Column 2: r² for SNP pair

File: SourceData_Figure.xlsx

Description: Source data for main text and supplementary figures. The raw data for each figure/figure panel is represented by a single sheet in this Excel document. As per the Dryad guidelines, geographic locations in this file have been rounded to one degree to protect endangered species/populations.

Column information for sheet "Figure 1c, S2-8":

This data has been used to generate maps with the sampling locations. The precision of the geographic locations in this file has been reduced by rounding to one degree to protect endangered species/populations.

Population/Herbarium sample: Population ID for the sample populations. In the case of herbarium samples where no population sample was obtained, the sample ID is given
Species: Name of the species for the given population/sample
Island: Name of the island the population/sample was collected from
Latitude & Longitude: Geographic location of the sampled population/sample. The precision of the geographic locations has been reduced in this file to protect endangered species/populations

Column information for sheet "Figure 2a":

This data has been used to generate a PCA plot of the genetic structure of the nuclear genome of Scalesia species.

sample ID: Sample name for DNA sample
PCX: Value along the given PC axis for a given sample based on the nuclear genome analysis

Column information for sheet "Figure 2b":

This data has been used to create an UpSet plot that shows the sharing of putatively selected genes between lobed-leaf Scalesia species.

species: Name of the lobed species for which F_ST outlier test was performed
gene: Gene IDs (as in the genome annotation gff file) for genes within/overlapping F_ST outlier windows for a given lobed species

Column information for sheet "Figure 3a":

This data has been used to generate a PCoA plot of the leaf-morphology measurements of different Scalesia species.

sampleID: sample ID for the leaf morphology samples
Species: Species name of the samples
population: Population ID of the sample
lobeness: Leaf lobing; "lobed" if a species has a lobed leaf phenotype, "unlobed" if it does not have a lobed leaf phenotype
PCX: Value along the given PC axis for a given sample based on the leaf morphology
distanceToMean (PC1-PC2): Distance to the mean value for a species/population based on PC1 and PC2

Column information for sheet "Figure 3b,c":

This data has been used to generate box plots of the leaf morphology measurements for lobed-leaved and unlobed Scalesia species.

LeafArea: sample ID for the leaf morphology samples
Population: Population ID of the given sample
Species: Species name of the given sample
Perimeter/BladeLength: Values for leaf morphology measurement "Perimeter" divided by "BladeLength".
Perimeter/LeafArea: Values for leaf morphology measurement "Perimeter" divided by "LeafArea".

Column information for sheet "Figure S1":

This data has been used to generate a histogram of the mean per-sample sequencing depth across the dataset.

sampleID: Sample ID for DNA sample
sequencing depth (S. atractyloides reference genome) after quality filtering (MAPQ 30): Mean sequencing depth for the given sample against the S. atractyloides reference genome after filtering out reads with a mapping quality (MAPQ) below 30

Column information for sheet "Figure S9":

This data has been used to generate a figure with the F_ST and D_xy values for each species/population pair in the spreadsheet. In the final figure, the lower diagonal shows the F_ST value and the upper diagonal the D_xy.

Species/population1: First species or population of the species/population pair used in the F_ST and D_xy estimate
Species/population2: Second species or population of the species/population pair used in the F_ST and D_xy estimate
global Fst (weighted): Genome-wide F_ST (autosomal chromosomes only) value, using the weighted value from the angsd output
Dxy: Estimate of D_xy for the given species pair (4-fold degenerate sites of autosomal chromosomes only)

Column information for sheet "Figure S10":

This data has been used to generate bar plots of Tajima's D, theta, and pi, including the 95% confidence interval, with one bar per species.

species: species name
sampleSize: Number of samples included in the analysis for a given species
Tajima: Tajima's D for given species
Tajima_CI_2.5: Lower bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
Tajima_CI_97.5: Upper bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
thetaW_perSite: Per-site Watterson's theta
thetaW_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
thetaW_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
pi_perSite: Per-site estimate of pi
pi_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
pi_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps

Column information for sheet "Figure S11":

This data has been used to generate bar plots of Tajima's D, theta, and pi, including the 95% confidence interval, with one bar per population.

population: Population ID
species: Species name
sampleSize: Number of samples included in the analysis for a given species
Tajima: Tajima's D for given species
Tajima_CI_2.5: Lower bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
Tajima_CI_97.5: Upper bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
thetaW_perSite: Per-site Watterson's theta
thetaW_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
thetaW_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
pi_perSite: Per-site estimate of pi
pi_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
pi_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps

Column information for sheets "Figure S1_KX":

This data has been used to generate bar plots showing the nuclear population structure within Scalesia. Assignment to a genetic cluster was estimated with NGSadmix. The final figure contains one row for each value of K from 2 to 20.

sampleID: Sample ID of DNA sample
population: Population ID for the given sample
species: Species name of the sample
Island: Island the sample was sampled on
VX: Admixture proportion to genetic cluster X, where X ranges from 1-K with K being the number of ancestral populations used in the NGSadmix model

Column information for sheet "Figure S14":

This data has been used to generate a figure showing a phylogenetic tree and the f_branch statistic. Higher f_branch values indicate greater excess allele sharing between branches.

branch: Branch ID of the tree
branch_descendants: Descendants from the given branch

Other column IDs: Tips of the tree. Values: f_branch statistic between the tip and the given branch

Column information for sheet "Figure S15-S24":

This data has been used to generate Manhattan plots with one panel showing the ZF_ST value between species pairs as well as panels showing Tajima's D and Fay and Wu's H for each species.

chrom: chromosome ID (1-34, ordered by chromosome length with chromosome 1 being the longest)
midPos: mid-position of the analysed window
Tajima'D (X): Tajima's D of the species X with lobed leaves in the given window, with X being S. retroflexa in Table S15, S. cfr. retroflexa in Table S16, S. helleri in Table S17, S. helleri (Santa Cruz population) in Table S18, S. helleri (Santa Fe population) in Table S19, S. incisa in Table S20, S. divisa in Table S21, S. divisa x incisa in Table S22, S. baurii ssp. hopkinsii in Table S23, and S. baurii ssp. baurii in Table S24
Fay and Wu's H (X): Fay and Wu's H of the species X with lobed leaves in the given window, with X being S. retroflexa in Table S15, S. cfr. retroflexa in Table S16, S. helleri in Table S17, S. helleri (Santa Cruz population) in Table S18, S. helleri (Santa Fe population) in Table S19, S. incisa in Table S20, S. divisa in Table S21, S. divisa x incisa in Table S22, S. baurii ssp. hopkinsii in Table S23, and S. baurii ssp. baurii in Table S24
Nucleotide Diversity (X): Nucleotide diversity of the species X with lobed leaves in the given window, with X being S. retroflexa in Table S15, S. cfr. retroflexa in Table S16, S. helleri in Table S17, S. helleri (Santa Cruz population) in Table S18, S. helleri (Santa Fe population) in Table S19, S. incisa in Table S20, S. divisa in Table S21, S. divisa x incisa in Table S22, S. baurii ssp. hopkinsii in Table S23, and S. baurii ssp. baurii in Table S24
Tajima's D (S. crockeri): Tajima's D of S. crockeri in the given window
Fay and Wu's H (S. crockeri): Fay and Wu's H of S. crockeri in the given window
Nucleotide Diversity (S. crockeri): Nucleotide Diversity of S. crockeri in the given window
Fst: F_ST in given window between S. crockeri and species X
ZFst: Z score of F_ST in given window between S. crockeri and species X
scaffold: Scaffold name as given in the reference fasta file
start pos: Start position of analysed window
stop pos: Stop position of analysed window
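
The relationship between the Fst and ZFst columns can be sketched as a standard Z transformation of the windowed F_ST values. This is a minimal illustration that assumes standardisation across all windows of one species comparison using the population standard deviation; whether the authors used the sample or population standard deviation, or standardised per chromosome, is not stated here. The function name and toy values are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardise values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]


# Toy windowed F_ST values (hypothetical, not taken from the dataset).
fst = [0.10, 0.20, 0.30, 0.40, 1.00]
zfst = z_scores(fst)
print([round(z, 2) for z in zfst])
```

Windows with unusually high ZFst are the kind of candidates that an outlier scan would flag for closer inspection.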

Column information for sheet "Figure S25":

This data has been used to generate a figure showing significantly enriched GO terms of putatively selected genes within lobed-leaved Scalesia species.

GO.ID: ID of the GO term
Term: Description of given GO.ID
Annotated: Number of annotated genes with the given GO term
Significant: Number of genes associated with leaf development that are annotated with given GO annotation and are in F_ST outlier windows in comparisons between lobed and unlobed species
Expected: Number of genes with the given GO annotation expected by chance to be in F_ST outlier windows
p-value: p-value of Fisher's exact test to assess if given GO term is significantly enriched in F_ST outlier windows
Gene ratio: Number of leaf development genes in F_ST outlier windows associated with a given GO term (column "Significant") divided by the total number of leaf development genes in F_ST outlier windows (43)
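
The sketch below shows how the Expected, p-value, and Gene ratio columns could relate to the Annotated and Significant counts, assuming a classic one-sided Fisher's exact test (the upper tail of the hypergeometric distribution). The gene universe size and the counts for the example term are hypothetical; only the total of 43 leaf development genes in F_ST outlier windows comes from the description above. The enrichment tool and algorithm actually used by the authors may differ.

```python
from math import comb


def go_enrichment(annotated, significant, n_selected, n_universe):
    """Return (expected, gene_ratio, p_value) for one GO term.

    annotated:   genes in the universe carrying the GO term
    significant: selected genes carrying the GO term
    n_selected:  total number of selected genes
    n_universe:  total number of genes tested (hypothetical here)
    """
    expected = annotated * n_selected / n_universe
    gene_ratio = significant / n_selected
    upper = min(annotated, n_selected)
    # P(X >= significant) under the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(n_universe - annotated, n_selected - k)
        for k in range(significant, upper + 1)
    )
    p_value = tail / comb(n_universe, n_selected)
    return expected, gene_ratio, p_value


expected, gene_ratio, p_value = go_enrichment(
    annotated=10, significant=5, n_selected=43, n_universe=1000
)
print(expected, gene_ratio, p_value)
```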

Column information for sheet "Figure S56":

This data has been used to generate a barplot showing the genetic structure of Scalesia stewartii, Scalesia atractyloides and S. atractyloides x stewartii hybrids. Assignment to genetic clusters has been estimated with NGSadmix for K=2.

sampleID: Sample ID of DNA sample
population: Population ID of DNA sample
Island: Island the sample was collected from
species: Species name of DNA sample
V1: Assignment of DNA sample to genetic cluster 1 in the NGSadmix analysis, only containing S. stewartii, S. atractyloides and S. atractyloides x stewartii
V2: Assignment of DNA sample to genetic cluster 2 in the NGSadmix analysis, only containing S. stewartii, S. atractyloides and S. atractyloides x stewartii
