Data from: The genomic basis of adaptive leaf variation in the Galápagos giant daisies
Data files
Mar 23, 2026 version, 3.78 GB in total:
- AllSpecies_LD.summary.zip (982.75 MB)
- dxy_4foldSites_SpeciesInfo.csv (2.50 KB)
- dxy.zip (6.95 KB)
- LD_perSpecies.zip (2.09 GB)
- Network_final.cys (16.60 MB)
- only_4foldSites.zip (294.61 MB)
- README.md (21.94 KB)
- SourceData_Figure.xlsx (394.49 MB)
- TajimasD.zip (430.59 KB)
Abstract
Scalesia (Asteraceae) is the largest endemic plant genus of the Galápagos archipelago and an example of adaptive radiation. While Scalesia species are highly varied in habit and morphology, most remarkable is their variety of leaf shapes, especially in the differential presence of leaf lobing/serration, a derived trait that evolved multiple times as a likely adaptation to the islands’ hot and dry equatorial climate. Using population-level genomic data from 396 individuals representing all 15 recognized Scalesia species, we characterize this young radiation (around 1 million years ago), and reveal that their substantial morphological divergence and ecological specialization are primarily based on shared genetic variation. To further elucidate the repeated adaptive evolution of leaf lobing in Scalesia, we integrate genomic and leaf morphometric data, with transcriptomes from different developmental stages, and conclude that leaf lobing evolved through diversifying selection. Natural selection occurs independently on different regulators in the pathway controlling the development of adaxial-abaxial leaf polarity, highlighting the importance of the founder populations’ high genetic diversity maintained via allopolyploidy. Finally, our findings have implications for the conservation of Scalesia’s threatened biodiversity, as unexpectedly high intra-specific genetic structure and long-term isolation among populations indicate widespread nascent speciation. This dataset contains files associated with the article. Specifically, it contains code and scripts used to analyse the data, the source data files for the main text and supplementary figures, and the Cytoscape file used for the transcriptomics analysis in the article. It also contains example input and output files to calculate dxy and genome-wide Tajima's D.
Dataset DOI: 10.5061/dryad.j9kd51cr0
Description of the data and file structure
Files and variables
File: ScalesiaCode.sh
Description: This file contains bash code that calls external programs used for the analysis. All command-line arguments are included. The program version used can either be found in this file or in the methods section of the manuscript. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: angsd v0.935, angsd v0.941, Plink v1.9, java 1.8, GATK v3.7-0, bedtools v2.30, PCangsd v0.98, NGSadmix32, bedtools v2.26.0, samtools v1.10, GATK v4.2.3, vcftools v0.1.17, picardtools v2.25.5, Dsuite, bcftools v1.10
File: Network_final.cys
Description: Cytoscape file used for the transcriptome network analysis.
File: dxy.zip
Description: Example input and output data to calculate dxy between species/population pairs. The input data is a small subset of the original data in order to facilitate running/testing the script on a normal laptop computer within a few minutes. The R script Calculate_dxy_subset.R was used to generate the example output. The R script is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: R v4.3.1
Variables in .mafs files:
- chromo: Chromosome ID (same as in the fasta file used for mapping)
- position: Position on the chromosome in base pairs (bp)
- major: Major allele (C, T, A, or G) for given position
- minor: Minor allele (C, T, A, or G) for given position
- ref: Reference allele (C, T, A, or G) for given position
- knownEM: Allele frequency using -doMaf 1 option in angsd
- unknownEM: Allele frequency using -doMaf 2 option in angsd
- nInd: Number of individuals with data for given position
Variables in .csv:
- pop.A: .mafs input file for population A
- pop.B: .mafs input file for population B
- Global.dxy: Sum of dxy for population pair across all sites analysed
- Global.per.site.Dxy: Global.dxy divided by the total number of sites analysed
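The relationship between the .mafs input columns and the two output columns can be illustrated with a minimal Python sketch. It assumes the commonly used per-site estimator dxy = pA(1 - pB) + pB(1 - pA), that both populations report the frequency of the same allele at each site, and that only sites present in both files are analysed. The function names are hypothetical, and this is not the code of Calculate_dxy_subset.R, which remains the reference implementation.

```python
import csv
import io


def read_mafs(handle, freq_column="knownEM"):
    """Read a tab-separated .mafs table into {(chromo, position): frequency}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["chromo"], int(row["position"])): float(row[freq_column])
        for row in reader
    }


def global_dxy(mafs_a, mafs_b):
    """Return (Global.dxy, Global.per.site.Dxy) over sites shared by both tables.

    Assumption: the frequency refers to the same allele in both populations.
    """
    shared = mafs_a.keys() & mafs_b.keys()
    total = sum(
        mafs_a[s] * (1 - mafs_b[s]) + mafs_b[s] * (1 - mafs_a[s]) for s in shared
    )
    per_site = total / len(shared) if shared else float("nan")
    return total, per_site


# Toy input with the column layout described above (values are made up).
header = "chromo\tposition\tmajor\tminor\tref\tknownEM\tnInd\n"
pop_a = io.StringIO(header + "chr1\t10\tA\tG\tA\t0.0\t5\n" + "chr1\t20\tC\tT\tC\t0.5\t5\n")
pop_b = io.StringIO(header + "chr1\t10\tA\tG\tA\t1.0\t5\n" + "chr1\t20\tC\tT\tC\t0.5\t5\n")

dxy_sum, dxy_per_site = global_dxy(read_mafs(pop_a), read_mafs(pop_b))
print(dxy_sum, dxy_per_site)
```

In the toy data the first site is a fixed difference (contributing 1) and the second is shared at frequency 0.5 in both populations (contributing 0.5).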
File: TajimasD.zip
Description: Folder containing example input and output data to calculate Tajima's D, theta, and pi. The R script TajimasD_nucleotideDiversity_subset.R, available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669), was used to generate the example output. The input data is a small subset of the original data in order to facilitate running/testing the script on a normal laptop computer within a few minutes.
Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2
Variables in .4foldSItes_print.sub:
- Chromo: Chromosome ID (same as in the fasta file used for mapping)
- Pos: Position on the chromosome in base pairs (bp)
- Watterson: Watterson theta
- Pairwise: Pairwise theta (nucleotide diversity)
- thetaSingleton: theta (singleton category)
- thetaH: thetaH
- thetaL: thetaL
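As an illustration of how the Watterson and Pairwise columns relate to Tajima's D, the following Python sketch applies the standard Tajima (1989) formula to the sums of both estimators over all sites. It makes several assumptions that should be checked against TajimasD_nucleotideDiversity_subset.R: the per-site values are on a plain (not log) scale, which can depend on the angsd version; the number of sampled chromosomes is supplied by the user (for example, twice the sample size under diploid genotype calls); and the function name is hypothetical.

```python
import math


def tajimas_d(theta_w_sum, pi_sum, n_chromosomes):
    """Tajima's D from summed Watterson's theta and summed pairwise theta."""
    n = n_chromosomes
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Watterson's theta equals S / a1, so the (expected) number of
    # segregating sites S can be recovered from the summed estimate.
    s = theta_w_sum * a1
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return float("nan")  # too little variation to compute D
    return (pi_sum - theta_w_sum) / math.sqrt(variance)


# Worked example with 10 chromosomes and 16 segregating sites (made-up pi).
a1_example = sum(1 / i for i in range(1, 10))
print(tajimas_d(16 / a1_example, 3.888889, 10))
```

Equal values of the two estimators give D = 0, and a pairwise theta below Watterson's theta gives a negative D.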
Variables in TajimasD_popInfo.csv:
- species: species/population
- sampleSize: Number of samples included in the analysis for a given species
Variables in TajimasD_genomeWide_output.csv:
- species: species/population
- sampleSize: Number of samples included in the analysis for a given species
- Tajima: Tajima's D for the given species
- Tajima_CI_2.5: Lower bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
- Tajima_CI_97.5: Upper bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
- thetaW_perSite: Per-site Watterson's theta
- thetaW_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
- thetaW_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
- pi_perSite: Per-site estimate of pi
- pi_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
- pi_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
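A quantile-based bootstrap confidence interval of the kind reported in the _CI_2.5 and _CI_97.5 columns can be sketched as follows. This Python example assumes that individual sites are the resampling unit and that the statistic is the mean of the per-site values; the actual script may resample differently (for example, in blocks), so the code is an illustration only. The function names and toy values are hypothetical.

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation quantile (q between 0 and 1) of a sorted list."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * (pos - lower)


def bootstrap_ci(per_site_values, n_boot=1000, seed=1):
    """Return the 2.5% and 97.5% quantiles of the bootstrapped per-site mean."""
    rng = random.Random(seed)
    n = len(per_site_values)
    # Resample sites with replacement and recompute the mean each time.
    means = sorted(
        sum(rng.choices(per_site_values, k=n)) / n for _ in range(n_boot)
    )
    return percentile(means, 0.025), percentile(means, 0.975)


# Toy per-site pi values (made up for illustration).
toy_pi = [0.001, 0.002, 0.003, 0.004] * 50
ci_low, ci_high = bootstrap_ci(toy_pi)
print(ci_low, ci_high)
```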
File: TajimasD_nucleotideDiversity_subset.R
Description: Script used to calculate Tajima's D, theta, and pi with the example input data in the TajimasD.zip folder, which contains a subset of the original data. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: R v4.3.1
File: Calculate_dxy_subset.R
Description: Script used to generate example output in the dxy folder, which contains a subset of the original data. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: R v4.3.1
File: Scalesia_plotLDdecay.R
Description: R script to calculate LD decay from the whole dataset containing all species. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2
File: Scalesia_plotLDdecay_perSpecies.R
Description: R script to calculate LD decay for each species separately. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2
File: TajimasD_theta_perPop.R
Description: R script to calculate Tajima's D, theta, and pi per population. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2, R package optparse v1.7.5
File: Plot_thetaPerPop.Rmd
Description: R markdown file to plot Tajima's D, theta, and pi per population. This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: R v4.3.1, R package dplyr v1.1.4, R package ggplot2 v3.5.2, R package ggpubr v0.6.1
File: only_4foldSites.zip
Description: Input files to calculate dxy for neutral sites (4-fold degenerate sites).
Variables in .mafs files:
- chromo: Chromosome ID (same as in the fasta file used for mapping)
- position: Position on the chromosome in base pairs (bp)
- major: Major allele (C, T, A, or G) for given position
- minor: Minor allele (C, T, A, or G) for given position
- ref: Reference allele (C, T, A, or G) for given position
- knownEM: Allele frequency using -doMaf 1 option in angsd
- unknownEM: Allele frequency using -doMaf 2 option in angsd
- nInd: Number of individuals with data for given position
File: calculate_dxy_4foldSites.R
Description: R script to calculate dxy for all species pairs using only neutral sites (4-fold degenerate sites). This file is available in the linked Zenodo publication (https://doi.org/10.5281/zenodo.18865669).
Programs/tools used: R v4.3.1, R package ggplot2 v3.5.2
File: dxy_4foldSites_SpeciesInfo.csv
Description: Input file to be used with the R script calculate_dxy_4foldSites.R. The file contains empty columns that will be filled by the R script.
- pop.A: .mafs input file for population A
- pop.B: .mafs input file for population B
- Global.dxy: Sum of dxy for population pair across all sites analysed (empty, to be estimated by calculate_dxy_4foldSites.R).
- Global.per.site.Dxy..using.number.of.sites.after.merging.: Global.dxy divided by the total number of sites analysed (empty, to be estimated by calculate_dxy_4foldSites.R).
- Global.per.site.Dxy..using.number.of.sites.in.the.sites.file.: Global.dxy divided by the total number of 4-fold sites in the genome (empty, to be estimated by calculate_dxy_4foldSites.R).
File: LD_perSpecies.zip
Description: Source data for the supplementary figures showing LD decay per species (Figures S31-S56). Each species is represented by a single txt file in this folder. The first column contains the distance in bp and the second column the r2 value.
Variables in .ld.summary files:
- Column 1: Distance between SNPs in base pairs [bp]
- Column 2: r2 for SNP pair
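One common way to summarise such two-column data as an LD decay curve is to average r2 within distance bins. The Python sketch below illustrates this under the assumptions that the file is whitespace-delimited, that any non-numeric header line can be skipped, and that a 1,000 bp bin width is suitable; the bin width and function name are hypothetical and are not taken from Scalesia_plotLDdecay_perSpecies.R.

```python
import io
from collections import defaultdict


def ld_decay(handle, bin_size=1000):
    """Return {bin start (bp): mean r2} for a two-column distance/r2 table."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            distance, r2 = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # skip a header or malformed line
        bin_start = int(distance // bin_size) * bin_size
        sums[bin_start] += r2
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy input (made-up values): two SNP pairs in the first bin, one in the second.
toy_ld = io.StringIO("100 0.8\n900 0.6\n1500 0.3\n")
decay = ld_decay(toy_ld)
print(decay)
```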
File: AllSpecies_LD.summary.zip
Description: Source data for the supplementary figure showing LD decay for all samples combined (Figure S30).
Variables in .ld.summary files:
- Column 1: Distance between SNPs in base pairs [bp]
- Column 2: r2 for SNP pair
File: SourceData_Figure.xlsx
Description: Source data for main text and supplementary figures. The raw data for each figure/figure panel is represented by a single sheet in this Excel document. As per the Dryad guidelines, geographic locations in this file have been rounded to one degree to protect endangered species/populations.
Column information for sheet "Figure 1c, S2-8":
This data has been used to generate maps with the sampling locations. The precision of the geographic locations in this file has been reduced by rounding to one degree to protect endangered species/populations.
- Population/Herbarium sample: Population ID for the sample populations. In the case of herbarium samples where no population sample was obtained, the sample ID is given
- Species: Name of the species for the given population/sample
- Island: Name of the island the population/sample was collected from
- Latitude & Longitude: Geographic location of the sampled population/sample. The precision of the geographic locations has been reduced in this file to protect endangered species/populations
Column information for sheet "Figure 2a":
This data has been used to generate a PCA plot of the genetic structure of the nuclear genome of Scalesia species.
- sample ID: Sample name for DNA sample
- PCX: Value along the given PC axis for a given sample based on the nuclear genome analysis
Column information for sheet "Figure 2b":
This data has been used to create an UpSet plot that shows the sharing of putatively selected genes between lobed-leaf Scalesia species.
- species: Name of the lobed species for which FST outlier test was performed
- gene: Gene IDs (as in the genome annotation gff file) for genes within/overlapping FST outlier windows for a given lobed species
Column information for sheet "Figure 3a":
This data has been used to generate a PCoA plot of the leaf-morphology measurements of different Scalesia species.
- sampleID: sample ID for the leaf morphology samples
- Species: Species name of the samples
- population: Population ID of the sample
- lobeness: Leaf lobing; "lobed" if a species has a lobed leaf phenotype, "unlobed" if it does not have a lobed leaf phenotype
- PCX: Value along the given PC axis for a given sample based on the leaf morphology
- distanceToMean (PC1-PC2): Distance to the mean value for a species/population based on PC1 and PC2
Column information for sheet "Figure 3b,c":
This data has been used to generate box plots of the leaf morphology measurements for lobed-leaved and unlobed Scalesia species.
- LeafArea: sample ID for the leaf morphology samples
- Population: Population ID of the given sample
- Species: Species name of the given sample
- Perimeter/BladeLength: Values for leaf morphology measurement "Perimeter" divided by "BladeLength".
- Perimeter/LeafArea: Values for leaf morphology measurement "Perimeter" divided by "LeafArea".
Column information for sheet "Figure S1":
This data has been used to generate a histogram of the mean per-sample sequencing depth across the dataset.
- sampleID: Sample ID for DNA sample
- sequencing depth (S. atractyloides reference genome) after quality filtering (MAPQ 30): Mean sequencing depth for the given sample against the S. atractyloides reference genome after filtering out reads with a mapping quality (MAPQ) below 30
Column information for sheet "Figure S9":
This data has been used to generate a figure with the FST and Dxy values for each species/population pair in the spreadsheet. In the final figure, the lower diagonal shows the FST value and the upper diagonal the Dxy.
- Species/population1: First species or population of the species/population pair used in the FST and Dxy estimate
- Species/population2: Second species or population of the species/population pair used in the FST and Dxy estimate
- global Fst (weighted): Genome-wide FST (autosomal chromosomes only) value, using the weighted value from the angsd output
- Dxy: Estimate of Dxy for the given species pair (4-fold degenerate sites of autosomal chromosomes only)
Column information for sheet "Figure S10":
This data has been used to generate bar plots of Tajima's D, theta, and pi, including the 95% confidence interval, with one bar per species.
- species: species name
- sampleSize: Number of samples included in the analysis for a given species
- Tajima: Tajima's D for given species
- Tajima_CI_2.5: Lower bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
- Tajima_CI_97.5: Upper bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
- thetaW_perSite: Per-site Watterson's theta
- thetaW_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
- thetaW_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
- pi_perSite: Per-site estimate of pi
- pi_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
- pi_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
Column information for sheet "Figure S11":
This data has been used to generate bar plots of Tajima's D, theta, and pi, including the 95% confidence interval, with one bar per population.
- population: Population ID
- species: Species name
- sampleSize: Number of samples included in the analysis for a given species
- Tajima: Tajima's D for given species
- Tajima_CI_2.5: Lower bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
- Tajima_CI_97.5: Upper bound of the quantile-based 95% confidence interval for Tajima's D, based on 1,000 bootstraps
- thetaW_perSite: Per-site Watterson's theta
- thetaW_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
- thetaW_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site Watterson's theta, based on 1,000 bootstraps
- pi_perSite: Per-site estimate of pi
- pi_perSite_CI_2.5: Lower bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
- pi_perSite_CI_97.5: Upper bound of the quantile-based 95% confidence interval for the per-site estimate of pi, based on 1,000 bootstraps
Column information for sheets "Figure S1_KX":
This data has been used to generate bar plots showing the nuclear population structure within Scalesia. Assignment to a genetic cluster was estimated with NGSadmix. The final figure contains one row for each value of K from 2 to 20.
- sampleID: Sample ID of DNA sample
- population: Population ID for the given sample
- species: Species name of the sample
- Island: Island the sample was sampled on
- VX: Admixture proportion to genetic cluster X, where X ranges from 1-K with K being the number of ancestral populations used in the NGSadmix model
Column information for sheet "Figure S14":
This data has been used to generate a figure showing a phylogenetic tree and the fbranch statistic. Higher fbranch values indicate greater excess allele sharing between branches.
- branch: Branch ID of the tree
- branch_descendants: Descendants from the given branch
Other column IDs: Tips of the tree. Values: fbranch statistic between the tip and the given branch
Column information for sheet "Figure S15-S24":
This data has been used to generate Manhattan plots with one panel showing the ZFST value between species pairs as well as panels showing Tajima's D and Fay and Wu's H for each species.
- chrom: chromosome ID (1-34, ordered by chromosome length with chromosome 1 being the longest)
- midPos: mid-position of the analysed window
- Tajima'D (X): Tajima's D of the species X with lobed leaves in the given window, with X being S. retroflexa in Table S15, S. cfr. retroflexa in Table S16, S. helleri in Table S17, S. helleri (Santa Cruz population) in Table S18, S. helleri (Santa Fe population) in Table S19, S. incisa in Table S20, S. divisa in Table S21, S. divisa x incisa in Table S22, S. baurii ssp. hopkinsii in Table S23, and S. baurii ssp. baurii in Table S24
- Fay and Wu's H (X): Fay and Wu's H of the species X with lobed leaves in the given window, with X being S. retroflexa in Table S15, S. cfr. retroflexa in Table S16, S. helleri in Table S17, S. helleri (Santa Cruz population) in Table S18, S. helleri (Santa Fe population) in Table S19, S. incisa in Table S20, S. divisa in Table S21, S. divisa x incisa in Table S22, S. baurii ssp. hopkinsii in Table S23, and S. baurii ssp. baurii in Table S24
- Nucleotide Diversity (X): Nucleotide diversity of the species X with lobed leaves in the given window, with X being S. retroflexa in Table S15, S. cfr. retroflexa in Table S16, S. helleri in Table S17, S. helleri (Santa Cruz population) in Table S18, S. helleri (Santa Fe population) in Table S19, S. incisa in Table S20, S. divisa in Table S21, S. divisa x incisa in Table S22, S. baurii ssp. hopkinsii in Table S23, and S. baurii ssp. baurii in Table S24
- Tajima's D (S. crockeri): Tajima's D of S. crockeri in the given window
- Fay and Wu's H (S. crockeri): Fay and Wu's H of S. crockeri in the given window
- Nucleotide Diversity (S. crockeri): Nucleotide Diversity of S. crockeri in the given window
- Fst: FST in given window between S. crockeri and species X
- ZFst: Z score of FST in given window between S. crockeri and species X
- scaffold: Scaffold name as given in the reference fasta file
- start pos: Start position of analysed window
- stop pos: Stop position of analysed window
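The ZFst column is a Z-transformation of the windowed FST values, which can be sketched in Python as shown below. The sketch assumes that the mean and the sample standard deviation are taken over all windows passed to the function; whether the original analysis standardised genome-wide or per chromosome, and which standard deviation it used, is not stated here, so treat these choices and the function name as assumptions.

```python
import statistics


def z_transform(fst_values):
    """Return (value - mean) / standard deviation for each windowed FST value."""
    mean = statistics.fmean(fst_values)
    sd = statistics.stdev(fst_values)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("all FST values are identical; Z-scores are undefined")
    return [(value - mean) / sd for value in fst_values]


# Toy windowed FST values (made up for illustration).
toy_fst = [0.1, 0.2, 0.3]
toy_zfst = z_transform(toy_fst)
print(toy_zfst)
```

Windows with a high ZFst are those whose FST lies far above the average differentiation between S. crockeri and species X.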
Column information for sheet "Figure S25":
This data has been used to generate a figure showing significantly enriched GO terms of putatively selected genes within lobed-leaved Scalesia species.
- GO.ID: ID of the GO term
- Term: Description of given GO.ID
- Annotated: Number of annotated genes with the given GO term
- Significant: Number of genes associated with leaf development that are annotated with given GO annotation and are in FST outlier windows in comparisons between lobed and unlobed species
- Expected: Number of genes with the given GO annotation expected by chance to be in FST outlier windows
- p-value: p-value of Fisher's exact test to assess if given GO term is significantly enriched in FST outlier windows
- Gene ratio: Number of leaf development genes in FST outlier windows associated with a given GO term (column "Significant") divided by the total number of leaf development genes in FST outlier windows (43)
Column information for sheet "Figure S56":
This data has been used to generate a barplot showing the genetic structure of Scalesia stewartii, Scalesia atractyloides and S. atractyloides x stewartii hybrids. Assignment to genetic clusters has been estimated with NGSadmix for K=2.
- sampleID: Sample ID of DNA sample
- population: Population ID of DNA sample
- Island: Island the sample was collected from
- species: Species name of DNA sample
- V1: Assignment of DNA sample to genetic cluster 1 in the NGSadmix analysis, only containing S. stewartii, S. atractyloides and S. atractyloides x stewartii
- V2: Assignment of DNA sample to genetic cluster 2 in the NGSadmix analysis, only containing S. stewartii, S. atractyloides and S. atractyloides x stewartii
