Data from: Weak sperm differentiation in Darwin's finches
Data files
Oct 20, 2025 version files 97.19 MB
3paral_runs_combinedLog.txt (204.71 KB)
Constraints.txt (89 B)
Fig6_MaxCladeCredibility_tree.tre (8.04 KB)
HedgesG.txt (1.86 KB)
individuals5grps.txt (245 B)
log.errors.txt (167 B)
Morphometrics.txt (5.30 KB)
pPCA_Eigenvector_variance.xlsx (10.59 KB)
pPCAeigen.txt (6.42 KB)
Prep_and_run_snapp.txt (1.08 KB)
Prep_Run_admixture.txt (310 B)
Prun64_3miss2mafNoZ_1.vcf.gz (59.72 MB)
Prune1st2000_GQ30NoMissNoZ.vcf.gz (36.57 MB)
qmatrix_2_VAR.txt (5.52 KB)
qmatrix_3_VAR.txt (8.27 KB)
qmatrix_4_VAR.txt (11.01 KB)
qmatrix_5_VAR.txt (13.76 KB)
qmatrix_6_VAR.txt (16.50 KB)
qmatrix_7_VAR.txt (19.24 KB)
qmatrix_8_VAR.txt (21.99 KB)
README.md (9.85 KB)
script.R (24.77 KB)
SmartPCA.txt (1.64 KB)
snapp.xml (513.75 KB)
Sperm_length_Thraupidae.txt (3.72 KB)
Sperm_lengths.txt (10.07 KB)
tree2.tre (146 B)
TSL_5groups.csv (118 B)
Abstract
The submission contains data sets from a study of sperm evolution in eight species of Darwin's finches from two islands in the Galapagos archipelago. The data sets comprise morphological measurements, sperm size measurements, and genome-wide single-nucleotide polymorphisms (SNPs). The morphological measurements were used for species confirmation. Sperm size differentiation was analysed in a phylogenetic context, where a time-calibrated phylogeny for the species was constructed based on SNP data. The sperm size data were also used to infer the frequency of extrapair paternity. The results indicate that sperm size evolves much more slowly than beak and body size in this radiation, and similarly to other songbirds with moderate-to-low levels of extrapair paternity.
List of data files and the associated scripts that were used in the statistical tests and result presentations in the paper "Weak sperm differentiation in Darwin's finches".
Data files
Morphometrics.txt: Data file for the measurements of wing length and beak dimensions of Darwin's finches.
Species: species epithet (the suffixes "_L" and "_Z" indicate the island populations "San Cristobal" and "Santa Cruz", respectively)
Island: Name of island (in Galápagos archipelago)
AccNo: Accession number (in the DNA bank of Natural History Museum, University of Oslo) that links all data for an individual
Wing_length: Wing length (mm)
Beak_length: Beak length (from skull) (mm)
Beak_height: Beak height (at nostrils) (mm)
Beak_width: Beak width (at nostrils) (mm)
Sperm_lengths.txt: Data file for the sperm measurements of Darwin's finches.
Genus: Genus epithet
Species: Species epithet
Island: Name of island (in Galápagos archipelago)
AccNo: Accession number (in the DNA bank of Natural History Museum, University of Oslo) that links all data for an individual
TSL: Average total sperm length per individual (micrometer)
HL: Average sperm head length per individual (micrometer)
ML: Average sperm midpiece length per individual (micrometer)
TL: Average sperm tail length (midpiece-free distal end of flagellum) per individual (micrometer)
N: Number of sperm cells measured per individual
SDTSL: Standard deviation of total sperm lengths per individual (micrometer)
Sperm_length_Thraupidae.txt: Data file for measurements of total sperm length for species in the family Thraupidae (incl. Darwin's finches).
Species: Species name
TSL: Total sperm length per individual (micrometer)
Number: Species number (used for sorting of species in the phylogeny)
Prun64_3miss2mafNoZ_1.vcf.gz: Compressed vcf-file that contains 261 754 bi-allelic SNPs from 64 individuals (50 genomes from our own sequences and 14 genomes downloaded from NCBI-SRA). This is the input file for the phylogenetic PCA and the Admixture analyses. SNPs from the Z chromosome and from contigs smaller than 1000 bases were omitted from the genomes for SNP calling. Further details of SNP calling and filtering are given in the paper. For the PCA we used the program smartpca in the smartsnp package (https://christianhuber.github.io/smartsnp/). For the Admixture analysis we used the program ADMIXTURE (https://doi.org/10.1101/gr.094052.109).
pPCAeigen.txt: Data file with the eigenvalues for 10 principal components from the phylogenetic PCA (Figure 5AB in the paper) of 261 754 SNPs from 64 Darwin's finches.
SampleID: Sample identity. For the first 50 samples, the first six digits denote the Accession number (in the DNA bank of Natural History Museum, University of Oslo) and the last digits reflect the identity of the sequencing run.
Population: species name and island ("_L" is "San Cristobal", "_Z" is "Santa Cruz").
PC1...PC10: Sample scores on the 10 respective Principal Component axes.
pPCA_Eigenvector_variance.xlsx: Output file from the phylogenetic PCA giving the variance explained by each principal component.
#N: Principal component number ranked after the amount of variance explained
eigenvalue: the eigenvalue of the component
%variance: The percent of total variance explained by the component (given for the first 10 components only).
The following seven .txt-files are outputs from the ADMIXTURE analysis (Fig. 5C in the paper) that gives the Q matrix for K=2 to K=8, where "K" is the number of ancestral populations. The variables are:
ID: sample ID, where the first three letters denote species (first two letters) and island ("B" = "San Cristobal", "Z" = "Santa Cruz"), and the two digits denote the last two digits in the Accession number (see "pPCAeigen.txt")
Species: species epithet
Island: Island in the Galapagos archipelago ("L" = "San Cristobal", "Z" = "Santa Cruz")
Prob: probability of the sample belonging to an ancestral population
Variable: name of ancestral population
Pop: name of population (combination of Species and Island)
qmatrix_2_VAR.txt: Q matrix for K = 2 (2 ancestral populations)
qmatrix_3_VAR.txt: Q matrix for K = 3 (3 ancestral populations)
qmatrix_4_VAR.txt: Q matrix for K = 4 (4 ancestral populations)
qmatrix_5_VAR.txt: Q matrix for K = 5 (5 ancestral populations)
qmatrix_6_VAR.txt: Q matrix for K = 6 (6 ancestral populations)
qmatrix_7_VAR.txt: Q matrix for K = 7 (7 ancestral populations)
qmatrix_8_VAR.txt: Q matrix for K = 8 (8 ancestral populations)
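As a minimal sketch of how a Q matrix file could be summarised, the Python snippet below assigns each sample to its most probable ancestral population. It assumes the files are in long format (one row per sample and ancestral population, as the Prob and Variable columns suggest) and that they are tab-delimited; both are assumptions, and the helper names `load_rows` and `dominant_ancestry` are hypothetical, as are the example rows.

```python
import csv

def load_rows(path, delimiter="\t"):
    """Read a Q matrix file into a list of dicts (delimiter is an assumption)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))

def dominant_ancestry(rows):
    """Map each sample ID to the (Variable, Prob) pair with the highest Prob."""
    best = {}
    for row in rows:
        prob = float(row["Prob"])
        if row["ID"] not in best or prob > best[row["ID"]][1]:
            best[row["ID"]] = (row["Variable"], prob)
    return best

# Hypothetical rows for illustration only; not taken from the data files.
example_rows = [
    {"ID": "xxZ01", "Prob": "0.90", "Variable": "pop1"},
    {"ID": "xxZ01", "Prob": "0.10", "Variable": "pop2"},
    {"ID": "yyZ02", "Prob": "0.25", "Variable": "pop1"},
    {"ID": "yyZ02", "Prob": "0.75", "Variable": "pop2"},
]
assignments = dominant_ancestry(example_rows)
```

With a real file, `dominant_ancestry(load_rows("qmatrix_5_VAR.txt"))` would be the corresponding call, provided the delimiter assumption holds.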
log.errors.txt: output file from the ADMIXTURE analysis giving the cross-validation (CV) errors for each number of ancestral populations ("K"). These values were plotted in Fig 5D in the paper and show the lowest CV error for K = 5.
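The selection of K by lowest CV error can be sketched in Python as follows. The snippet assumes that each line of the file follows the format ADMIXTURE prints when run with cross-validation, e.g. "CV error (K=5): 0.5"; the actual layout of log.errors.txt is not described here, so this is an assumption. The function names are hypothetical and the example values are invented for illustration.

```python
import re

# Assumed line format (ADMIXTURE cross-validation output): "CV error (K=5): 0.5"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def parse_cv_errors(text):
    """Return {K: cross-validation error} for every matching line."""
    return {int(k): float(err) for k, err in CV_LINE.findall(text)}

def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)

# Hypothetical values for illustration only; not taken from log.errors.txt.
example_text = """CV error (K=2): 0.60
CV error (K=3): 0.55
CV error (K=4): 0.52
CV error (K=5): 0.50
CV error (K=6): 0.53
"""
example_errors = parse_cv_errors(example_text)
example_best = best_k(example_errors)
```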
Prune1st2000_GQ30NoMissNoZ.vcf.gz: compressed vcf-file that was the input file for the SNAPP analysis. SNAPP v. 1.6.1 (https://doi.org/10.1093/molbev/mss086) is an add-on package for the program BEAST2 v.2.7.3 (https://doi.org/10.1371/journal.pcbi.1006650). We used two individuals from each of the five ancestral populations identified by the Admixture analysis to reduce the computational load. We also employed stricter filtering (than for the PCA and Admixture analysis) to ensure high quality SNPs (details given in the paper).
Fig6_MaxCladeCredibility_tree.tre: The tree file visualized in Fig 6A in the paper. The tree file can be visualized with the FigTree software (http://github.com/rambaut/figtree/).
tree2.tre: The single MCC tree from Figure 6A for the construction of the phenogram in Figure 6B. The tree file can be visualized with the FigTree software (http://github.com/rambaut/figtree/).
TSL_5groups.csv: Trait file, giving the average total sperm length for the five ancestral populations identified by the Admixture analysis, used for construction of the phenogram in Figure 6B in the paper.
HedgesG.txt: Data file for the construction of Figure 7. Data for population pairs outside Galapagos originate from a previous paper (https://doi.org/10.1093/evolut/qpae154).
PairNo: Running number for the population pairs
Taxon: Species or genus name for the population pair
Group: Identifier for Darwin's finch population pairs (= "Galapagos") or other population pairs (= "NONGalapagos")
DivTime_year: estimated divergence time for the population pair (years)
Gen_time: generation time (years); data extracted from the GenLength column of Supplementary Table 4 in https://doi.org/10.1111/cobi.13486.
EPY: proportion of extrapair young
Hedgesg: the divergence in total sperm length between the two populations, calculated with the "Hedges' g" metric from the data in "Sperm_lengths.txt".
3paral_runs_combinedLog.txt: Combined log file from 3 parallel SNAPP runs for the MCC tree (Figure 6A in paper). The log files were combined using LogCombiner (a utility program within BEAST2), and can be read and checked in Tracer v 1.7.2 (http://beast.bio.ed.ac.uk/Tracer). This procedure was used to ensure that the effective sample sizes of all model parameters were >200 for each completed run.
Sample: sample number for each tree (every 25 000th tree sampled)
posterior: model parameter "posterior"
likelihood: model parameter "likelihood"
prior: model parameter "prior"
lambda: model parameter "lambda"
treeHeightLogger: model parameter "treeHeightLogger"
clockRate: model parameter "clockRate"
individuals5grps.txt: defines the individuals used in the SNAPP analysis.
species: name of groups of species used in the SNAPP analysis
individual: Sample identity, refers to the "SampleID" variable in "pPCAeigen.txt".
Constraints.txt: defines the constraints for the time divergence estimates in the SNAPP analysis.
Analytical code
script.R: the R script used for data processing, statistical analyses and figure construction. We used R v. 4.4.2 (https://www.r-project.org/).
SmartPCA.txt: script for the PCA analysis of SNPs. The script requires loading VCFtools v. 0.1.16 (https://doi.org/10.1093/bioinformatics/btr330), BCFtools v. 1.17 (https://doi.org/10.1093/gigascience/giab008) and EIGENSOFT v. 7.2.1 (https://bear-apps.bham.ac.uk/applications/2020a/EIGENSOFT/7.2.1-foss-2020a/).
Prep_Run_admixture.txt: script for running ADMIXTURE. The script requires loading VCFtools v. 0.1.16 (https://doi.org/10.1093/bioinformatics/btr330), plink v. 1.90 (https://www.cog-genomics.org/plink/), and Admixture v. 1.3.0 (https://doi.org/10.1101/gr.094052.109).
Prep_and_run_snapp.txt: script to prepare the vcf-file ("Prune1st2000_GQ30NoMissNoZ.vcf") for SNAPP and run SNAPP in BEAST2.
snapp.xml: the input script file (following the snapp_prep.rb script in https://doi.org/10.1093/sysbio/syy006) to run SNAPP in BEAST2.
The field work was carried out in March 2023 on the islands of San Cristobal and Santa Cruz. A total of 106 male birds were sampled for blood and sperm, of which 101 had detectable sperm. Sperm samples were collected by cloacal massage. All birds were released unharmed after sampling and measurements of their wing length and beak dimensions. Sperm samples were fixed in formalin and later photographed at 320x by bright-field microscopy for measurements (10 sperm cells per male). Mean total sperm length per male was used to calculate mean and SD for the populations and pairwise divergences using Hedges' g (Hedges 1981).
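For readers unfamiliar with the metric, the following Python sketch shows one common way to compute Hedges' g from two lists of per-male mean total sperm lengths. It uses the pooled standard deviation and the usual approximation to the small-sample correction factor; whether script.R uses this approximation or the exact form is not stated here, so treat it as an illustration rather than the authors' implementation. The function name and the example values are hypothetical.

```python
import math
from statistics import mean, stdev

def hedges_g(sample_a, sample_b):
    """Bias-corrected standardized mean difference between two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled variance, weighting each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * stdev(sample_a) ** 2 + (n_b - 1) * stdev(sample_b) ** 2
    ) / (n_a + n_b - 2)
    d = (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled_var)
    # Approximate small-sample correction: 1 - 3 / (4 * (n_a + n_b) - 9).
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)
    return d * correction

# Hypothetical per-male mean lengths (micrometer), for illustration only.
population_a = [1.0, 2.0, 3.0, 4.0, 5.0]
population_b = [2.0, 3.0, 4.0, 5.0, 6.0]
g_ab = hedges_g(population_a, population_b)
```

The sign of g depends on the order of the two populations; swapping the arguments flips it.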
DNA was extracted from a total of 50 blood samples and sequenced on the Illumina Novaseq platform (150 bp, paired end). In addition, we downloaded 14 genomes from the sequence-read archive of NCBI (https://www.ncbi.nlm.nih.gov/sra) that originated from San Cristobal (Rubin et al. 2022). After genome assembly, SNP calling with a high-quality Camarhynchus parvulus genome as reference (Rubin et al. 2022), and several filtering steps, we had a final SNP dataset of 261 754 SNPs scored in all 64 individuals. This data set was the basis for a phylogenetic Principal Component Analysis and an ADMIXTURE analysis (Alexander et al. 2009) for the examination of the population genetic structure. To estimate the divergence time for the major genetic clusters, we performed a Bayesian phylogenetic analysis on a pruned dataset of 50K SNPs using the software SNAPP (Bryant et al. 2012) and 10 individuals representing five clusters or ancestral groups.
Pairwise divergences in sperm length were compared with similar data from other songbird populations (Lifjeld et al. 2024) in relation to their divergence time.
References:
Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19(9), 1655-1664. https://doi.org/10.1101/gr.094052.109
Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N. A., & RoyChoudhury, A. (2012). Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution, 29(8), 1917-1932. https://doi.org/10.1093/molbev/mss086
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107-128. https://doi.org/10.3102/10769986006002107
Lifjeld, J. T., Cramer, E. R. A., Leder, E. H., & Voje, K. L. (2024). Sperm as a speciation phenotype in promiscuous songbirds. Evolution, 79(1), 134-143. https://doi.org/10.1093/evolut/qpae154
Rubin, C.-J., Enbody, E. D., Dobreva, M. P., Abzhanov, A., Davis, B. W., Lamichhaney, S., Pettersson, M., Sendell-Price, A. T., Sprehn, C. G., Valle, C. A., Vasco, K., Wallerman, O., Grant, B. R., Grant, P. R., & Andersson, L. (2022). Rapid adaptive radiation of Darwin’s finches depends on ancestral genetic modules. Science Advances, 8(27), eabm5982. https://doi.org/10.1126/sciadv.abm5982
