<i>k</i>-mer-based diversity scales with population size proxies more than nucleotide diversity in a meta-analysis of 98 plant species
Data files
Mar 27, 2025 version files: 280.73 GB
- a-species-kmers.tar.gz (6.39 GB)
- a-species-snps.tar.gz (15.95 GB)
- b-species-kmers.tar.gz (10.17 GB)
- b-species-snps.tar.gz (25.56 GB)
- c-species-kmers.tar.gz (14.09 GB)
- c-species-snps.tar.gz (25.94 GB)
- d-species-kmers.tar.gz (1.62 GB)
- d-species-snps.tar.gz (4.26 GB)
- e-species-kmers.tar.gz (470.20 MB)
- e-species-snps.tar.gz (753.89 MB)
- f-species-kmers.tar.gz (187.10 MB)
- f-species-snps.tar.gz (165.45 MB)
- g-species-kmers.tar.gz (7.67 GB)
- g-species-snps.tar.gz (25.41 GB)
- h-species-kmers.tar.gz (404.44 MB)
- h-species-snps.tar.gz (3.35 GB)
- j-species-kmers.tar.gz (1.92 GB)
- j-species-snps.tar.gz (5.79 GB)
- l-species-kmers.tar.gz (3.04 GB)
- l-species-snps.tar.gz (3.22 GB)
- m-species-kmers.tar.gz (4.39 GB)
- m-species-snps.tar.gz (6.31 GB)
- n-species-kmers.tar.gz (669.39 MB)
- n-species-snps.tar.gz (740.81 MB)
- o-species-kmers.tar.gz (7.43 GB)
- o-species-snps.tar.gz (21.47 GB)
- p-species-kmers.tar.gz (10.43 GB)
- p-species-snps.tar.gz (10.74 GB)
- q-species-kmers.tar.gz (1.59 GB)
- q-species-snps.tar.gz (7.14 GB)
- r-species-kmers.tar.gz (1.10 GB)
- r-species-snps.tar.gz (2.18 GB)
- README.md (19.14 KB)
- s-species-kmers.tar.gz (10.35 GB)
- s-species-snps.tar.gz (11.98 GB)
- species_list_2024-04-03.nwk (5.30 KB)
- t-species-kmers.tar.gz (569.52 MB)
- t-species-snps.tar.gz (1.17 GB)
- TableS2.xlsx (43.60 KB)
- v-species-kmers.tar.gz (5.69 GB)
- v-species-snps.tar.gz (9.23 GB)
- x-species-kmers.tar.gz (634.70 MB)
- x-species-snps.tar.gz (42.67 MB)
- z-species-kmers.tar.gz (5.74 GB)
- z-species-snps.tar.gz (4.78 GB)
Abstract
A key prediction of neutral theory is that the level of genetic diversity in a population should scale with population size. However, as was noted by Richard Lewontin in 1974 and reaffirmed by later studies, the slope of the population size-diversity relationship in nature is much weaker than expected under neutral theory. We hypothesize that one contributor to this paradox is that current methods relying on single nucleotide polymorphisms (SNPs) called from aligning short reads to a reference genome underestimate levels of genetic diversity in many species. As a first step to testing this idea, we calculated nucleotide diversity (π) and k-mer-based metrics of genetic diversity across 112 plant species, amounting to over 205 terabases of DNA sequencing data from 27,488 individuals.
Dataset DOI: https://doi.org/10.5061/dryad.s1rn8pkk0
Data description
This dataset contains k-mer count matrices and SNP calls in VCF format for 112 different plant species. It also includes one table of covariates and a phylogeny of all the species included in the study, which was used for final statistical analyses.
Files and variables
TableS2.xlsx
This Excel workbook contains two sheets named TableS2 and metadata. All of the final covariate values we used for statistical analyses are in the sheet named TableS2, where each row corresponds to one of the 112 species included in our dataset. The sheet named metadata describes what each column in the sheet named TableS2 means and is reproduced below for reference:
Column | Description |
---|---|
species | Name of species |
h | Summed heterozygosity across all 4-fold degenerate sites (invariant and variant) in the genome |
ntotal | Total number of 4-fold degenerate sites (invariant and variant) in reference genome with called genotypes |
nvariant | Total number of 4-fold degenerate variant sites called in reference genome |
ninvariant | Total number of 4-fold degenerate invariant sites called in reference genome |
pi | Genome-wide average heterozygosity per 4-fold degenerate site |
bcd | Average pairwise Bray-Curtis dissimilarity |
jac | Average pairwise Jaccard dissimilarity |
umean | Average number of k-mers in the union of a pair of genotypes’ k-mer count vectors |
nidv | Number of individuals included in the study |
totalbp | Total number of base pairs sequenced across all individuals |
cvbp | Coefficient of variation in the number of base pairs sequenced across individuals |
gbif_all_area | Range size in square kilometers estimated from GBIF occurrences, including native and invaded ranges |
gbif_native_area | Range size in square kilometers estimated from GBIF occurrences, including only native ranges |
species_updated | Name of species used to query WCVP range maps |
wcvp_native_area | Range size in square kilometers estimated from WCVP range maps, including only native ranges |
wcvp_all_area | Range size in square kilometers estimated from WCVP range maps, including native and invaded ranges |
DNA.amount..1.C..pg. | Haploid genome size in picograms |
Mating.system | Mating system of species, as used in phylogenetic least squares modeling. There are only two categories: outcrossing and not outcrossing (selfing/mixed/clonal) |
Ploidy | Ploidy level (2 for diploid, 4 for tetraploid, etc.) |
cultivation.status | Classification of species as “wild” or “cultivated” |
Generation.time | Classification of species based on life cycle habit, as used in phylogenetic least squares modeling. There are only two categories: annual and not annual (perennial/biennial/mixed) |
height_final | Plant height value in meters used for calculating our main proxy of population size: the range size-squared height ratio |
wcvp_all_popsize | Range size-squared height ratio where range size is estimated from WCVP range maps, including native and invaded ranges |
wcvp_native_popsize | Range size-squared height ratio where range size is estimated from WCVP range maps, including only native ranges |
gbif_all_popsize | Range size-squared height ratio where range size is estimated from GBIF occurrences, including native and invaded ranges |
gbif_native_popsize | Range size-squared height ratio where range size is estimated from GBIF occurrences, including only native ranges |
genome.size | Haploid genome size in base pairs (genome size in picograms x 0.978E9) |
meancov | Average genome-wide coverage per individual (totalbp/(genome.size*nidv)) |
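
Several columns in the table are simple functions of other columns. The sketch below shows how they could be recomputed as a consistency check. The genome.size and meancov formulas are taken directly from the column descriptions. Two things are assumptions rather than documented facts: that pi equals h divided by ntotal, and that the "range size-squared height ratio" means range size divided by height squared. All function names and the example values are hypothetical.

```python
# Conversion stated in the metadata sheet: genome size in pg x 0.978E9 = bp.
BP_PER_PICOGRAM = 0.978e9


def genome_size_bp(c_value_pg):
    """Haploid genome size in base pairs from the 1C value in picograms."""
    return c_value_pg * BP_PER_PICOGRAM


def mean_coverage(totalbp, genome_size, nidv):
    """meancov as described in the table: totalbp / (genome.size * nidv)."""
    return totalbp / (genome_size * nidv)


def pi_per_site(h, ntotal):
    """ASSUMPTION: pi is summed heterozygosity divided by the number of
    called 4-fold degenerate sites (variant and invariant)."""
    return h / ntotal


def popsize_proxy(range_area_km2, height_m):
    """ASSUMPTION: the population size proxy is range size / height**2,
    with no unit conversion applied."""
    return range_area_km2 / height_m ** 2


# Made-up values for illustration only (not taken from TableS2.xlsx).
size = genome_size_bp(0.5)
coverage = mean_coverage(48_900_000_000, size, 10)
print(size, coverage, pi_per_site(5.0, 1000), popsize_proxy(100.0, 2.0))
```

If recomputed values differ from the table, the methods PDF referenced at the end of this README is the authority on the exact definitions.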
species_list_2024-04-03.nwk
The phylogeny we used for our analysis is species_list_2024-04-03.nwk. The file is in Newick format and was downloaded from timetree.org on March 3rd, 2024. Branch lengths and divergence times are in millions of years.
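
To cross-check the tree against the species in TableS2.xlsx, the tip labels can be pulled out of the Newick string. The following is a minimal standard-library sketch, not a full Newick parser: it assumes unquoted tip labels without commas, colons, or parentheses. The function name and the toy tree (including its branch lengths) are hypothetical.

```python
import re


def tip_labels(newick):
    """Return tip (leaf) labels from a Newick string.

    A tip label directly follows "(" or ","; internal node labels follow ")"
    and are therefore skipped. ASSUMPTION: labels are unquoted.
    """
    return [m.strip() for m in re.findall(r"[(,]\s*([^():,;]+)", newick)]


# Toy tree with invented branch lengths, for illustration only.
toy_tree = "((Zea_mays:10.5,Oryza_sativa:10.5):40.0,Vitis_vinifera:50.5);"
print(tip_labels(toy_tree))

# Usage with the real file (path is wherever the file was downloaded):
# with open("species_list_2024-04-03.nwk") as fh:
#     labels = tip_labels(fh.read())
```

For anything beyond listing tips, a dedicated phylogenetics library is a safer choice than a regular expression.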
K-mer counts
There is one k-mer count matrix per species, and each matrix file follows this naming convention: <genus>_<species>_AllMergedKmerCounts.txt.gz, where <genus> and <species> are the genus and species names (all lowercase) of each species.
Each matrix has “kmer” as its first column, which gives a k-mer sequence. The subsequent columns are the NCBI biosample identifiers from which the k-mers were derived. Each element of the matrix is the count of a given k-mer in a given biosample. Each matrix has 10,000,000 rows corresponding to 10,000,000 randomly sampled 31-mers that have a count of 5 or more in at least one biosample.
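
As an illustration of how such a matrix relates to the jac and bcd columns of TableS2.xlsx, the sketch below reads a matrix and computes the textbook Jaccard and Bray-Curtis dissimilarities for one pair of biosamples. It is not the authors' pipeline, whose exact definitions are in the methods PDF. The tab delimiter is an assumption about the file layout, and the function names, toy k-mers, and toy biosample identifiers are hypothetical.

```python
import csv
import gzip
import io


def read_kmer_matrix(handle, delimiter="\t"):
    """Read a k-mer count matrix into {biosample: [counts...]}.

    ASSUMPTION: the file is tab-delimited with a header row whose first
    field is "kmer"; pass another delimiter if that turns out to be wrong.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    samples = next(reader)[1:]
    counts = {sample: [] for sample in samples}
    for row in reader:
        for sample, value in zip(samples, row[1:]):
            counts[sample].append(int(value))
    return counts


def read_kmer_matrix_gz(path):
    """Read a gzipped matrix file without decompressing it on disk."""
    with gzip.open(path, "rt") as handle:
        return read_kmer_matrix(handle)


def jaccard_dissimilarity(a, b):
    """1 - |shared k-mers| / |union of k-mers|, using presence/absence."""
    union = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    return 1 - shared / union if union else 0.0


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)), using counts."""
    total = sum(a) + sum(b)
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / total if total else 0.0


# Toy matrix (4-mers and invented sample names, for illustration only).
toy = "kmer\tSAMPLE_A\tSAMPLE_B\nAAAC\t5\t0\nAAAG\t3\t3\nAAAT\t0\t2\n"
counts = read_kmer_matrix(io.StringIO(toy))
jac = jaccard_dissimilarity(counts["SAMPLE_A"], counts["SAMPLE_B"])
bcd = bray_curtis_dissimilarity(counts["SAMPLE_A"], counts["SAMPLE_B"])
print(round(jac, 3), round(bcd, 3))
```

Holding a full 10,000,000-row matrix in plain Python lists is memory-hungry; for real files, an array library or a streaming pass over two columns at a time would likely be more practical.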
To meet the file number limit of Dryad (no more than 100 files total per repository), these matrices are consolidated into tar archives according to the first letter of their genus name. Below are the exact k-mer matrices consolidated into each tar archive.
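
Given the naming and archiving rules above, the archive and file name for any species can be derived from its binomial. This is a small sketch with hypothetical function names; whether files sit at the root of each archive or under a subfolder is not stated here, so only the names are constructed.

```python
def _genus_species(species_name):
    """Split a binomial such as "Arabidopsis thaliana" into lowercase parts."""
    genus, species = species_name.lower().split()[:2]
    return genus, species


def kmer_matrix_location(species_name):
    """Return (tar archive name, k-mer matrix file name) for a species."""
    genus, species = _genus_species(species_name)
    archive = f"{genus[0]}-species-kmers.tar.gz"
    matrix = f"{genus}_{species}_AllMergedKmerCounts.txt.gz"
    return archive, matrix


def snp_folder_location(species_name):
    """Return (tar archive name, VCF folder name) for a species."""
    genus, species = _genus_species(species_name)
    return f"{genus[0]}-species-snps.tar.gz", f"{genus}_{species}/"


print(kmer_matrix_location("Arabidopsis thaliana"))
print(snp_folder_location("Zea mays"))
```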
a-species-kmers.tar.gz
- actinidia_chinensis_AllMergedKmerCounts.txt.gz
- amaranthus_hypochondriacus_AllMergedKmerCounts.txt.gz
- ananas_comosus_AllMergedKmerCounts.txt.gz
- arabidopsis_halleri_AllMergedKmerCounts.txt.gz
- arabidopsis_lyrata_AllMergedKmerCounts.txt.gz
- arabidopsis_suecica_AllMergedKmerCounts.txt.gz
- arabidopsis_thaliana_AllMergedKmerCounts.txt.gz
- arabis_alpina_AllMergedKmerCounts.txt.gz
- arabis_nemorensis_AllMergedKmerCounts.txt.gz
- arachis_duranensis_AllMergedKmerCounts.txt.gz
- arachis_hypogaea_AllMergedKmerCounts.txt.gz
- arachis_ipaensis_AllMergedKmerCounts.txt.gz
b-species-kmers.tar.gz
- benincasa_hispida_AllMergedKmerCounts.txt.gz
- beta_vulgaris_AllMergedKmerCounts.txt.gz
- boechera_stricta_AllMergedKmerCounts.txt.gz
- brachypodium_distachyon_AllMergedKmerCounts.txt.gz
- brassica_juncea_AllMergedKmerCounts.txt.gz
- brassica_napus_AllMergedKmerCounts.txt.gz
- brassica_oleracea_AllMergedKmerCounts.txt.gz
- brassica_rapa_AllMergedKmerCounts.txt.gz
- buddleja_alternifolia_AllMergedKmerCounts.txt.gz
c-species-kmers.tar.gz
- cajanus_cajan_AllMergedKmerCounts.txt.gz
- camelina_sativa_AllMergedKmerCounts.txt.gz
- camellia_sinensis_AllMergedKmerCounts.txt.gz
- cannabis_sativa_AllMergedKmerCounts.txt.gz
- capsella_grandiflora_AllMergedKmerCounts.txt.gz
- capsella_rubella_AllMergedKmerCounts.txt.gz
- capsicum_annuum_AllMergedKmerCounts.txt.gz
- castanea_mollissima_AllMergedKmerCounts.txt.gz
- chenopodium_quinoa_AllMergedKmerCounts.txt.gz
- cicer_arietinum_AllMergedKmerCounts.txt.gz
- citrullus_lanatus_AllMergedKmerCounts.txt.gz
- coffea_arabica_AllMergedKmerCounts.txt.gz
- coffea_canephora_AllMergedKmerCounts.txt.gz
- corylus_americana_AllMergedKmerCounts.txt.gz
- corylus_avellana_AllMergedKmerCounts.txt.gz
- cucumis_melo_AllMergedKmerCounts.txt.gz
- cucumis_sativus_AllMergedKmerCounts.txt.gz
- cucurbita_argyrosperma_AllMergedKmerCounts.txt.gz
- cucurbita_pepo_AllMergedKmerCounts.txt.gz
d-species-kmers.tar.gz
- digitaria_exilis_AllMergedKmerCounts.txt.gz
- dioscorea_alata_AllMergedKmerCounts.txt.gz
- dioscorea_rotundata_AllMergedKmerCounts.txt.gz
- durio_zibethinus_AllMergedKmerCounts.txt.gz
e-species-kmers.tar.gz
- elaeis_guineensis_AllMergedKmerCounts.txt.gz
f-species-kmers.tar.gz
- ficus_carica_AllMergedKmerCounts.txt.gz
g-species-kmers.tar.gz
- glycine_max_AllMergedKmerCounts.txt.gz
- glycine_soja_AllMergedKmerCounts.txt.gz
- gossypium_arboreum_AllMergedKmerCounts.txt.gz
- gossypium_barbadense_AllMergedKmerCounts.txt.gz
- gossypium_hirsutum_AllMergedKmerCounts.txt.gz
h-species-kmers.tar.gz
- helianthus_annuus_AllMergedKmerCounts.txt.gz
- heliosperma_pusillum_AllMergedKmerCounts.txt.gz
j-species-kmers.tar.gz
- juglans_regia_AllMergedKmerCounts.txt.gz
l-species-kmers.tar.gz
- lactuca_sativa_AllMergedKmerCounts.txt.gz
- lens_ervoides_AllMergedKmerCounts.txt.gz
- linum_usitatissimum_AllMergedKmerCounts.txt.gz
- lotus_japonicus_AllMergedKmerCounts.txt.gz
- lupinus_angustifolius_AllMergedKmerCounts.txt.gz
m-species-kmers.tar.gz
- macadamia_integrifolia_AllMergedKmerCounts.txt.gz
- malus_domestica_AllMergedKmerCounts.txt.gz
- malus_sylvestris_AllMergedKmerCounts.txt.gz
- mangifera_indica_AllMergedKmerCounts.txt.gz
- manihot_esculenta_AllMergedKmerCounts.txt.gz
- medicago_truncatula_AllMergedKmerCounts.txt.gz
- mimulus_guttatus_AllMergedKmerCounts.txt.gz
- miscanthus_sinensis_AllMergedKmerCounts.txt.gz
- momordica_charantia_AllMergedKmerCounts.txt.gz
- musa_acuminata_AllMergedKmerCounts.txt.gz
n-species-kmers.tar.gz
- nelumbo_nucifera_AllMergedKmerCounts.txt.gz
- nicotiana_tabacum_AllMergedKmerCounts.txt.gz
o-species-kmers.tar.gz
- olea_europaea_AllMergedKmerCounts.txt.gz
- oryza_barthii_AllMergedKmerCounts.txt.gz
- oryza_brachyantha_AllMergedKmerCounts.txt.gz
- oryza_glaberrima_AllMergedKmerCounts.txt.gz
- oryza_longistaminata_AllMergedKmerCounts.txt.gz
- oryza_punctata_AllMergedKmerCounts.txt.gz
- oryza_rufipogon_AllMergedKmerCounts.txt.gz
- oryza_sativa_AllMergedKmerCounts.txt.gz
p-species-kmers.tar.gz
- papaver_somniferum_AllMergedKmerCounts.txt.gz
- phaseolus_vulgaris_AllMergedKmerCounts.txt.gz
- phoenix_dactylifera_AllMergedKmerCounts.txt.gz
- populus_deltoides_AllMergedKmerCounts.txt.gz
- populus_trichocarpa_AllMergedKmerCounts.txt.gz
- prunus_armeniaca_AllMergedKmerCounts.txt.gz
- prunus_avium_AllMergedKmerCounts.txt.gz
- prunus_persica_AllMergedKmerCounts.txt.gz
q-species-kmers.tar.gz
- quercus_lobata_AllMergedKmerCounts.txt.gz
- quercus_robur_AllMergedKmerCounts.txt.gz
r-species-kmers.tar.gz
- raphanus_sativus_AllMergedKmerCounts.txt.gz
- rhododendron_griersonianum_AllMergedKmerCounts.txt.gz
s-species-kmers.tar.gz
- salix_dunnii_AllMergedKmerCounts.txt.gz
- sesamum_indicum_AllMergedKmerCounts.txt.gz
- setaria_italica_AllMergedKmerCounts.txt.gz
- setaria_viridis_AllMergedKmerCounts.txt.gz
- solanum_lycopersicum_AllMergedKmerCounts.txt.gz
- solanum_stenotomum_AllMergedKmerCounts.txt.gz
- sorghum_bicolor_AllMergedKmerCounts.txt.gz
- spinacia_oleracea_AllMergedKmerCounts.txt.gz
- spirodela_polyrhiza_AllMergedKmerCounts.txt.gz
- striga_hermonthica_AllMergedKmerCounts.txt.gz
t-species-kmers.tar.gz
- tetracentron_sinense_AllMergedKmerCounts.txt.gz
- thlaspi_arvense_AllMergedKmerCounts.txt.gz
v-species-kmers.tar.gz
- vanilla_planifolia_AllMergedKmerCounts.txt.gz
- vigna_radiata_AllMergedKmerCounts.txt.gz
- vigna_umbellata_AllMergedKmerCounts.txt.gz
- vigna_unguiculata_AllMergedKmerCounts.txt.gz
- vitis_vinifera_AllMergedKmerCounts.txt.gz
x-species-kmers.tar.gz
- xanthoceras_sorbifolium_AllMergedKmerCounts.txt.gz
z-species-kmers.tar.gz
- zea_mays_AllMergedKmerCounts.txt.gz
- zizania_palustris_AllMergedKmerCounts.txt.gz
- ziziphus_jujuba_AllMergedKmerCounts.txt.gz
SNP VCFs
There is one VCF file per contig/chromosome per species, and each VCF file follows this naming convention: filtered_variantAndInvariant_<genus>_<species>_<chromosome_name>.vcf.gz, where <genus> and <species> are the genus and species names (all lowercase) of each species and <chromosome_name> is the name of the contig/chromosome in the VCF file.
The VCFs are bgzipped. They are not filtered by read depth, but every other filter described in the methods of this repository has been applied.
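
The sketch below shows one way to work with these files from Python: splitting a file name into its parts and tallying variant versus invariant records. It rests on two assumptions that should be checked against the real files: that genus and species contain only lowercase letters (so any extra underscores belong to the chromosome name), and that invariant sites are encoded with "." in the ALT column, as is common in all-sites VCFs. Function names, chromosome names, and the toy records are hypothetical.

```python
import re

VCF_NAME = re.compile(
    r"^filtered_variantAndInvariant_([a-z]+)_([a-z]+)_(.+)\.vcf\.gz$"
)


def parse_vcf_name(filename):
    """Split a VCF file name into (genus, species, chromosome_name)."""
    match = VCF_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.groups()


def count_sites(lines):
    """Return (n_variant, n_invariant) over VCF data lines.

    ASSUMPTION: a record with ALT == "." is an invariant site.
    """
    n_variant = n_invariant = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        alt = line.split("\t", 5)[4]  # ALT is the fifth VCF column
        if alt == ".":
            n_invariant += 1
        else:
            n_variant += 1
    return n_variant, n_invariant


# Toy records, for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t10\t.\tA\t.\t50\tPASS\t.\tGT\t0/0",
    "chr1\t11\t.\tC\tT\t50\tPASS\t.\tGT\t0/1",
    "chr1\t12\t.\tG\t.\t50\tPASS\t.\tGT\t0/0",
]
print(parse_vcf_name("filtered_variantAndInvariant_zea_mays_chr1.vcf.gz"))
print(count_sites(toy_vcf))
```

Because bgzip output is a valid multi-member gzip stream, a real file can be streamed with the standard library, for example `with gzip.open(path, "rt") as fh: count_sites(fh)` after `import gzip`.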
To meet the file number limit of Dryad (no more than 100 files total per repository), these folders of VCFs are consolidated into tar archives according to the first letter of their genus name. Below are the exact VCF folders consolidated into each tar archive.
a-species-snps.tar.gz
- actinidia_chinensis/
- amaranthus_hypochondriacus/
- ananas_comosus/
- arabidopsis_halleri/
- arabidopsis_lyrata/
- arabidopsis_suecica/
- arabidopsis_thaliana/
- arabis_alpina/
- arabis_nemorensis/
- arachis_duranensis/
- arachis_hypogaea/
- arachis_ipaensis/
b-species-snps.tar.gz
- benincasa_hispida/
- beta_vulgaris/
- boechera_stricta/
- brachypodium_distachyon/
- brassica_juncea/
- brassica_napus/
- brassica_oleracea/
- brassica_rapa/
- buddleja_alternifolia/
c-species-snps.tar.gz
- cajanus_cajan/
- camelina_sativa/
- camellia_sinensis/
- cannabis_sativa/
- capsella_grandiflora/
- capsella_rubella/
- capsicum_annuum/
- castanea_mollissima/
- chenopodium_quinoa/
- cicer_arietinum/
- citrullus_lanatus/
- coffea_arabica/
- coffea_canephora/
- corylus_americana/
- corylus_avellana/
- cucumis_melo/
- cucumis_sativus/
- cucurbita_argyrosperma/
- cucurbita_pepo/
d-species-snps.tar.gz
- digitaria_exilis/
- dioscorea_alata/
- dioscorea_rotundata/
- durio_zibethinus/
e-species-snps.tar.gz
- elaeis_guineensis/
f-species-snps.tar.gz
- ficus_carica/
g-species-snps.tar.gz
- glycine_max/
- glycine_soja/
- gossypium_arboreum/
- gossypium_barbadense/
- gossypium_hirsutum/
h-species-snps.tar.gz
- helianthus_annuus/
- heliosperma_pusillum/
j-species-snps.tar.gz
- juglans_regia/
l-species-snps.tar.gz
- lactuca_sativa/
- lens_ervoides/
- linum_usitatissimum/
- lotus_japonicus/
- lupinus_angustifolius/
m-species-snps.tar.gz
- macadamia_integrifolia/
- malus_domestica/
- malus_sylvestris/
- mangifera_indica/
- manihot_esculenta/
- medicago_truncatula/
- mimulus_guttatus/
- miscanthus_sinensis/
- momordica_charantia/
- musa_acuminata/
n-species-snps.tar.gz
- nelumbo_nucifera/
- nicotiana_tabacum/
o-species-snps.tar.gz
- olea_europaea/
- oryza_barthii/
- oryza_brachyantha/
- oryza_glaberrima/
- oryza_longistaminata/
- oryza_punctata/
- oryza_rufipogon/
- oryza_sativa/
p-species-snps.tar.gz
- papaver_somniferum/
- phaseolus_vulgaris/
- phoenix_dactylifera/
- populus_deltoides/
- populus_trichocarpa/
- prunus_armeniaca/
- prunus_avium/
- prunus_persica/
q-species-snps.tar.gz
- quercus_lobata/
- quercus_robur/
r-species-snps.tar.gz
- raphanus_sativus/
- rhododendron_griersonianum/
s-species-snps.tar.gz
- salix_dunnii/
- sesamum_indicum/
- setaria_italica/
- setaria_viridis/
- solanum_lycopersicum/
- solanum_stenotomum/
- sorghum_bicolor/
- spinacia_oleracea/
- spirodela_polyrhiza/
- striga_hermonthica/
t-species-snps.tar.gz
- tetracentron_sinense/
- thlaspi_arvense/
v-species-snps.tar.gz
- vanilla_planifolia/
- vigna_radiata/
- vigna_umbellata/
- vigna_unguiculata/
- vitis_vinifera/
x-species-snps.tar.gz
- xanthoceras_sorbifolium/
z-species-snps.tar.gz
- zea_mays/
- zizania_palustris/
- ziziphus_jujuba/
Code and Software
TableS2.xlsx can be opened with Excel or most other spreadsheet software.
Any tar archive can be unpacked with a command like the following: `tar -xzvf a-species-snps.tar.gz`.
Each k-mer matrix within a tar archive is also gzipped to reduce download size. Any given matrix can be decompressed with a command like the following: `gunzip arabidopsis_thaliana_AllMergedKmerCounts.txt.gz`.
Each VCF file within a tar archive is also bgzipped (.vcf.gz file type) to reduce download size. Each file can be decompressed with a command like `bgzip -d genus_species_chromosome.vcf.gz`. The bgzip utility is part of HTSlib, which is distributed by the samtools project (https://www.htslib.org/).
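
Because the archives are large, it can be useful to inspect one before unpacking it fully. The following sketch uses only the Python standard library to list the members of a tar archive and preview the first lines of one gzipped member without writing it to disk. The function names are hypothetical, and the demo builds a tiny stand-in archive in a temporary directory rather than touching the real data.

```python
import gzip
import itertools
import os
import tarfile
import tempfile


def list_archive(tar_path):
    """Return the member names stored in a .tar.gz archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()


def head_of_member(tar_path, member, n=3):
    """Return the first n text lines of a gzipped member of the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        raw = tar.extractfile(member)
        if raw is None:
            raise ValueError(f"{member} is not a regular file")
        with gzip.open(raw, "rt") as text:
            return [line.rstrip("\n") for line in itertools.islice(text, n)]


# Demo on a tiny archive with invented contents, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    inner = os.path.join(tmp, "toy_species_AllMergedKmerCounts.txt.gz")
    with gzip.open(inner, "wt") as fh:
        fh.write("kmer\tSAMPLE_A\nACGT\t7\n")
    archive = os.path.join(tmp, "t-species-kmers.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(inner, arcname=os.path.basename(inner))
    names = list_archive(archive)
    head = head_of_member(archive, names[0])

print(names, head)
```

Listing a multi-gigabyte .tar.gz still requires reading through the whole compressed stream, so this saves disk space rather than time.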
Access information
Data was derived from the following sources:
- NCBI Sequence Read Archive: https://www.ncbi.nlm.nih.gov/sra
- Plant DNA c-values database: https://cvalues.science.kew.org/
- World Checklist of Vascular Plants: https://powo.science.kew.org/
- Global Biodiversity Information Facility: https://www.gbif.org/
- Encyclopedia of Life: https://eol.org/
- Timetree: https://timetree.org/
The workflow we used to create these datasets is packaged as a snakemake workflow stored here: https://github.com/milesroberts-123/tajimasDacrossSpecies
Please see the file named lewontin_paradox_methods_figures.pdf in the Zenodo submission attached to this repository for a full breakdown of the methods and references we used for SNP calling, k-mer counting, and scraping literature for genome size and life history variables. Some figures plotting the data in TableS2.xlsx are also included for reference.