Data from: Footprints of human migration in the population structure of wild baker’s yeast
Data files
Jan 22, 2025 version files 4.16 GB
- admixture.zip (314.43 MB)
- fasta.zip (1.11 GB)
- README.md (6.84 KB)
- Scer246.mfa.gz (828.79 MB)
- Scer313.fasta.gz (1.06 GB)
- Scer50_wgalign.mfa (603.57 MB)
- timedivergence.zip (247.43 MB)
Abstract
Humans have a long history of fermenting food and beverages that led to domestication of the wine or baker’s yeast, Saccharomyces cerevisiae. Despite their tight companionship with humans, yeast species that are domesticated or pathogenic can also live on trees. Here we used over 300 genomes of S. cerevisiae from oaks and other trees to determine whether tree-associated populations are genetically distinct from domesticated lineages and estimate the timing of forest lineage divergence. We found populations on trees are highly structured within Europe, Japan, and North America. Approximate estimates of when forest lineages diverged out of Asia and into North America and Europe coincide with the end of the last ice age, the spread of agriculture, and the onset of fermentation by humans. It appears that migration from human-associated environments to trees is ongoing. Indeed, patterns of ancestry in the genomes of three recent migrants from the trees of North America to Europe could be explained by the human response to the Great French Wine Blight. Our results suggest that human-assisted migration affects forest populations, albeit rarely. Such migration events may even have shaped the global distribution of S. cerevisiae. Given the potential for lasting impacts due to yeast migration between human and natural environments, it seems important to understand the evolution of human commensals and pathogens in wild niches.
README: Footprints of human migration in the population structure of wild baker’s yeast
https://doi.org/10.5061/dryad.pnvx0k6zq
Description of the data and file structure
Data were collected as described in the Methods section of Peña et al., "Footprints of human migration in the population structure of wild baker’s yeast".
More specifically, whole-genome sequences for strains sampled from trees were compiled from publicly available data (N = 295; Table S1) (Almeida et al. 2015; Barbosa et al. 2016; Bergström et al. 2014; Duan et al. 2018; Fay et al. 2019; Gayevskiy et al. 2016; Han et al. 2021; Pontes et al. 2020; Skelly et al. 2013; Song et al. 2015; Strope et al. 2015; Yue et al. 2017). We defined S. cerevisiae tree-sampled strains as those isolated from tree bark, exudate, and leaves from trees or litter, and we also included strains from any soil. New whole-genome sequence data were generated for strains from trees in Indiana and Kentucky (N = 7; Osburn et al. 2018), North Carolina (N = 9; Diezmann & Dietrich 2009), and Europe (N = 3; Robinson et al. 2016), and for new S. cerevisiae strains from the bark of white oak (Quercus alba) and live oak (Q. virginiana) from Georgia, Florida, Pennsylvania, and North Carolina (N = 15; Bensasson lab).
Reads were mapped to the S. cerevisiae reference genome, S288c (SacCer_Apr2011/sacCer3 from UCSC), using the Burrows-Wheeler Aligner (bwa-mem, version 0.7.17; Li & Durbin, 2009). We used SAMtools (version 1.6; Li et al. 2009) to sort, index, and compress bam files, and ran its mpileup function with the -I option to exclude indels. Next, we used the BCFtools call function with the -c option to generate a consensus sequence (version 1.9; Li et al. 2009) and converted it from vcf format to fastq format using the vcfutils.pl vcf2fq command. Lastly, base calls with a phred-scaled quality score of less than 40 were treated as missing data (calls were converted to “N”) using seqtk seq -q 40.
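The quality-masking rule in the last step can be illustrated with a minimal Python sketch. This is not the code used to build the dataset (that step was done with seqtk); the function name `mask_low_quality` and the assumption of Phred+33 quality encoding are ours, for illustration only.

```python
def mask_low_quality(seq: str, qual: str, min_q: int = 40, offset: int = 33) -> str:
    """Replace every base whose Phred score is below min_q with 'N'.

    Assumes qualities are encoded as ASCII characters with the given offset
    (Phred+33 is an assumption here, not something stated in the README).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return "".join(
        base if ord(q) - offset >= min_q else "N"
        for base, q in zip(seq, qual)
    )


# Under Phred+33, 'I' encodes Q40 and '5' encodes Q20,
# so only the third base falls below the threshold.
print(mask_low_quality("ACGT", "II5I"))
```

Bases at exactly Q40 are kept, matching the "less than 40" wording above.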
Alignments used in phylogenetic analyses:
- Multiple alignment file for the neighbor joining tree of strains in Figure 1A: Scer313.fasta.gz
- Multiple alignment file for the maximum likelihood tree in Figure S5, after exclusion of strains with evidence of mixed ancestry (<90% primary cluster assignment) in ADMIXTURE analysis: Scer246.mfa.gz
- Multiple alignment file used for maximum likelihood tree of non-admixed strains in Figure 2A: Scer50_wgalign.mfa
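The 90% primary-cluster rule used to exclude strains with mixed ancestry can be sketched as follows. This is an illustration under assumptions, not the authors' script: it assumes ADMIXTURE's usual `.Q` output (one whitespace-separated row of ancestry proportions per individual, in the same order as the PLINK `.fam` file), and the `.Q` files themselves are not part of this dataset. The function names and toy strain names are hypothetical.

```python
def read_q(path):
    """Read an ADMIXTURE .Q file into a list of rows of floats."""
    with open(path) as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


def read_fam_ids(path):
    """Read sample IDs (second column) from a PLINK .fam file."""
    with open(path) as handle:
        return [line.split()[1] for line in handle if line.strip()]


def non_admixed(strains, q_rows, threshold=0.90):
    """Return strains whose largest ancestry proportion is >= threshold."""
    if len(strains) != len(q_rows):
        raise ValueError("number of strains and Q rows differ")
    return [s for s, row in zip(strains, q_rows) if max(row) >= threshold]


# Toy data with hypothetical strain names and K = 2 clusters.
toy_strains = ["toyA", "toyB", "toyC"]
toy_q = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]]
print(non_admixed(toy_strains, toy_q))
```

With real files, `non_admixed(read_fam_ids(fam_path), read_q(q_path))` would apply the same rule, where `fam_path` and `q_path` are whatever paths your own ADMIXTURE run produced.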
In admixture.zip
- Input plink files for analysis of 313 global strains using ADMIXTURE software, results shown in Figure 1B: Scer313_snps.bed.gz Scer313_snps.bim.gz Scer313_snps.fam.gz Scer313_snps.log.gz Scer313_snps.map.gz Scer313_snps.nosex.gz Scer313_snps.ped.gz Scer313_snps.stats.gz Scer313_snps.vcf.gz
- Input plink files for analysis of 51 European strains using ADMIXTURE software, results shown in Figure 2C: EU_Scer51_snps.bed.gz EU_Scer51_snps.bim.gz EU_Scer51_snps.fam.gz EU_Scer51_snps.log.gz EU_Scer51_snps.map.gz EU_Scer51_snps.nosex.gz EU_Scer51_snps.ped.gz EU_Scer51_snps.stats.gz
In fasta.zip
- 313 fasta files (*.fa.gz, e.g. ZP794.fa.gz), each containing a consensus sequence generated by mapping to the UCSC sacCer3 reference with no insertions. All sequences are therefore on the same coordinate system, and multiple alignments can be generated by extracting the genomic region of interest and combining data from multiple strains. Please note that only the filename shows the name of the strain; the sequences in every file are named as follows: sacCer3chrI sacCer3chrII sacCer3chrIII sacCer3chrIV sacCer3chrIX sacCer3chrM sacCer3chrV sacCer3chrVI sacCer3chrVII sacCer3chrVIII sacCer3chrX sacCer3chrXI sacCer3chrXII sacCer3chrXIII sacCer3chrXIV sacCer3chrXV sacCer3chrXVI
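Because every file shares the sacCer3 coordinate system, a regional multiple alignment can be assembled by slicing the same interval from each strain. The sketch below shows one way to do this in Python with the standard library only. It assumes 1-based, inclusive coordinates and that the strain name is the filename without its `.fa.gz` suffix; the function names and the toy files written in the demo are hypothetical.

```python
import gzip
from pathlib import Path


def read_fasta(path):
    """Return {record name: sequence} from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    records, name, chunks = {}, None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def strain_name(path):
    """Derive the strain name from the filename, e.g. ZP794.fa.gz -> ZP794."""
    name = Path(path).name
    for suffix in (".gz", ".fa"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


def region_alignment(fasta_paths, chrom, start, end):
    """Extract chrom:start-end (1-based, inclusive) from each strain file.

    Returns FASTA text with one record per strain, named after the file.
    """
    out = []
    for path in fasta_paths:
        seq = read_fasta(path)[chrom][start - 1:end]
        out.append(f">{strain_name(path)}\n{seq}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    import tempfile

    # Write two tiny toy files so the demo runs without the real data.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for strain, seq in [("toyA", "ACGTACGTAC"), ("toyB", "ACGTNCGTAC")]:
            path = Path(tmp) / f"{strain}.fa.gz"
            with gzip.open(path, "wt") as handle:
                handle.write(f">sacCer3chrI\n{seq}\n")
            paths.append(path)
        print(region_alignment(paths, "sacCer3chrI", 3, 6), end="")
```

Each genome is read fully into memory, which is simple but slow for many strains; an indexed reader would be a reasonable alternative for large batches.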
In timedivergence.zip
- Whole-chromosome alignments of all genomes used in time divergence analysis, results shown in Fig. 3 and Fig. S12: backBone_chr01.mfa backBone_chr02.mfa backBone_chr03.mfa backBone_chr04.mfa backBone_chr05.mfa backBone_chr06.mfa backBone_chr07.mfa backBone_chr08.mfa backBone_chr09.mfa backBone_chr10.mfa backBone_chr11.mfa backBone_chr12.mfa backBone_chr13.mfa backBone_chr14.mfa backBone_chr15.mfa backBone_chr16.mfa
- Positions of regions where chromosome painting suggests minimal admixture: chrCoord.txt
- Concatenated synonymous site alignments for a single study region per chromosome that were used in time divergence analysis. Synonymous sites were 2-fold, then 4-fold degenerate sites from concatenated genes after excluding overlapping genes and genes with introns. Reference strains that did not cluster with their primary clade assignment at a locus were also dropped: chr01allgenes_synSites_v2.fasta chr02allgenes_synSites_v2.fasta chr03allgenes_synSites_v2.fasta chr04allgenes_synSites_v2.fasta chr05allgenes_synSites_v2.fasta chr06allgenes_synSites_v2.fasta chr07allgenes_synSites_v2.fasta chr08allgenes_synSites_v2.fasta chr09allgenes_synSites_v2.fasta chr10allgenes_synSites_v2.fasta chr11allgenes_synSites_v2.fasta chr12allgenes_synSites_v2.fasta chr13allgenes_synSites_v2.fasta chr14allgenes_synSites_v2.fasta chr15allgenes_synSites_v2.fasta chr16allgenes_synSites_v2.fasta
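For readers unfamiliar with 2-fold and 4-fold degenerate sites, the sketch below classifies third codon positions using the standard nuclear genetic code. It is only an illustration of the concept: the exact site definition and filtering used to build the `*_synSites_v2.fasta` files are described in the paper, and the function names here are hypothetical. Degeneracy at first codon positions is ignored in this sketch.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def third_position_degeneracy(codon):
    """Count how many bases at the third position encode the same amino acid."""
    codon = codon.upper()
    aa = CODON_TABLE[codon]
    return sum(CODON_TABLE[codon[:2] + b] == aa for b in BASES)


def degenerate_sites(cds, fold):
    """Return 0-based indices of third codon positions with the given degeneracy.

    cds is assumed to be an in-frame coding sequence. Codons containing
    N or gap characters, and stop codons, are skipped.
    """
    sites = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3].upper()
        if codon not in CODON_TABLE or CODON_TABLE[codon] == "*":
            continue
        if third_position_degeneracy(codon) == fold:
            sites.append(i + 2)
    return sites


# Toy coding sequence: ATG (Met), GCT (Ala, 4-fold), TTT (Phe, 2-fold), TAA (stop).
toy_cds = "ATGGCTTTTTAA"
print(degenerate_sites(toy_cds, 4), degenerate_sites(toy_cds, 2))
```

In a multi-strain alignment, one would typically also require that a site has the same degeneracy class in every strain before using it.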
Usage notes
- Files ending in *.gz are compressed and can be decompressed using the gunzip command in unix, or they are readable as text without decompression using the gunzip -c command.
- Multiple alignment files are best viewed and analyzed using software for evolutionary analysis, e.g. the free software SeaView, but are also readable as text files: Scer313.fasta.gz, Scer246.mfa.gz, Scer50_wgalign.mfa and all the files in timedivergence.zip except chrCoord.txt.
- chrCoord.txt is viewable with any text editor and contains a list of all chromosomes analyzed (1-16) and the coordinates used for time divergence analysis, e.g. from 120000 to 149999 relative to the sacCer3 reference coordinates.
- The fasta format consensus sequences in fasta.zip are viewable in any text editor.
- The admixture.zip archive needs to be decompressed before viewing. All the data types are viewable with a text editor except for *.bed files, which are binary and can be read using PLINK. For a detailed explanation of the format of each type of file, see https://www.cog-genomics.org/plink/1.9/formats. We used these files as input for genetic variant analysis with ADMIXTURE.
Access information
See Peña et al., "Footprints of human migration in the population structure of wild baker’s yeast", for further information.