Complex polyploids: Origins, genomic composition, and role of introgressed alleles
Data files
Sep 01, 2023 version files 51.73 GB
- P77_COMPX-POLYPLDS_DRYAD_DATA_LEAL.tzr.gz
- README.md
Abstract
Introgression allows polyploid species to acquire new genomic content from diploid progenitors or from other unrelated diploid or polyploid lineages, contributing to genetic diversity and facilitating adaptive allele discovery. In some cases, high levels of introgression elicit the replacement of large numbers of alleles inherited from the polyploid's ancestral species, profoundly reshaping the polyploid's genomic composition. In such complex polyploids, it is often difficult to determine which taxa were the progenitor species and which taxa provided additional introgressive blocks through subsequent hybridization. Here, we use population-level genomic data to reconstruct the phylogenetic history of Betula pubescens (downy birch), a tetraploid species often assumed to be of allopolyploid origin and which is known to hybridize with at least four other birch species. This was achieved by modeling polyploidization and introgression events under the multispecies coalescent and then using an approximate Bayesian computation (ABC) rejection algorithm to evaluate and compare competing polyploidization models. We provide evidence that B. pubescens is the outcome of an autoploid genome doubling event in the common ancestor of B. pendula and its extant sister species, B. platyphylla, that took place approximately 178,000-188,000 generations ago. Extensive hybridization with B. pendula, B. nana, and B. humilis followed in the aftermath of autopolyploidization, with the relative contribution of each of these species to the B. pubescens genome varying markedly across the species' range. Functional analysis of B. pubescens loci containing alleles introgressed from B. nana identified multiple genes involved in climate adaptation, while loci containing alleles derived from B. humilis revealed several genes involved in the regulation of meiotic stability and pollen viability in plant species.
README: Supplementary data for: Complex Polyploids: Origins, Genomic Composition, and Role of Introgressed Alleles
J.L. Leal, Pascal Milesi, Eva Hodková, Qiujie Zhou, Jennifer James, D. Magnus Eklund, Tanja Pyhäjärvi, Jarkko Salojärvi, and Martin Lascoux
https://doi.org/10.5061/dryad.5tb2rbp9f
Scripts used for analysis of complex polyploids are available at:
https://github.com/LLN273/Complex-Polyploids
General description
A. Analysis of B. pubescens genomic data
Data used to produce Fig. 2a, Fig. 2b, Fig. S7, and Fig. S8.
01_MSAs_BEFORE_POLARIZATION.tzr.gz
02_MSAs_AFTER_POLARIZATION.tzr.gz
03_IQTREE_RESULTS.tzr.gz
04_Polyploid_pairing_profiles.tzr.gz
B. STRUCTURE analysis:
Results from STRUCTURE analysis (used to produce Fig. 2c, Fig. S9, and Fig. S10).
05_STRUCTURE_analysis.tzr.gz
C. DAPC analysis:
Input data used to produce Fig. 2d and Fig. S11.
06_DAPC_adegenet_analysis.tzr.gz
D. Simulated datasets and polyploidization model testing:
Data used to produce Fig. 3, Fig. 4, Table S2, Fig. S12, and Fig. S13.
07_SIMULATION_DATA_AND_POLYPLOIDIZATION_MODEL_TESTING.tzr.gz
E. Demographic modeling:
Input data used to produce Fig. 5, Table S3, and Table S4.
08_Demographic_modeling.tzr.gz
F. Analysis of B. pubescens alleles of B. nana or B. humilis origin: geographic distribution & functional analysis:
Data used to produce Fig. 6, Fig. 7, and Table 1.
09_nana_humilis_alleles_geographic_distribution.tzr.gz
Detailed description
A. Analysis of B. pubescens genomic data
A.1 Multiple-sequence alignments (MSAs) for different samples and individual gene families, BEFORE polarization.
Data provided for 269 B. pubescens specimens, 25 B. pendula autotetraploids, and 2 triploid hybrids.
MSAs are provided both in fasta (.fa) and nexus (.nxs) format. Actual pipeline was run using MSAs in fasta format.
MSAs can be opened using an alignment viewer, for example, AliView (https://ormbunkar.se/aliview/).
Produced using scripts in https://github.com/LLN273/Complex-Polyploids [01_Read_mapping_and_variant_calling and 03_Generate_MSAs], based on the raw Illumina libraries available in The European Nucleotide Archive (ENA) at www.ebi.ac.uk under the PRJEB64873 project accession code.
Folders:
01_MSAs_BEFORE_POLARIZATION/fasta
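For scripted inspection, as an alternative to an alignment viewer, a FASTA MSA can be read with a few lines of Python. The sketch below is illustrative only and is not part of the authors' pipeline; the function names `read_fasta_msa` and `is_aligned` and the demo sequences are hypothetical.

```python
from io import StringIO

def read_fasta_msa(handle):
    """Parse FASTA records from an open text handle into {header: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; everything after '>' is used as the key.
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect them piece by piece.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def is_aligned(records):
    """An MSA is expected to have the same length for every sequence."""
    return len({len(seq) for seq in records.values()}) <= 1

# Hypothetical two-sequence alignment standing in for a real .fa file.
demo = StringIO(">sampleA\nACGT-A\nCC\n>sampleB\nACGTTACC\n")
msa = read_fasta_msa(demo)
print(len(msa), is_aligned(msa))
```

To read a real file, replace the `StringIO` object with `open(path)` pointing at one of the alignments in the folder above.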
A.2 MSAs for different samples and individual gene families, AFTER polarization of the focal polyploid sequence.
MSAs are provided both in fasta (.fasta) and nexus (.nxs) format. Actual pipeline was run using MSAs in fasta format.
Produced using scripts in https://github.com/LLN273/Complex-Polyploids/tree/main/04_Polarize_polyploid
Input data: MSAs described in A.1 above
# 2.A Reference sequence used during polarization: B. pendula
Folders:
02_MSAs_AFTER_POLARIZATION/A_Bpubescens_POLARIZED.1_REFSEQ.PENDULA_FINLAND_LOIMAA.LOIMAA
# 2.B Reference sequence used during polarization: B. nana
Folders:
02_MSAs_AFTER_POLARIZATION/B_Bpubescens_POLARIZED.2_REFSEQ.NANA_FINLAND_ENONTEKIO.A002_2
# 2.C Reference sequence used during polarization: B. platyphylla
Folders:
02_MSAs_AFTER_POLARIZATION/C_Bpubescens_POLARIZED.3_REFSEQ.PLATYPHYLLA_RUSSIA.A001_1
# 2.D Reference sequence used during polarization: B. humilis
Folders:
02_MSAs_AFTER_POLARIZATION/D_Bpubescens_POLARIZED.4_REFSEQ.HUM02
A.3 IQTREE2 output files, for different samples and polarization geometries.
Multiple output files (.iqtree and .log). Can be opened with a text editor.
Produced using script 01_IQ-TREE2_SINGLEGenes_query.sh in https://github.com/LLN273/Complex-Polyploids/tree/main/05_Phylogenetic_Analysis
Input data: MSAs described in A.2
Folders:
03_IQTREE_RESULTS/A_Bpubescens_POLARIZED.1_REFSEQ.PENDULA_FINLAND_LOIMAA.LOIMAA
03_IQTREE_RESULTS/B_Bpubescens_POLARIZED.2_REFSEQ.NANA_FINLAND_ENONTEKIO.A002_2
03_IQTREE_RESULTS/C_Bpubescens_POLARIZED.3_REFSEQ.PLATYPHYLLA_RUSSIA.A001_1
03_IQTREE_RESULTS/D_Bpubescens_POLARIZED.4_REFSEQ.HUM02
A.4 Observed polyploid pairing profiles, for different SAMPLES and polarization geometries.
Produced using scripts in https://github.com/LLN273/Complex-Polyploids/tree/main/05_Phylogenetic_Analysis
Input data: IQTREE2 output files described in A.3
Folders:
04_Polyploid_pairing_profiles/A_Bpubescens_POLARIZED.1_REFSEQ.PENDULA_FINLAND_LOIMAA.LOIMAA
04_Polyploid_pairing_profiles/B_Bpubescens_POLARIZED.2_REFSEQ.NANA_FINLAND_ENONTEKIO.A002_2
04_Polyploid_pairing_profiles/C_Bpubescens_POLARIZED.3_REFSEQ.PLATYPHYLLA_RUSSIA.A001_1
04_Polyploid_pairing_profiles/D_Bpubescens_POLARIZED.4_REFSEQ.HUM02
File description:
sister_ID_analysis_COUNTS-XXXXX.txt lists number of times (and normalized frequency) polarized focal polyploid sequence pairs with different species/clades included in the MSA
sister_ID_analysis_summary_-XXXXX.txt polyploid pairing profile (same information as above for selected species and clades).
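As a rough illustration of what a pairing profile contains, the sketch below turns a list of sister-clade assignments (one per gene tree) into counts and normalized frequencies. It does not reproduce the layout of the files above, which is not described here; the function name `pairing_profile` and the demo labels are hypothetical.

```python
from collections import Counter

def pairing_profile(sister_labels):
    """Return {clade: (count, normalized frequency)} for a list of sister clades."""
    counts = Counter(sister_labels)
    total = sum(counts.values())
    return {clade: (n, n / total) for clade, n in counts.items()}

# Hypothetical sister assignments for four gene trees of one sample.
demo_sisters = ["pendula", "pendula", "nana", "humilis"]
profile = pairing_profile(demo_sisters)
for clade, (count, freq) in sorted(profile.items()):
    print(clade, count, freq)
```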
A.5 Observed polyploid pairing profiles, for different POPULATIONS and polarization geometries.
Used to produce Fig. 2a, Fig. 2b, Fig. S7, and Fig. S8.
Input data: polyploid pairing profiles listed in A.4
Folders:
04_Polyploid_pairing_profiles/SUMMARY_TABLES_Observed_Pairing_Profiles_per_population
B. STRUCTURE analysis
Input data used to produce Fig. 2c, Fig. S9, and Fig. S10.
Figures produced using scripts in https://github.com/LLN273/Complex-Polyploids/tree/main/06_Prepare_data_for_STRUCTURE_and_adegenet_analyses and https://github.com/LLN273/Complex-Polyploids/tree/main/07_STRUCTURE_analysis
Folders:
05_STRUCTURE_analysis/15_NANA_SILVER_WHITE_HUMILIS_PLATYPHYLLA_ReducedPEND_ReducedSW_10kSNP_SHORT/
Files (all files can be opened using a text editor):
mainparams STRUCTURE input file (key parameters file)
extraparams STRUCTURE input file (extra parameters file)
XXX.str input data file
seed.txt and strdataset files produced by STRUCTURE (can be ignored)
Folders:
05_STRUCTURE_analysis/15_NANA_SILVER_WHITE_HUMILIS_PLATYPHYLLA_ReducedPEND_ReducedSW_10kSNP_SHORT/STRUCTURE_RESULTS
Files (all files can be opened using a text editor):
str_KXX_repYY_f results from STRUCTURE analysis, computed for K=XX, replicate YY [K=2...9; 16 replicates for each K]
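A quick completeness check of the results folder can be scripted from the naming pattern alone. The sketch below counts replicates per K and flags any K in 2...9 that does not have 16 of them. The helper names are hypothetical, and the demo file list is generated in the code rather than read from the archive; in particular, the replicate numbering used for the demo names is an assumption.

```python
import re
from collections import Counter

# Matches the str_KXX_repYY_f pattern, with or without zero padding.
NAME_PATTERN = re.compile(r"^str_K(\d+)_rep(\d+)_f$")

def replicates_per_k(names):
    """Count how many result files are present for each value of K."""
    counts = Counter()
    for name in names:
        match = NAME_PATTERN.match(name)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)

def incomplete_ks(names, ks=range(2, 10), expected=16):
    """Return the K values that do not have the expected number of replicates."""
    counts = replicates_per_k(names)
    return [k for k in ks if counts.get(k, 0) != expected]

# Hypothetical listing; with real data use os.listdir() on STRUCTURE_RESULTS.
demo_names = [f"str_K{k}_rep{r}_f" for k in range(2, 10) for r in range(1, 17)]
print(len(demo_names), incomplete_ks(demo_names))
```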
C. DAPC analysis
Input data for the DAPC analysis used to produce Fig. 2d and Fig. S11.
Compressed VCF file produced using scripts in https://github.com/LLN273/Complex-Polyploids/tree/main/06_Prepare_data_for_STRUCTURE_and_adegenet_analyses
Compressed VCF file can be viewed using the bash command "zcat myfile.VCF.gz"
DAPC plots produced using scripts in https://github.com/LLN273/Complex-Polyploids/tree/main/08_DAPC_adegenet
Folders:
06_DAPC_adegenet_analysis/INPUT_VCF_FILE/
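Besides `zcat`, a compressed VCF file can be inspected from Python with the standard `gzip` module. The sketch below counts samples and variant records; it assumes the standard VCF layout with nine fixed columns before the sample columns. The function name `summarize_vcf` and the tiny demo file it writes are hypothetical and stand in for the real input file.

```python
import gzip
import os
import tempfile

def summarize_vcf(path):
    """Return (number of samples, number of variant records) for a gzipped VCF."""
    n_samples = 0
    n_records = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Nine fixed columns (CHROM ... FORMAT) precede the samples.
                n_samples = max(len(line.rstrip("\n").split("\t")) - 9, 0)
            elif line.strip():
                n_records += 1
    return n_samples, n_records

# Write a two-sample, two-variant demo file so the example is self-contained.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                    "FILTER", "INFO", "FORMAT", "s1", "s2"])
rows = ["chr1\t10\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1",
        "chr1\t25\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("##fileformat=VCFv4.2\n" + header + "\n" + "\n".join(rows) + "\n")

vcf_summary = summarize_vcf(demo_path)
print(vcf_summary)
```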
D. Simulated datasets and polyploidization model testing
D.1 Generation of simulated MSAs
Data used to produce Fig. 3, Fig. 4, Table S2, Fig. S12, and Fig. S13.
Simulated gene trees and MSAs; 25 replicates were generated; 50 gene families were generated for each replicate.
For each replicate, we provide the raw data obtained after running Simphy and INDELible.
Trees can be displayed using, for example, TreeGraph2 (http://treegraph.bioinfweb.info/).
MSAs can be opened using an alignment viewer, for example, AliView (https://ormbunkar.se/aliview/).
D.1.1 Gene trees generated by Simphy, after fixing tree tips.
Produced using scripts 01_simphy_ILS_birch.sh and 02_fix_tree_tips_lowILS_MAIN.sh in https://github.com/LLN273/Complex-Polyploids/tree/main/09_Model_Testing_1_Polyploidization_and_hybridization_simulations
Folders:
07_SIMULATION_DATA_AND_POLYPLOIDIZATION_MODEL_TESTING/01_simphy_ILS/A_modILS_birch_CLEAN
Files:
g_treesXXXX.trees simulated gene trees, after fixing tips
A_modILS_birch.command, A_modILS_birch.db, A_modILS_birch.params extra files produced by simphy (can be ignored)
l_trees.trees, s_tree.trees, and Rplots.pdf extra files produced by simphy and our in-house scripts (can be ignored)
02a_fix_tree_tips_birch.R R script used to fix tree tips after running Simphy
D.1.2 Gene trees after ILS correction and MSAs produced by INDELible.
Produced using scripts 03_adjust_ILS_NEW_MAIN.sh and 04_INDELIble.sh in https://github.com/LLN273/Complex-Polyploids/tree/main/09_Model_Testing_1_Polyploidization_and_hybridization_simulations
Input data: gene trees produced in D.1.1
Folders:
07_SIMULATION_DATA_AND_POLYPLOIDIZATION_MODEL_TESTING/01_simphy_ILS/A_modILS_birch_CLEAN_reducedILS_40perct
Files:
g_treesXXXX.trees simulated gene trees, after ILS correction
data_XXXX_TRUE.fasta MSAs generated by INDELible
A_modILS_birch.command, A_modILS_birch.db, A_modILS_birch.params extra files produced by simphy (can be ignored)
l_trees.trees, s_tree.trees extra files produced by simphy (can be ignored)
LOG.txt, trees.txt, control.txt extra files produced by INDELible (can be ignored)
D.2 Results from polyploidization model evaluation
Simulated polyploid pairing profiles estimated using an approximate Bayesian computation (ABC) framework based on nine polyploidization models [AAAA, AAHH, AANH, AANN, PPHH, PPNH, PPNN, PPPP, PPyPPy].
ABC results also available for models that include homoeologous exchange (HE).
For each polyploidization model, ABC was carried out using priors sampled from distributions centered on model parameters initially estimated using simulated annealing (SA), based on observed polyploid pairing profiles for seven B. pubescens populations [Arctic, Central Asia, JOK, LT, Spain, SVsouth, UA].
Produced using scripts in https://github.com/LLN273/Complex-Polyploids/tree/main/10_Model_Testing_2_Model_evaluation
Input data: simulated MSAs computed in D.1.2
Folders:
07_SIMULATION_DATA_AND_POLYPLOIDIZATION_MODEL_TESTING/03_ABC_Final_optimization
Files (all files can be opened using a text editor):
summary_ABC_1000_simulations.txt ABC-simulated polyploid pairing profiles for 1,000 independent runs
L2-norm_ABC_1000_simulations.txt L2 norm distance computed for each ABC run, using observed polyploid pairing profiles as a reference
ABC_PRIORS_PPPP_SVsouth (folder) ABC priors for the twelve model parameters (H0, H1, ..., H9, ILSreduction, platy_pend_gene_flow)
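The relationship between the simulated profiles and the L2 norm distances can be sketched as a generic ABC rejection step: each simulated pairing profile is compared with the observed one, and only the closest runs are retained. This is a minimal sketch, not the authors' implementation; the acceptance rule (keeping a fixed fraction of runs), the function names, and all numbers in the demo are assumptions made for illustration.

```python
import math

def l2_distance(simulated, observed):
    """Euclidean (L2 norm) distance between two pairing profiles."""
    if len(simulated) != len(observed):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(simulated, observed)))

def abc_reject(simulations, observed, keep_fraction=0.1):
    """Keep the runs whose simulated profile lies closest to the observed one.

    simulations: list of (parameters, simulated_profile) tuples.
    keep_fraction: assumed acceptance rule; at least one run is always kept.
    """
    ranked = sorted(simulations, key=lambda run: l2_distance(run[1], observed))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Hypothetical observed profile and three simulated runs.
observed_profile = [0.5, 0.3, 0.2]
runs = [
    ({"H0": 0.4}, [0.2, 0.4, 0.4]),
    ({"H0": 0.1}, [0.5, 0.3, 0.2]),
    ({"H0": 0.2}, [0.45, 0.35, 0.2]),
]
accepted = abc_reject(runs, observed_profile, keep_fraction=0.5)
print(accepted[0][0])
```

The parameter values of the accepted runs then approximate the posterior distribution for the model being evaluated.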
E. Demographic modeling
Input data used to produce Fig. 5, Table S3, and Table S4.
Demographic modeling results obtained using the scripts in https://github.com/LLN273/Complex-Polyploids/tree/main/11_Demographic_modeling
Folders:
08_Demographic_modeling/INPUT_DATA/
Files:
[Note: all files can be opened using a text editor, apart from compressed VCF files which can be viewed using the bash command "zcat myfile.VCF.gz"]
MASTER_SingleRuns_ALLsamples_GOODonly_SILVER_WHITE_PLATYPHYLLA.vcf.gz input VCF file
Bpendula.annotation-targetGenes_exons_CLEAN.bed B. pendula annotation file covering list of exons included in targeted exome probes
Bpendula.annotation-targetGenes_introns_CLEAN.bed B. pendula annotation file covering list of introns included in targeted exome probes
00_samples_SILVER_WHITE_PLATYPHYLLA_divergenceTime_XXXX_4_8.args List of samples included in analysis, with B. pubescens samples collected from XXXX population
presence_matrix_annotated_genes_pendula_XXXX-Samples_4.txt List of loci containing alleles exclusively of B. pendula and/or B. platyphylla ancestry; computed separately for each B. pubescens population
Nana-Finland-Enontekio.A002-2_snps-CLEAN.vcf.gz VCF file listing variants observed in B. nana Finland-Enontekio.A002-2 sample; used to estimate ancestral B. pendula allele
Populifolia-Canada.A005-11_snps-CLEAN.vcf.gz VCF file listing variants observed in B. populifolia Canada.A005-11 sample; used to estimate ancestral B. pendula allele
Occidentalis-Canada-Alberta.A009-17_snps-CLEAN.vcf.gz VCF file listing variants observed in B. occidentalis Canada-Alberta.A009-17 sample; used to estimate ancestral B. pendula allele
00_samples_ancestral_state_PENDULA_8.txt List of B. pendula samples included in analysis
00_samples_ancestral_state_PLATYPHYLA_8.txt List of B. platyphylla samples included in analysis
10_simulation_birch_3pop_divTime_model_YYY.est fastsimcoal2 est input file for model YYY
10_simulation_birch_3pop_divTime_model_YYY.tpl fastsimcoal2 tpl input file for model YYY
16_PAR_ending.txt file required when running script 16_fastsimcola26_confidence_intervals_MAIN.sh in https://github.com/LLN273/Complex-Polyploids/tree/main/11_Demographic_modeling
F. Analysis of B. pubescens alleles of B. nana or B. humilis origin: geographic distribution & functional analysis
Data used to produce Fig. 6, Fig. 7, and Table 1.
Produced using scripts in https://github.com/LLN273/Complex-Polyploids/tree/main/12_nana_humilis_alleles_geographic_distribution
Folders:
09_nana_humilis_alleles_geographic_distribution
Files (all files can be opened using a text editor):
db_hum_pop_ALL.txt Fraction of B. pubescens individuals in a population containing an allele of B. humilis origin
db_nana_pop_ALL.txt Fraction of B. pubescens individuals in a population containing an allele of B. nana origin
db_pend_pop_ALL.txt Fraction of B. pubescens individuals in a population containing an allele of B. pendula origin
ID_genes-XXXX_summary_ALT1.txt List of loci containing a B. pubescens allele of XXXX origin, for each B. pubescens sample, after polarizing the focal B. pubescens sequence (four polarizing geometries tested: ALT1, ALT2, ALT3, and ALT4)
presence_matrix_after_kmeans_annotated_genes.txt GO annotation for B. pubescens geneset of interest
presence_matrix_after_kmeans_hum_ALLgenes.txt Fraction of B. pubescens individuals in a population containing an allele of B. humilis origin (all genes); after k-means cluster analysis
presence_matrix_after_kmeans_hum_annotated_0.65_NEW2.txt Fraction of B. pubescens individuals in a population containing an allele of B. humilis origin (only genes where fraction > 0.65 is observed in at least one population); after k-means cluster analysis
presence_matrix_after_kmeans_nana_ALLgenes.txt Fraction of B. pubescens individuals in a population containing an allele of B. nana origin (all genes); after k-means cluster analysis
presence_matrix_after_kmeans_nana_annotated_0.65_NEW2.txt Fraction of B. pubescens individuals in a population containing an allele of B. nana origin (only genes where fraction > 0.65 is observed in at least one population); after k-means cluster analysis
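The 0.65 filter described for the `_annotated_0.65_NEW2` files (a gene is kept if the fraction exceeds 0.65 in at least one population) can be expressed as follows. This is a sketch of the selection rule only; the in-memory layout, the function name `genes_above_threshold`, and the gene identifiers and fractions in the demo are hypothetical and do not reflect the actual file format.

```python
def genes_above_threshold(presence, threshold=0.65):
    """Return genes whose fraction exceeds the threshold in at least one population.

    presence: {gene: {population: fraction of individuals carrying the allele}}
    """
    return [
        gene
        for gene, by_population in presence.items()
        if any(fraction > threshold for fraction in by_population.values())
    ]

# Hypothetical fractions for three genes in three populations.
demo_presence = {
    "geneA": {"Arctic": 0.80, "Spain": 0.10, "UA": 0.05},
    "geneB": {"Arctic": 0.30, "Spain": 0.65, "UA": 0.20},
    "geneC": {"Arctic": 0.10, "Spain": 0.20, "UA": 0.70},
}
selected = genes_above_threshold(demo_presence)
print(selected)
```

Note that the comparison is strict, so a gene whose highest fraction is exactly 0.65 (geneB in the demo) is not retained.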
Methods
Analysis based on 269 newly sequenced B. pubescens individuals collected from twenty-three locations across the species' range. Additional accessions were obtained from the European Nucleotide Archive (ENA) for B. lenta, B. nana, B. occidentalis, B. pendula, B. platyphylla, B. populifolia, B. pubescens, Alnus glutinosa, and A. incana, at www.ebi.ac.uk, under the PRJEB14544 accession code. For the newly collected samples, individual libraries were generated using custom probes designed for targeted exome capture and sequenced in paired-end mode (150 bp) on an Illumina NovaSeq 6000 platform.