Benefits and limits of phasing alleles for network inference of allopolyploid complexes
Abstract
Accurately reconstructing the reticulate histories of polyploids remains a central challenge for understanding plant evolution. Although phylogenetic networks can provide insights into relationships among polyploid lineages, inferring networks may be hindered by the complexities of homology determination in polyploid taxa. We use simulations to show that phasing alleles from allopolyploid individuals can improve phylogenetic network inference under the multispecies coalescent by obtaining the true network with fewer loci compared to haplotype consensus sequences or sequences with heterozygous bases represented as ambiguity codes. Phased allelic data can also improve divergence time estimates for networks, which is helpful for evaluating allopolyploid speciation hypotheses and proposing mechanisms of speciation. To achieve these outcomes in empirical data, we present a novel pipeline that leverages a recently developed phasing algorithm to reliably phase alleles from polyploids. This pipeline is especially appropriate for target enrichment data, where depth of coverage is typically high enough to phase entire loci. We provide an empirical example in the North American Dryopteris fern complex that demonstrates insights from phased data as well as the challenges of network inference. We establish that our pipeline (PATÉ: Phased Alleles from Target Enrichment data) is capable of recovering a high proportion of phased loci from both diploids and polyploids. These data may improve network estimates compared to using haplotype consensus assemblies by accurately inferring the direction of gene flow, but statistical non-identifiability of phylogenetic networks poses a barrier to inferring the evolutionary history of reticulate complexes.
README: Benefits and Limits of Phasing Alleles for Network Inference of Allopolyploid Complexes
https://doi.org/10.5061/dryad.5qfttdz53
Contents include the simulated sequence data, the empirical Dryopteris data, and some of the files used for analyses. Most of the code used throughout the manuscript is cited and available on GitHub. However, we include a static version of PATÉ as used at the time of writing, along with the control file options used for our analyses.
Description of the data and file structure
simulation
t1 - simulations for tau_h = 0.001 and theta = tau_h
t2 - simulations for tau_h = 0.001 and theta = 2 * tau_h
t3 - simulations for tau_h = 0.001 and theta = 3 * tau_h
t4 - simulations for tau_h = 0.001 and theta = 4 * tau_h
t5 - simulations for tau_h = 0.01 and theta = tau_h
t6 - simulations for tau_h = 0.01 and theta = 2 * tau_h
t7 - simulations for tau_h = 0.01 and theta = 3 * tau_h
t8 - simulations for tau_h = 0.01 and theta = 4 * tau_h
t9 - simulations for tau_h = 0.1 and theta = tau_h
t10 - simulations for tau_h = 0.1 and theta = 2 * tau_h
t11 - simulations for tau_h = 0.1 and theta = 3 * tau_h
t12 - simulations for tau_h = 0.1 and theta = 4 * tau_h
consensus.pl - script for creating random haplotype consensus data from simulated sequences
mkphy.pl - reformat simulated sequences to phylip format for gene tree estimation
pickone.pl - randomly select only one simulated sequence per species
pn.consensus.jl.template - julia script used for SNaQ with consensus data
pn.phase.jl.template - julia script used for phased data
pn.pickone.jl.template - julia script used for pick one data
pn.true.jl.template - julia script used for known gene trees
pn.unphase.jl.template - julia script used for sequences with IUPAC codes
pnDriver.pl - creates the julia scripts for various simulated data sets
unphase.pl - collapses the simulated sequences per species into a single sequence with IUPAC codes
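As an illustration of the collapsing step that unphase.pl is described as performing, the following is a minimal Python sketch, not the deposited Perl script. It assumes the sequences for one species are already aligned and of equal length. The function name collapse_to_iupac and the handling of gaps (any column mixing a gap with a base becomes N) are assumptions made for this example only.

```python
# Hypothetical illustration; not the authors' unphase.pl.
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of observed bases.
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
    "-": "-",
}
IUPAC_CODES = {frozenset(bases): code for bases, code in _CODES.items()}


def collapse_to_iupac(sequences):
    """Collapse aligned sequences into one sequence with IUPAC codes.

    Each alignment column is reduced to the single code that represents
    every base observed in that column.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    collapsed = []
    for column in zip(*sequences):
        observed = frozenset(base.upper() for base in column)
        # Assumption: anything not covered above (e.g. gap plus base) becomes N.
        collapsed.append(IUPAC_CODES.get(observed, "N"))
    return "".join(collapsed)


if __name__ == "__main__":
    # Two simulated alleles from one species, differing at two sites.
    print(collapse_to_iupac(["ACGT", "GCTT"]))
```

Sites where all sequences agree are passed through unchanged, so only heterozygous positions are replaced by ambiguity codes.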
PATE
- phasedData - A copy of PATE at the time of writing. See more information at https://github.com/gtiley/Phasing
- PATE.pl - the main PATE script
- PATE.ctl - the PATE control file
- ploidy.txt - the ploidy file
- template.sh - the template submission script
- helperScripts - helper functions
- summaryStatsOutput-ALL - summary statistics for individuals
- referenceSequences
- .fasta - the supercontig fasta output for all individuals by locus
Dryopteris_analyses
all_taxa
- consensus - alignments, gene trees, species trees, and networks for consensus data
- genotype - alignments, gene trees, species trees, and networks for genotyped data
- phased - alignments, gene trees, species trees, and networks for phased data
- pickone - alignments, gene trees, species trees, and networks for pickone data
five_taxa - similar structure to above with alignments, gene trees, and networks, but with an extra directory level for the three focal allopolyploid tests
- consensus
- CGC - D. clintoniana is focal allopolyploid
- IEC - D. campyloptera is focal allopolyploid
- LGC - D. celsa is focal allopolyploid
- genotype
- phased
- pickone
three_taxa
- consensus
- Alignments - alignment files
- CGC.bppDat - alignments reformatted into BPP's phylip format
- CGC.spMap - the individual to species map for BPP
- Converge - Independent runs for models from BPP used to check convergence and estimate parameters prior to Bayes factor analyses.
- 1 - model 1
- 2 - model 2
- 3 - model 3
- 4 - model 4
- 5 - model 5
- 6 - model 6
- genctl.pl - generate the control files for all models
- runBPP.pl - create submission scripts and run BPP
- genotype
- phased
- pickone
Sharing/Access information
Raw reads for empirical analyses are associated with NCBI BioProject PRJNA725004. Please consult the manuscript for details.
Code/Software
Additional manuscript resources can be found at:
Usage notes
Included is a single pdf of the supplementary material as well as a tarball with all of the data used in the manuscript.
Download the tarball; the readme inside details its contents, which include:
1) Control files and simulated sequence data for the single allotetraploid with BPP
2) A static release of the PATÉ pipeline used for phasing Dryopteris sequences in the manuscript
3) Dryopteris data used for analyses
4) BPP Control files, Julia scripts, and notes for repeating some of the empirical analyses
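When working with the simulated data, it can help to recover the parameter values from the directory names. The simulation directories t1 through t12 listed in the file structure follow a regular grid: three values of tau_h, each paired with theta equal to one to four times tau_h. The Python sketch below rebuilds that mapping. The function name simulation_grid and the dictionary layout are hypothetical conveniences and are not part of the deposited scripts.

```python
from itertools import product

# Values taken from the directory descriptions in this README.
TAU_H_VALUES = [0.001, 0.01, 0.1]
THETA_MULTIPLIERS = [1, 2, 3, 4]


def simulation_grid():
    """Map each simulation directory name (t1..t12) to its tau_h and theta.

    Directories are numbered with tau_h varying slowest and the theta
    multiplier varying fastest, matching the listing in this README.
    """
    grid = {}
    combos = product(TAU_H_VALUES, THETA_MULTIPLIERS)
    for index, (tau_h, multiplier) in enumerate(combos, start=1):
        grid[f"t{index}"] = {"tau_h": tau_h, "theta": multiplier * tau_h}
    return grid


if __name__ == "__main__":
    for name, params in simulation_grid().items():
        print(name, params["tau_h"], params["theta"])
```

Because theta is computed as a floating-point product, compare values with a tolerance rather than exact equality when filtering on theta.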