# Data from: Genetic signatures of lineage fusion closely resemble population decline

## Cite this dataset

Garrick, Ryan (2023). Data from: Genetic signatures of lineage fusion closely resemble population decline [Dataset]. Dryad. https://doi.org/10.5061/dryad.1jwstqk11

## Abstract

Accurate interpretation of the genetic signatures of past demographic events is crucial for reconstructing evolutionary history. Lineage fusion (complete merging, resulting in a single panmictic population) is a special case of secondary contact that is seldom considered. Here, the circumstances under which lineage fusion can be distinguished from population size constancy, growth, bottleneck, and decline were investigated. Multi-locus haplotype data were simulated under models of lineage fusion with different divergence versus sampling lag times (D:L ratios). These pseudo-observed datasets also differed in their allocation of a fixed amount of sequencing resources (number of sampled alleles, haplotype length, number of loci). Distinguishability of lineage fusion versus each of 10 untrue non-fusion scenarios was quantified based on six summary statistics (neutrality tests). Some datasets were also analyzed using extended Bayesian skyline plots. Results showed that signatures of lineage fusion very closely resemble those of decline—high distinguishability was generally limited to the most favorable scenario (D:L = 9), using the most sensitive summary statistics (*F*_{S} and *Z*_{nS}), coupled with the optimal sequencing resource allocation (maximizing number of loci). Also, extended Bayesian skyline plots often erroneously inferred population decline. Awareness of the potential for lineage fusion to carry the hallmarks of population decline is critical.

## README: Genetic signatures of lineage fusion closely resemble population decline

All pseudo-observed datasets (PODs) were simulated in DIY-ABC v2.1.0 (Cornuet et al., 2014) using the HKY model of nucleotide sequence evolution (Hasegawa et al., 1985) with proportion of invariant sites = 10% and gamma = 2.0, and a mutation rate (*µ*) of 1×10^{-7} substitutions per site per generation.

Summary statistic distributions were estimated via 1,000 simulations using the “Coalescent Simulations (n-loci | 1-pop)” feature in DnaSP V6.12.03 (Rozas et al., 2017). Simulations were seeded using values for θ per gene (Watterson, 1975), number of sampled individuals, DNA sequence haplotype length, and number of diploid autosomal loci from the associated POD.
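
For readers who want to check a θ value used to seed a simulation, the sketch below computes Watterson's (1975) estimator per gene (per locus, not per site) from the number of segregating sites and the number of sampled alleles. The function name and arguments are illustrative assumptions; the README does not describe how DnaSP reports or rounds this value.

```python
def watterson_theta_per_gene(num_segregating_sites: int, num_sampled_alleles: int) -> float:
    """Watterson's estimator per gene: theta_W = S / a_n,
    where a_n = sum over i = 1 .. n-1 of 1/i and n is the number of sampled alleles."""
    if num_sampled_alleles < 2:
        raise ValueError("at least two sampled alleles are required")
    if num_segregating_sites < 0:
        raise ValueError("number of segregating sites cannot be negative")
    # Harmonic-number correction for sample size
    a_n = sum(1.0 / i for i in range(1, num_sampled_alleles))
    return num_segregating_sites / a_n


# Hypothetical example: 11 segregating sites observed among 4 sampled alleles
theta = watterson_theta_per_gene(11, 4)
```

Dividing the result by the haplotype length in base pairs would give the per-site value, should that be needed for comparison with other software.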

A subset of PODs were analyzed using Extended Bayesian Skyline Plots (EBSPs; Heled & Drummond, 2008), implemented in BEAST V2.7.3 (Bouckaert et al., 2014), using the true model of nucleotide sequence evolution and mutation rate that generated the PODs, clock model = strict, operator weights = auto-optimized, with other priors as default. Searches were conducted using 2.5×10^{8} Markov Chain Monte Carlo generations, sampling parameters every 5,000^{th} step, discarding 10% as burn-in.
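
As a quick check on the size of each EBSP posterior sample, the arithmetic implied by these run settings (2.5×10^{8} generations, as also given in the Methods section, sampled every 5,000 steps with 10% burn-in) can be sketched as follows. The function and variable names are illustrative, and the count ignores whether the initial state is also written to the log.

```python
def retained_samples(chain_length: int, sample_every: int, burnin_percent: int):
    """Return (total logged samples, samples discarded as burn-in, samples retained)."""
    total = chain_length // sample_every
    # Integer arithmetic avoids floating-point rounding in the burn-in count
    discarded = total * burnin_percent // 100
    return total, discarded, total - discarded


total, discarded, kept = retained_samples(
    chain_length=250_000_000,  # 2.5 x 10^8 MCMC generations
    sample_every=5_000,
    burnin_percent=10,
)
print(total, discarded, kept)  # 50000 5000 45000
```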

### Description of the data and file structure

Folder: Simulated_PODs.zip

Simulated pseudo-observed datasets (PODs) composed of aligned multi-locus DNA haplotypes, generated using DIY-ABC v2.1.0 (Cornuet et al., 2014).

PODs were simulated under four alternative lineage fusion scenarios that differed in their pre-fusion divergence time vs. post-fusion sampling lag time (D:L ratios = 1, 3, 5 or 9), each represented by a different folder. See the main text for full details on simulation parameter settings. Hierarchically nested subfolders then partition PODs according to number of sampled alleles (n#), followed by haplotype alignment length in base pairs (#bp) and number of independent diploid autosomal loci (#loc). PODs are provided in .alleles format, which is compatible with DnaSP V6.12.03 (Rozas et al., 2017).

Note that there are five replicate PODs for datasets containing 40 sampled alleles (n40) coupled with 50 loci (50loc), as this expanded set was used for demographic hypothesis testing (e.g., Extended Bayesian Skyline Plot [EBSP] analyses; Heled & Drummond, 2008). In these cases, file names also include a suffix indicating replicate number (001, 002... 005.alleles).

References:

Cornuet J-M, Pudlo P, Veyssier J, Dehne-Garcia A, Gautier M, Leblois R, Marin J-M, Estoup A (2014) DIYABC v2.0: A software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data. Bioinformatics, 30, 1187–1189.

Heled, J., & Drummond, A. J. (2008). Bayesian inference of population size history from multiple loci. BMC Evolutionary Biology, 8, 289.

Rozas, J., Ferrer-Mata, A., Sanchez-DelBarrio, J. C., Guirao-Rico, S., Librado, P., Ramos-Onsins, S. E., & Sanchez-Gracia, A. (2017). DnaSP 6: DNA sequence polymorphism analysis of large data sets. Molecular Biology and Evolution, 34, 3299–3302.

Excel spreadsheet: Summary_statistic_CIs

Spreadsheet with all estimated summary statistic 90% confidence intervals (CIs).

Pseudo-observed datasets (PODs) were simulated under four alternative lineage fusion scenarios that differed in their pre-fusion divergence time vs. post-fusion sampling lag time (D:L ratios). Data associated with each of these are contained within a separate worksheet: DL1, DL3, DL5 and DL9.

Within a worksheet, each POD and demographic scenario are labeled with the following elements: DL ratio (DL#), number of sampled alleles (n#), haplotype alignment length in base pairs (#bp), number of independent diploid autosomal loci (#loc), and demographic scenario under which summary statistic distributions were estimated via the “Coalescent Simulations (n-loci | 1-pop)” feature in DnaSP V6.12.03 (Rozas et al., 2017), seeded using basic characteristics of a given POD. These scenarios are abbreviated as follows: size constancy (Const), growth (Grow, with suffix 2, 3 or 4 indicating intensity of 2×, 3×, or 4× the base Ne), bottleneck (Bott, with suffix 2, 3 or 4 indicating intensity of 0.5×, 0.33×, or 0.25× the base Ne), decline (Decn, with suffix 2, 3 or 4 indicating intensity of 0.5×, 0.33×, or 0.25× the base Ne), or lineage fusion (Fuse, the "true" scenario).
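
The scenario abbreviations can be decoded programmatically when processing the worksheets. The sketch below is a minimal illustration that assumes the suffix is written directly after the abbreviation (e.g., "Grow3") and that intensities are multipliers of the base Ne; the exact label spelling in the spreadsheet should be checked, and the function and dictionary names are hypothetical.

```python
import re

# Abbreviation -> full scenario name, as listed in the README
SCENARIO_NAMES = {
    "Const": "size constancy",
    "Grow": "growth",
    "Bott": "bottleneck",
    "Decn": "decline",
    "Fuse": "lineage fusion",
}

# Suffix -> intensity, read here as a multiplier of the base Ne (assumption)
INTENSITIES = {
    "Grow": {2: 2.0, 3: 3.0, 4: 4.0},
    "Bott": {2: 0.5, 3: 0.33, 4: 0.25},
    "Decn": {2: 0.5, 3: 0.33, 4: 0.25},
}


def decode_scenario(code: str):
    """Return (scenario name, intensity or None) for a label such as 'Grow3'."""
    match = re.fullmatch(r"(Const|Grow|Bott|Decn|Fuse)([234]?)", code)
    if match is None:
        raise ValueError(f"unrecognized scenario label: {code!r}")
    abbreviation, suffix = match.groups()
    if abbreviation in INTENSITIES:
        if not suffix:
            raise ValueError(f"{abbreviation} requires a suffix of 2, 3 or 4")
        return SCENARIO_NAMES[abbreviation], INTENSITIES[abbreviation][int(suffix)]
    if suffix:
        raise ValueError(f"{abbreviation} does not take an intensity suffix")
    return SCENARIO_NAMES[abbreviation], None
```

For example, `decode_scenario("Decn4")` returns `("decline", 0.25)` and `decode_scenario("Fuse")` returns `("lineage fusion", None)`.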

Within a worksheet, the estimated 90% CI for each of six summary statistics (see Table S1 associated with the main text for explanation of notation) is reported as the lower 5% and upper 95% bound. The proportion (Prop.) of the lineage fusion (Fuse) scenario's 90% CI that is unique (Unq), compared to a given non-fusion scenario's 90% CI, was calculated in two steps: first at the lower tail (Lower Unq), and then at the upper tail (Upper Unq). These two proportions were then summed to generate the total (Tot) proportion of the lineage fusion scenario's 90% CI that is unique.
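
The two-step calculation described above can be sketched as an interval comparison. This is one reading of the description, not the spreadsheet's actual formulas: it assumes that "unique" means the part of the Fuse interval lying outside the non-fusion interval, and the function name and tuple layout are illustrative.

```python
def unique_proportions(fuse_ci, other_ci):
    """Proportion of the Fuse 90% CI not shared with a non-fusion 90% CI.

    Each CI is a (lower 5% bound, upper 95% bound) tuple.
    Returns (Lower Unq, Upper Unq, Tot).
    """
    fuse_lo, fuse_hi = fuse_ci
    other_lo, other_hi = other_ci
    width = fuse_hi - fuse_lo
    if width <= 0:
        raise ValueError("the Fuse CI must have positive width")
    # Part of the Fuse CI that lies below the other scenario's lower bound
    lower_unq = max(0.0, min(fuse_hi, other_lo) - fuse_lo) / width
    # Part of the Fuse CI that lies above the other scenario's upper bound
    upper_unq = max(0.0, fuse_hi - max(fuse_lo, other_hi)) / width
    return lower_unq, upper_unq, lower_unq + upper_unq


# Hypothetical bounds: the Fuse CI extends beyond the other CI at both tails
lower, upper, total = unique_proportions((0.0, 10.0), (2.0, 8.0))
```

Under this reading, a total of 0 means the Fuse interval is entirely contained in the non-fusion interval, and a total of 1 means the two intervals do not overlap.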

References:

Rozas, J., Ferrer-Mata, A., Sanchez-DelBarrio, J. C., Guirao-Rico, S., Librado, P., Ramos-Onsins, S. E., & Sanchez-Gracia, A. (2017). DnaSP 6: DNA sequence polymorphism analysis of large data sets. Molecular Biology and Evolution, 34, 3299–3302.

Folder: EBSP_analyses.zip

This folder contains files used to run Extended Bayesian Skyline Plot (EBSP) analyses (Heled & Drummond, 2008).

Pseudo-observed datasets (PODs) were simulated under four alternative lineage fusion scenarios that differed in their pre-fusion divergence time vs. post-fusion sampling lag time (D:L ratios = 1, 3, 5 or 9). All PODs analyzed via EBSPs were composed of 40 sampled alleles (n40), 400 base pair haplotype alignments (400bp), and 50 independent diploid autosomal loci (50loc). Subfolder names (e.g., "DL1_n40_400bp_50loc") reflect these genetic dataset characteristics.
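
Subfolder names of this form can be split into their components when scripting over the archive. The sketch below assumes every subfolder follows the pattern of the single example given ("DL1_n40_400bp_50loc"); the class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class PodLabel(NamedTuple):
    dl_ratio: int      # divergence : sampling lag ratio (DL#)
    n_alleles: int     # number of sampled alleles (n#)
    haplotype_bp: int  # haplotype alignment length in base pairs (#bp)
    n_loci: int        # number of independent diploid autosomal loci (#loc)


_LABEL_PATTERN = re.compile(r"DL(\d+)_n(\d+)_(\d+)bp_(\d+)loc")


def parse_pod_label(label: str) -> PodLabel:
    """Parse a subfolder name such as 'DL1_n40_400bp_50loc'."""
    match = _LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"unexpected subfolder name: {label!r}")
    return PodLabel(*(int(group) for group in match.groups()))


label = parse_pod_label("DL1_n40_400bp_50loc")
# label.dl_ratio == 1, label.n_alleles == 40,
# label.haplotype_bp == 400, label.n_loci == 50
```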

Five replicate PODs per D:L scenario were simulated, and .xml files for each (suffix 001, 002... 005.xml) were prepared using BEAUti, and then run with BEAST V2.7.3 (Bouckaert et al., 2014). See the main text for details on EBSP run settings and interpretation of output.

References:

Bouckaert, R., Heled, J., Kühnert, D., Vaughan, T., Wu, C. -H., Xie, D., Drummond, A. J. (2014). BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Computational Biology, 10, e1003537.

Heled, J., & Drummond, A. J. (2008). Bayesian inference of population size history from multiple loci. BMC Evolutionary Biology, 8, 289.

### Sharing/Access information

Data was derived from the following sources: DNA sequences were simulated using the software DIY-ABC v2.1.0 (Cornuet et al., 2014).

## Methods

All pseudo-observed datasets (PODs) were simulated in DIY-ABC v2.1.0 (Cornuet et al., 2014) using the HKY model of nucleotide sequence evolution (Hasegawa et al., 1985) with proportion of invariant sites = 10% and gamma = 2.0, and a mutation rate (*µ*) of 1×10^{-7} substitutions per site per generation. Summary statistic distributions were estimated via 1,000 simulations using the “Coalescent Simulations (n-loci | 1-pop)” feature in DnaSP V6.12.03 (Rozas et al., 2017). Simulations were seeded using values for θ per gene (Watterson, 1975), number of sampled individuals, DNA sequence haplotype length, and number of diploid autosomal loci from the associated POD. A subset of PODs were analyzed using Extended Bayesian Skyline Plots (EBSPs; Heled & Drummond, 2008), implemented in BEAST V2.7.3 (Bouckaert et al., 2014), using the true model of nucleotide sequence evolution and mutation rate that generated the PODs, clock model = strict, operator weights = auto-optimized, with other priors as default. Searches were conducted using 2.5×10^{8} Markov Chain Monte Carlo generations, sampling parameters every 5,000^{th} step, discarding 10% as burn-in.

## Funding

National Science Foundation, Award: 1738817, EPSCoR RII Track-4