SNP matrices and vcf files for phylogenetic, genetic structure and historical demographic analyses of Podocarpus from Hispaniola
Nieto-Blazquez, Maria Esther (2021), SNP matrices and vcf files for phylogenetic, genetic structure and historical demographic analyses of Podocarpus from Hispaniola, Dryad, Dataset, https://doi.org/10.5061/dryad.37pvmcvm3
Aim: Hispaniola is the second largest island in the Caribbean and a hotspot of biodiversity. The island was formed by the fusion of a northern and a southern palaeo-island during the mid-Miocene (15 Ma). The historical split of Hispaniola, together with repeated marine incursions during the Pleistocene, is known to have influenced lineage divergence and genetic structure in a few birds and mammals, but the effect on vascular plants is less well understood. The conifer genus Podocarpus has two species, P. hispaniolensis and P. buchii, that are endemic to the mountains of Hispaniola and are listed as endangered by the IUCN. The former occurs in the mountains of the north and the latter in the south, with a region of sympatry in the Central Cordillera. Here we evaluate the historical split of the two palaeo-islands and the repeated marine incursions as dispersal barriers shaping the geographical distribution of genetic diversity, genetic structure, divergence patterns, and the historical demography of the two species.
Location: Hispaniola island, Caribbean.
Methods: Using genotyping by sequencing of 47 Podocarpus samples, we identified two sets of single nucleotide polymorphisms for our analyses (74,260 and 22,657 SNPs).
Results: The results show a population genetic structure that corresponds to the geographic distribution of the species in mountainous areas. Podocarpus in Hispaniola followed a stepping-stone colonization pattern with bottlenecks at each mountain colonization event.
Main conclusions: The historical events in question did not seem to have influenced the genetic structure, diversity, or demography of Podocarpus; instead, the current geographic barriers imposed by lowland xeric valleys did. The clear divergence between species, together with the elevated within-population genetic diversity and significant genetic structure, calls for multi-population in situ conservation of each species.
We conducted SNP discovery with Stacks 1.47 (Catchen et al. 2013). We demultiplexed and filtered sequencing reads using the process_radtags program. We trimmed reads to 64 bp and eliminated reads with uncalled bases and reads with low quality scores (Phred score below 10). Since we did not have a reference genome, we conducted a ‘de novo’ assembly of loci with the module ‘ustacks’. In this step, reads were aligned into matching blocks, or stacks, for each sample. The stacks were then compared to produce a set of putative loci, and SNPs were detected at each locus. We followed the recommendations of Paris et al. (2017) on the selection of parameter values throughout the pipeline, keeping those that rendered the highest number of loci. We used -m (minimum depth of coverage to create a stack) = 5; -M (maximum distance allowed between stacks) = 3; -p (number of threads for parallel execution) = 15; and the --gapped option, which allows gapped alignments between stacks. Subsequently, the module ‘cstacks’ built a catalog from the loci produced by ‘ustacks’, merging loci with at most 3 mismatches between them (-n = 3) and running on 15 threads (-p = 15). The module ‘sstacks’ then matched the loci produced by ‘ustacks’ against the catalog produced by ‘cstacks’, again on 15 threads (-p = 15). We used the ‘populations’ module of Stacks to obtain population-level summary statistics, genetic diversity indexes (i.e. observed and expected heterozygosity, nucleotide diversity), FST values, and a SNP dataset. We chose the following parameters: -p (minimum number of populations containing each locus) = 1; -r (minimum proportion of individuals in a population required to process a locus for that population) = 0.4; and --min_maf (minimum minor allele frequency) = 0.1. We ran the ‘populations’ module of Stacks twice.
In the first run we assigned samples to the 11 collection sites plus the outgroup, and in the second we assigned samples to five populations. The first SNP dataset was used for the phylogenetic reconstruction, and the second for the genetic structure and demographic history analyses.
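The parameter choices described above can be summarized as command lines. The following Python sketch only assembles the argument lists for the Stacks 1.x modules and does not run them; it is an illustration under stated assumptions, not the exact commands used to produce this dataset. The file paths, sample names, batch ID, input file type (gzfastq), population map file names, and the --fstats and --vcf output switches are assumptions that do not appear in the description above.

```python
import shlex
from typing import List

THREADS = 15   # -p in ustacks, cstacks and sstacks (from the description)
BATCH_ID = 1   # assumption: Stacks 1.x needs a batch ID; the value is hypothetical


def ustacks_cmd(sample_file: str, sample_id: int, out_dir: str) -> List[str]:
    """De novo stack building for one sample: -m 5, -M 3, gapped alignments."""
    return [
        "ustacks", "-t", "gzfastq",          # assumption: gzipped FASTQ input
        "-f", sample_file, "-o", out_dir, "-i", str(sample_id),
        "-m", "5", "-M", "3", "-p", str(THREADS), "--gapped",
    ]


def cstacks_cmd(sample_prefixes: List[str], out_dir: str) -> List[str]:
    """Catalog construction allowing at most 3 mismatches between loci (-n 3)."""
    cmd = ["cstacks", "-b", str(BATCH_ID), "-o", out_dir,
           "-n", "3", "-p", str(THREADS)]
    for prefix in sample_prefixes:
        cmd += ["-s", prefix]
    return cmd


def sstacks_cmd(sample_prefix: str, out_dir: str) -> List[str]:
    """Match one sample's loci against the catalog."""
    return ["sstacks", "-b", str(BATCH_ID),
            "-c", f"{out_dir}/batch_{BATCH_ID}",
            "-s", sample_prefix, "-o", out_dir, "-p", str(THREADS)]


def populations_cmd(stacks_dir: str, popmap: str) -> List[str]:
    """Population statistics and SNP export with -p 1, -r 0.4, --min_maf 0.1."""
    return ["populations", "-b", str(BATCH_ID), "-P", stacks_dir, "-M", popmap,
            "-p", "1", "-r", "0.4", "--min_maf", "0.1",
            "--fstats", "--vcf"]             # assumption: output switches


if __name__ == "__main__":
    samples = ["sample_01", "sample_02"]     # hypothetical sample names
    out = "stacks_out"                       # hypothetical output directory
    for i, name in enumerate(samples, start=1):
        print(shlex.join(ustacks_cmd(f"reads/{name}.fq.gz", i, out)))
    print(shlex.join(cstacks_cmd([f"{out}/{s}" for s in samples], out)))
    for name in samples:
        print(shlex.join(sstacks_cmd(f"{out}/{name}", out)))
    # Two runs of 'populations', one per sample-to-population assignment.
    for popmap in ["popmap_sites.txt", "popmap_five_pops.txt"]:  # hypothetical
        print(shlex.join(populations_cmd(out, popmap)))
```

Running the script prints one command per line, which makes it straightforward to check that the two ‘populations’ runs differ only in the population map supplied with -M.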