Data from: Demographic model selection using random forests and the site frequency spectrum
Data files
Jul 21, 2017 version files 9.92 GB
- Barcodes_Grp1_Mar2016_MEC-17-0128.txt (80 B)
- Barcodes_Grp1_Nov2015_MEC-17-0128.txt (22 B)
- Barcodes_Grp2_Mar2016_MEC-17-0128.txt (101 B)
- Barcodes_Grp3_Mar2016_MEC-17-0128.txt (109 B)
- Barcodes_Grp4_Mar2016_MEC-17-0128.txt (97 B)
- Barcodes_Grp5_Mar2016_MEC-17-0128.txt (115 B)
- Grp1_i06_Nov2015.fastq.gz (920.68 MB)
- Grp1_i12_Mar2016.fastq.gz (350.35 MB)
- Grp2_i03_Mar2016.fastq.gz (940.93 MB)
- Grp3_i04_Mar2016.fastq.gz (450.11 MB)
- Grp4_i05_Mar2016.fastq.gz (427.23 MB)
- Grp5_i06_Mar2016.fastq.gz (514.51 MB)
- Haplo_July2016_77Samples_p60.alleles (14.22 MB)
- Haplo_July2016_77Samples_p60.loci (7.18 MB)
- Haplo_July2016_77Samples_p60.snps (928.58 KB)
- Haplo_July2016_77Samples_p60.unlinked_snps (134.22 KB)
- i1_barcodes_July2016_MEC-17-0128.txt (113 B)
- i12_barcodes_July2016_MEC-17-0128.txt (69 B)
- i2_barcodes_July2016_MEC-17-0128.txt (94 B)
- i3_barcodes_July2016_MEC-17-0128.txt (116 B)
- i4_barcodes_July2016_MEC-17-0128.txt (108 B)
- i5_barcodes_July2016_MEC-17-0128.txt (116 B)
- i6_barcodes_July2016_MEC-17-0128.txt (112 B)
- i7_barcodes_July2016_MEC-17-0128.txt (110 B)
- i8_barcodes_July2016_MEC-17-0128.txt (96 B)
- params_ex.txt (3.82 KB)
- ZBCi1-V1T-1_S42_L006_R1_001.fastq.gz (520.58 MB)
- ZBCi12-V1T-1_S50_L006_R1_001.fastq.gz (1.07 GB)
- ZBCi2-V1T-1_S43_L006_R1_001.fastq.gz (646.16 MB)
- ZBCi3-V1T-1_S44_L006_R1_001.fastq.gz (602.50 MB)
- ZBCi4-V1T-1_S45_L006_R1_001.fastq.gz (654.09 MB)
- ZBCi5-V1T-1_S46_L006_R1_001.fastq.gz (789.09 MB)
- ZBCi6-V1T-1_S47_L006_R1_001.fastq.gz (540.10 MB)
- ZBCi7-V1T-1_S48_L006_R1_001.fastq.gz (699.35 MB)
- ZBCi8-V1T-1_S49_L006_R1_001.fastq.gz (775.16 MB)
Abstract
Phylogeographic data sets have grown from tens to thousands of loci in recent years, but extant statistical methods do not take full advantage of these large data sets. For example, approximate Bayesian computation (ABC) is a commonly used method for the explicit comparison of alternative demographic histories, but it is limited by the “curse of dimensionality” and issues related to the simulation and summarization of data when applied to next-generation sequencing (NGS) data sets. We implement here several improvements to overcome these difficulties. We use a Random Forest (RF) classifier for model selection to circumvent the curse of dimensionality and apply a binned representation of the multidimensional site frequency spectrum (mSFS) to address issues related to the simulation and summarization of large SNP data sets. We evaluate the performance of these improvements using simulation and find low overall error rates (~7%). We then apply the approach to data from Haplotrema vancouverense, a land snail endemic to the Pacific Northwest of North America. Fifteen demographic models were compared, and our results support a model of recent dispersal from coastal to inland rainforests. Our results demonstrate that binning is an effective strategy for the construction of an mSFS and imply that the statistical power of RF when applied to demographic model selection is at least comparable to that of traditional ABC algorithms. Importantly, by combining these strategies, large sets of models with differing numbers of populations can be evaluated.
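To make the idea of a binned mSFS concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each SNP is given as a tuple of derived-allele counts (one per population) and that each population's allele frequency is split into equal-width classes; the function names `bin_index` and `binned_msfs`, the default of four classes, and the equal-width scheme are all illustrative assumptions.

```python
from collections import Counter
from itertools import product


def bin_index(count, n_chrom, n_bins):
    """Map an allele count in [0, n_chrom] to one of n_bins equal-width
    frequency classes (hypothetical binning scheme, assumed for illustration)."""
    if not 0 <= count <= n_chrom:
        raise ValueError("allele count outside [0, n_chrom]")
    # A frequency of exactly 1.0 is placed in the last class.
    return min(count * n_bins // n_chrom, n_bins - 1)


def binned_msfs(snp_counts, n_chroms, n_bins=4):
    """Return a flat vector of SNP counts, one entry per combination of
    per-population frequency classes.

    snp_counts: iterable of tuples, one derived-allele count per population
    n_chroms:   number of sampled chromosomes in each population
    """
    tally = Counter(
        tuple(bin_index(c, n, n_bins) for c, n in zip(snp, n_chroms))
        for snp in snp_counts
    )
    # Enumerate cells in a fixed order so that every data set, observed or
    # simulated, yields a vector of the same length (n_bins ** populations).
    cells = product(range(n_bins), repeat=len(n_chroms))
    return [tally.get(cell, 0) for cell in cells]


# Toy example: four SNPs, two populations with 10 sampled chromosomes each.
snps = [(0, 3), (10, 10), (5, 1), (0, 3)]
vector = binned_msfs(snps, n_chroms=(10, 10), n_bins=4)
print(len(vector), sum(vector))
```

With two populations and four classes the vector has 16 entries regardless of sample size, which is what keeps the summary small as the number of sampled chromosomes grows. Vectors of this kind, computed for data simulated under each candidate model, could then serve as the feature rows for a classifier such as a random forest.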