Dryad

The choices we make and the impacts they have: Machine learning and species delimitation in North American box turtles (Terrapene spp.)

Cite this dataset

Martin, Bradley T. et al. (2020). The choices we make and the impacts they have: Machine learning and species delimitation in North American box turtles (Terrapene spp.) [Dataset]. Dryad. https://doi.org/10.5061/dryad.xgxd254fc

Abstract

Model-based approaches that attempt to delimit species are hampered by computational limitations as well as the unfortunate tendency of users to disregard algorithmic assumptions. Alternatives are clearly needed, and machine learning (M-L) is attractive in this regard because it functions without the need to explicitly define a species concept. Unfortunately, its performance varies with the choice of bioinformatic parameters. Herein, we gauge the effectiveness of M-L-based species-delimitation algorithms by parsing 64 variably-filtered versions of a ddRAD-derived SNP dataset involving North American box turtles (Terrapene spp.). Our filtering strategies included: (A) minor allele frequency (MAF) thresholds of 5%, 3%, 1%, and 0% (= none), and (B) maximum missing data per individual and per population of 25%, 50%, 75%, and 100% (= none). We found that species-delimitation via unsupervised M-L impacted the signal-to-noise ratio in our data, as well as the discordance among resolved clades. The latter may also reflect biogeographic history, gene flow, incomplete lineage sorting, or combinations thereof (as corroborated by previously observed patterns of differential introgression). Our results substantiate M-L as a viable species-delimitation method, but also demonstrate how commonly observed patterns of phylogenetic discord can seriously impact M-L-classification.
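The count of 64 filtered datasets is consistent with a full cross of the thresholds listed in the abstract. The minimal sketch below enumerates such a grid, assuming the per-individual and per-population missing-data cutoffs were varied independently (4 x 4 x 4 = 64). That assumption is inferred from the arithmetic rather than stated in the abstract, and the names `MAF_THRESHOLDS`, `MAX_MISSING`, and `filtering_grid` are hypothetical.

```python
from itertools import product

# Threshold values taken from the abstract, expressed as proportions.
MAF_THRESHOLDS = [0.05, 0.03, 0.01, 0.0]   # 0.0 means no MAF filter
MAX_MISSING = [0.25, 0.50, 0.75, 1.00]     # 1.00 means no missing-data filter


def filtering_grid():
    """Enumerate every filter combination.

    Assumption: per-individual and per-population missing-data cutoffs
    are crossed independently with the MAF thresholds.
    """
    return [
        {"maf": maf, "max_missing_ind": ind, "max_missing_pop": pop}
        for maf, ind, pop in product(MAF_THRESHOLDS, MAX_MISSING, MAX_MISSING)
    ]


grid = filtering_grid()
print(len(grid))  # 64
```

Each dictionary in `grid` describes one candidate dataset. How these settings map onto the archive names in the usage notes is not specified here.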

Usage notes

.
├── depthPlot.tar.gz
├── divergenceDating
│   ├── inputFiles.tar.gz
│   ├── outputFiles.tar.gz
│   └── run_dating_rooted_fixedDates_iqtree.sh
├── speciesDelim
│   ├── BFD
│   │   ├── Rscripts
│   │   │   └── mle2bfd.Rmd
│   │   ├── inputFiles
│   │   │   ├── DIVEIN.tar.gz
│   │   │   └── bfd_xml_input.tar.gz
│   │   └── outputFiles
│   │       └── bfd_output.tar.gz
│   ├── Robjects
│   │   └── Robjects_missData_maf.tar.gz
│   ├── barplots
│   │   └── barplots_all.tar.gz
│   ├── cmds_isomds_tsne
│   │   ├── alignments
│   │   │   └── alignments_R.tar.gz
│   │   └── output
│   │       └── raw_txt_output
│   │           └── missData_maf_output_R.tar.gz
│   ├── delimitR
│   │   ├── Rscripts
│   │   │   ├── estimate_nm.R
│   │   │   ├── run_delimitR_linux.R
│   │   │   ├── run_delimitR_linux_2.R
│   │   │   ├── run_delimitR_linux_2_rf5K.R
│   │   │   └── writeTables_dr.R
│   │   ├── dr.pbs
│   │   ├── inputFiles
│   │   │   ├── alignments
│   │   │   │   └── delimitR_input.recode.vcf
│   │   │   ├── jSFS
│   │   │   │   └── delimitR_input_MSFS.obs
│   │   │   ├── popmap
│   │   │   │   └── popmap_final_delimitR_sorted.txt
│   │   │   └── traits
│   │   │       └── traits_dr_final.txt
│   │   └── outputFiles
│   │       ├── Prior
│   │       │   └── binned.tar.gz
│   │       ├── Robjects
│   │       │   ├── myRF_object_svdtree_5K.rds
│   │       │   ├── myobserved_object_svdtree_5K.rds
│   │       │   └── prediction_object_svdtree_5K.rds
│   │       └── tables
│   │           ├── myRF_errorRates_5K.csv
│   │           └── prediction_votes_5K.csv
│   └── vae
│       ├── alignments
│       │   └── alignments_vae.tar.gz
│       └── output
│           └── raw_txt_output
│               └── missData_maf_output_vae.tar.gz
├── speciesTrees
│   ├── iqtree
│   │   ├── alignments
│   │   │   └── BOX_concat_goodOnly_N214.nex
│   │   ├── constraintTrees.tar.gz
│   │   ├── partitionTree.tar.gz
│   │   └── sCF
│   │       ├── inputFiles.tar.gz
│   │       ├── outputFiles.tar.gz
│   │       └── run_scf.sh
│   ├── pomo
│   │   ├── alignments
│   │   │   ├── BOX_pomo_FINAL.counts
│   │   │   └── BOX_pomo_FINAL.fasta
│   │   └── outputFiles.tar.gz
│   └── svdquartets
│       ├── alignments
│       │   └── BOX_svdq_run3_FINAL_filt.nex
│       └── outputFiles.tar.gz
└── treemix
    ├── inputFiles.tar.gz
    └── outputFiles.tar.gz

36 directories, 43 files
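
Most of the files listed above are gzip-compressed tar archives. The sketch below shows one possible way to unpack all of them in place after downloading the dataset, using only the Python standard library. The helper name `extract_archives` and the directory name in the usage comment are hypothetical, and the sketch assumes the download preserves the directory layout shown in the tree.

```python
import tarfile
from pathlib import Path


def extract_archives(root):
    """Unpack every *.tar.gz under `root` next to the archive itself.

    Returns the list of archives that were extracted.
    """
    root = Path(root)
    # Use the safer "data" extraction filter where this Python supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    extracted = []
    for archive in sorted(root.rglob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=archive.parent, **kwargs)
        extracted.append(archive)
    return extracted


# Example usage (hypothetical download directory):
# for archive in extract_archives("dryad_download"):
#     print("unpacked", archive)
```

Extracting each archive into its own parent directory keeps the unpacked files under the same subdirectory as the archive (for example `treemix/`). The contents of the individual archives are not documented here, so their internal layout may differ.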