README.txt file was updated on 8-August-2022 by Brian T. Smith

Supplementary figures, tables and data files from: Phylogenomic analysis of the parrots of the world distinguishes artifactual from biological sources of gene tree discordance

Authors:

Brian Tilston Smith
Department of Ornithology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024, USA

Jon Merwin
Department of Ornithology, Academy of Natural Sciences of Drexel University, 1900 Benjamin Franklin Parkway, Philadelphia, PA 19103, USA
Department of Biodiversity, Earth, and Environmental Science, Drexel University, 3245 Chestnut St, Philadelphia, PA 19104, USA

Kaiya L. Provost
Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, 318 W. 12th Avenue, Columbus, OH 43210, USA

Gregory Thom
Museum of Natural Science, Louisiana State University, 119 Foster Hall, Baton Rouge, LA 70803, USA
Department of Biological Sciences, Louisiana State University, 202 Life Science Bldg, Baton Rouge, LA 70803, USA

Robb T. Brumfield
Museum of Natural Science, Louisiana State University, 119 Foster Hall, Baton Rouge, LA 70803, USA
Department of Biological Sciences, Louisiana State University, 202 Life Science Bldg, Baton Rouge, LA 70803, USA

Mateus Ferreira
Centro de Estudos da Biodiversidade, Universidade Federal de Roraima, Av. Cap. Ene Garcez, 2413, Boa Vista, RR, Brazil

William M. Mauck III
Department of Ornithology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024, USA

Robert G. Moyle
Department of Ecology and Evolutionary Biology and Biodiversity Institute, University of Kansas, 1345 Jayhawk Blvd., Lawrence, KS 66045, USA

Timothy F. Wright
Department of Biology, MSC 3AF, New Mexico State University, Las Cruces, NM, 88003, USA

Leo Joseph
Australian National Wildlife Collection, National Research Collections Australia, CSIRO, GPO Box 1700, Canberra, ACT, 2601, Australia

DATA & FILE OVERVIEW

Description of dataset

These data were collected to infer phylogenetic relationships among parrots. This submission includes all the raw data and scripts/commands to reproduce the results presented in the study and the supplementary materials cited in the paper.

List of Directories

Name: Supplementary Tables
Description: Supplementary Tables S1-S4 with legends

Name: Supplementary Figures
Description: Supplementary figures 1-27

Name: Commands and scripts
Description: Commands and scripts for analyzing the dataset

Name: full-locus-alignments
Description: FASTA alignments for the full datasets

Name: subclade_data
Description: There are two folders for each subclade. One is labeled []_fasta, which includes the locus alignments for that subclade.
The second one is labeled []_likelihood, which includes all the input and output files for gene-wise likelihood analyses and estimating select summary stats

Name: au_tests
Description: Log files from AU topology tests

Name: newick_trees
Description: Full and filtered concatenated and species tree topologies in Newick format

Name: treePL
Description: Input and output for molecular dating in treePL

Name: Summary Stats
Description: CSV files with summary stats

Name: Tree distances
Description: Distances among trees

Name: KNN Modeling
Description: Input and output files for KNN modeling

Name: Concatenated trees
Description: Input and output files for IQTREE concatenated tree inference for full and filtered datasets

Name: Species trees
Description: Input and output files for ASTRAL species tree inference for full and filtered datasets

List of files and subdirectories within main directories:

Directory: Supplementary Tables

Name: Supplementary Table Legends.pdf
Description: Legends for Tables S1-S4

Name: Supplementary_Table_S1.csv
Description: Table summarizing samples used in this study with locality data and accession numbers.

Name: Supplementary_Table_S2.csv
Description: Table containing the deviation in parsimony informative sites (dPIS) and number of missing loci for each tip in the phylogeny.

Name: Supplementary_Table_S3.csv
Description: Table summarizing output from AU topology tests and gene-wise log-likelihood test for each subclade.

Name: Supplementary_Table_S4.csv
Description: KNN predictive modeling output for each subclade using gene-wise log-likelihood scores (a) and Robinson-Foulds distances (b) as the dependent variable.
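
Illustration (not part of the deposited scripts): the gene-wise log-likelihood columns of Supplementary_Table_S3.csv count genes whose delta log-likelihood score is > 2 or < -2. The Python sketch below shows one way such counts and percentages could be derived from per-gene likelihoods. It assumes delta is defined as the likelihood under the concatenated tree minus the likelihood under the species tree, so that positive values favour the concatenated tree. The function name summarize_gene_support, its input layout, and the example values are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def summarize_gene_support(
    gene_lnl: Iterable[Tuple[float, float]], threshold: float = 2.0
) -> Dict[str, float]:
    """Count genes favouring each topology by delta log-likelihood.

    gene_lnl holds one (concatenated-tree lnL, species-tree lnL) pair per gene.
    """
    n_total = n_ct = n_st = 0
    for lnl_ct, lnl_st in gene_lnl:
        # Assumption: delta = lnL(concatenated) - lnL(species),
        # so a positive delta favours the concatenated tree (CT).
        delta = lnl_ct - lnl_st
        n_total += 1
        if delta > threshold:
            n_ct += 1
        elif delta < -threshold:
            n_st += 1
        # Genes with |delta| <= threshold are counted as supporting neither tree.

    def pct(n: int) -> float:
        # Guard against an empty input.
        return 100.0 * n / n_total if n_total else 0.0

    return {
        "n_genes": n_total,
        "n_supporting_CT": n_ct,
        "n_supporting_ST": n_st,
        "pct_supporting_CT": pct(n_ct),
        "pct_supporting_ST": pct(n_st),
    }


if __name__ == "__main__":
    # Made-up likelihoods for four genes, for illustration only.
    example = [
        (-1000.0, -1005.0),
        (-2000.0, -1996.5),
        (-1500.0, -1500.5),
        (-800.0, -802.0),
    ]
    print(summarize_gene_support(example))
```

Genes whose score falls between -2 and 2 (inclusive) are left uncounted here, which is why the two percentages need not sum to 100.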
Directory: Supplementary Figures

Name: Supplementary figures 1-27
Description: Supplementary figures 1-27 with embedded legends

Directory: Commands and scripts

Name: Parrots_UCE_variant_calling_and_other_methods.txt
Description: Main command file for bioinformatic UCE processing and phylogenetic workflow

Name: 2021-parrot-subclade-KNN-delta-lk.R
Description: R script for running delta likelihood KNN models 100 times.

Name: 2021-parrot-subclade-KNN-RF-dist-st.R
Description: R script for running Robinson-Foulds distance KNN models 100 times.

Name: estimate_PIS.R
Description: R script for estimating deviation in parsimony informative sites

Name: GenesFromSites.pl
Description: Converts site likelihoods to gene likelihoods. Script from: Walker J.F., Brown J.W., Smith S.A. 2018. Analyzing contentious relationships and outlier genes in phylogenomics. Syst. Biol. 67(5):916–924.

Name: RF_distances_filtered_trees.R
Description: R script to estimate Robinson-Foulds distances among the complete and filtered trees

Name: RF_distances_subclade_gene_trees.R
Description: R script to estimate Robinson-Foulds distances for subclades among the concatenated and species/alternative topologies

Name: subclade_summary_stats.R
Description: R script for estimating summary stats for each of the 24 subclades

Name: trim_ambig.py
Description: Python script to remove individuals in each locus alignment with X% missing data

Name: uce_adjusted_fixed2.pl
Description: Perl script to add | UCE_[number] after each sample ID in a FASTA alignment

Name: AlignStatsScriptWindowsPort.R
Description: R script for estimating and aligning summary stats. Modified script from Burbrink, F.T., Grazziotin, F.G., Pyron, R.A., Cundall, D., Donnellan, S., Irish, F., Keogh, J.S., Kraus, F., Murphy, R.W., Noonan, B. and Raxworthy, C.J., 2020. Interrogating genomic-scale data for Squamata (lizards, snakes, and amphisbaenians) shows no support for key traditional morphological relationships.
Systematic Biology, 69(3), 502-520.

Name: KNN_Heatmap_revision.R
Description: R script for plotting the Figure 3 heatmaps

Name: Plot_genera_monophyly_heat_map.R
Description: R script for plotting the heat map of genera monophyly shown in Supplementary Figure S6

Name: Plot_metrics_on_trees_w_taxonomy.R
Description: R script for plotting Supplementary Figure S1.

Name: plot_conflicting_nodes_on_1_concat_tree.R
Description: R script for highlighting conflicting nodes on the concatenated tree. Shown in Figure 2a.

Name: Plot_metric_distributions.R
Description: R script for plotting dPIS and number of missing loci per sample

Name: Plot_tree_differences_rf_heatmap.R
Description: R script for plotting the heat map of Robinson-Foulds distances among full and filtered trees shown in Supplementary Figure S19.

Name: Subclade_speciestree_vs_concatenated-updated.R
Description: R script for extracting subclades from concatenated and species trees.

Name: Au_test_data_summary.R
Description: R script for summarizing AU topology test scores

Name: rename_tips.sh
Description: Script to rename abbreviated tips in trees. To edit a specific tree, change the target tree name in the *.sh file.

Directory: full-locus-alignments

Name: [locus_name].fasta
Description: 3,222 locus alignments in FASTA format produced from PHYLUCE where each locus has a minimum of 75% included.
Alignments are for the full dataset

Directory: subclade_data

Name: []_fasta
Description: There is a []_fasta subdirectory for each subclade that contains its locus alignments

Name: []_likelihood
Description: There is a []_likelihood subdirectory for each subclade that contains the input and output files for gene-wise likelihood analyses and estimating select summary stats

Directory: au_tests

Name: []_AU.log file for each subclade
Description: Log files from AU topology test between the concatenated and species/alternative tree versus gene trees for each subclade

Directory: newick_trees

Name: mafft-fasta-clean-species-1-filtering-75p-iqtree.tre
Description: IQTREE tree estimated from the filtered > 1.31 dPIS concatenated alignment

Name: mafft-fasta-clean-species-5-filtering-75p-iqtree.tre
Description: IQTREE tree estimated from the filtered > 5.30 dPIS concatenated alignment

Name: mafft-fasta-clean-species-7-filtering-75p-iqtree.tre
Description: IQTREE tree estimated from the filtered > 7.43 dPIS concatenated alignment

Name: mafft-fasta-clean-species-170-filtering-75p-iqtree.tre
Description: IQTREE tree estimated from the filtered > 169.8 missing loci concatenated alignment

Name: mafft-fasta-clean-species-280-filtering-75p-iqtree.tre
Description: IQTREE tree estimated from the filtered > 280.4 missing loci concatenated alignment

Name: mafft-fasta-clean-species-359-filtering-75p-iqtree.tre
Description: IQTREE tree estimated from the filtered > 358.6 missing loci concatenated alignment

Name: mafft-nexus-clean-species-full-75p-iqtree.tre
Description: IQTREE tree estimated from the full concatenated alignment

Name: mafft-fasta-clean-species-1-filtering-75p-alrt0.astral.tre
Description: ASTRAL tree estimated from the filtered > 1.31 dPIS dataset

Name: mafft-fasta-clean-species-5-filtering-75p-alrt0.astral.tre
Description: ASTRAL tree estimated from the filtered > 5.30 dPIS dataset

Name: mafft-fasta-clean-species-7-filtering-75p-alrt0.astral.tre
Description: ASTRAL tree estimated from the filtered > 7.43 dPIS dataset

Name: mafft-fasta-clean-species-170-filtering-75p-alrt0.astral.tre
Description: ASTRAL tree estimated from the filtered > 169.8 missing loci dataset

Name: mafft-fasta-clean-species-280-filtering-75p-alrt0.astral.tre
Description: ASTRAL tree estimated from the filtered > 280.4 missing loci dataset

Name: mafft-fasta-clean-species-359-filtering-75p-alrt0.astral.tre
Description: ASTRAL tree estimated from the filtered > 358.6 missing loci dataset

Name: mafft-fasta-clean-all-species-75p-alrt0.astral.tre
Description: ASTRAL tree estimated from the full dataset

Subdirectory Name: astral_hpc
Description: Shell scripts for each dataset to run ASTRAL on an HPC

Subdirectory Name: iqtree_hpc
Description: Shell scripts for each dataset to run IQTREE on an HPC. The full dataset was run on a local machine.

Directory: treePL

Name: 7-july-2022-mcc-common-ancestor-heights-mafft-nexus-clean-75p.charsets-smooth-1.rooted.100.wo.snethlageae.ufboot.range.tre
Description:

Name: all-parrot-5-calibrations-100-bootstraps-wo-snethlageae.conf
Description: treePL configuration file

Name: mafft-nexus-clean-75p.charsets-smooth-1.rooted.100.ufboot.wo.snethlageae.tre
Description: Dated bootstrap trees

Directory: Summary Stats

Name: 1_main_stats_Final.csv
Description: Summary stats for each tip

Name: 1_all_data_PIS.csv
Description: The number of parsimony informative sites per locus for each tip.

Directory: Tree distances

Name: combined.tree.distances.csv
Description: Pairwise matrix of normalized Robinson-Foulds distances among full and filtered concatenated and species trees.

Directory: KNN Modeling

Subdirectory Name: Input

Name: [].csv
Description: CSV file for each subclade with the per locus summary stats for each subclade used for the KNN modeling

Subdirectory Name: Output

Subdirectory: delta-likelihood

Name: combined_deltalk.csv
Description: Delta likelihood KNN model output for each subclade.
Included are mean variable importance and metrics on model performance

Subdirectory: RF-distance-gt-st

Name: combined_sp_rf.csv
Description: Robinson-Foulds distance KNN model output for each subclade. Included are mean variable importance and metrics on model performance

Directory: Concatenated trees

Subdirectory Name: mafft-fasta-clean-species-1-filtering-75p-iqtree
Description: Input and output files for IQTREE analysis on the filtered > 1.31 dPIS concatenated alignment

Subdirectory Name: mafft-fasta-clean-species-5-filtering-75p-iqtree
Description: Input and output files for IQTREE analysis on the filtered > 5.30 dPIS concatenated alignment

Subdirectory Name: mafft-fasta-clean-species-7-filtering-75p-iqtree
Description: Input and output files for IQTREE analysis on the filtered > 7.43 dPIS concatenated alignment

Subdirectory Name: mafft-fasta-clean-species-170-filtering-75p-iqtree
Description: Input and output files for IQTREE analysis on the filtered > 169.8 missing loci concatenated alignment

Subdirectory Name: mafft-fasta-clean-species-280-filtering-75p-iqtree
Description: Input and output files for IQTREE analysis on the filtered > 280.4 missing loci concatenated alignment

Subdirectory Name: mafft-fasta-clean-species-359-filtering-75p-iqtree
Description: Input and output files for IQTREE analysis on the filtered > 358.6 missing loci concatenated alignment

Subdirectory Name: mafft-nexus-clean-species-full-75p-iqtree
Description: Input and output files for IQTREE analysis on the full concatenated alignment

Directory: Species trees

Subdirectory Name: astral_1
Description: Input and output files for ASTRAL analysis on the filtered > 1.31 dPIS dataset

Subdirectory Name: astral_5
Description: Input and output files for ASTRAL analysis on the filtered > 5.30 dPIS dataset

Subdirectory Name: astral_7
Description: Input and output files for ASTRAL analysis on the filtered > 7.43 dPIS dataset

Subdirectory Name: astral_170
Description: Input and output files for ASTRAL analysis on the filtered > 169.8 missing loci dataset

Subdirectory Name: astral_280
Description: Input and output files for ASTRAL analysis on the filtered > 280.4 missing loci dataset

Subdirectory Name: astral_359
Description: Input and output files for ASTRAL analysis on the filtered > 358.6 missing loci dataset

Subdirectory Name: astral_all
Description: Input and output files for ASTRAL analysis on the full dataset

DATA-SPECIFIC INFORMATION FOR: ~/Supplementary_Tables/Supplementary_Table_S1.csv

1. Number of variables: 13

2. Variable List:
SAMN Accession Number: Short Read Archive accession number
Sample Name on Tree: Name of sample on the tree
Organism: Genus and species name
Genus: Genus the sample belongs to
Species: Species epithet of sample
Country: Country where sample was collected. Some samples come from captivity
Locality: Specific locality of samples
Date: When the sample was collected
Lat: Latitude of where the sample was collected
Long: Longitude of where the sample was collected
Sex: Sex of the sample
Institution Code: Code of the institution that provided the sample
Sample Type: Samples were assigned to either historical (skin) or modern material (tissue)

DATA-SPECIFIC INFORMATION FOR: ~/Supplementary_Tables/Supplementary_Table_S2.csv

1. Number of variables: 11

2. Variable List:
Sample.Name.on.Tree: Name of sample on the tree
Sample.Type: Samples were assigned to either historical (museum skin) or modern material (tissue)
Superfamily: Superfamily the sample belongs to
Family: Family the sample belongs to
Subfamily: Subfamily the sample belongs to
Subfamily.Node.ID:
Tribe: Tribe the sample belongs to
Genus: Genus the sample belongs to
Species.Epithet: Species epithet of the sample
dpis: mean deviation in parsimony informative sites for the sample
missing.loci: number of missing loci for the sample

3.
Abbreviations used: dpis: deviation in parsimony informative sites

DATA-SPECIFIC INFORMATION FOR: ~/Supplementary_Tables/Supplementary_Table_S3.csv

1. Number of variables: 9

2. Variable List:
Taxon: Subclade used for topology tests
Total_Genes_AU_Test: The total number of genes for each subclade that were used for an AU test
#_GT_AU.test_CT: The number of gene trees not significantly different (p > 0.05) from the concatenated tree
#_GT_AU.test_ST: The number of gene trees not significantly different (p > 0.05) from the species tree
#_GT_Supporting_CT: The number of genes supporting the concatenated tree; genes with delta log-likelihood score > 2
#_GT_Supporting_ST: The number of genes supporting the species tree; genes with delta log-likelihood score < -2
Total_GT_Gene_Likelihood_analysis: The total number of genes for each subclade that were used for the gene-wise log-likelihood test
%_GT_Supporting_CT: The percentage of genes supporting the concatenated tree; percentage of genes with delta log-likelihood score > 2
%_GT_Supporting_ST: The percentage of genes supporting the species tree; percentage of genes with delta log-likelihood score < -2

3. Abbreviations used: GT: gene tree; CT: concatenated tree; ST: species tree; AU.test: approximately unbiased (AU) test

DATA-SPECIFIC INFORMATION FOR: ~/Supplementary_Tables/Supplementary_Table_S4.csv

1. Number of variables: 13

2. Number of cases/rows: 2
A: delta glk KNN Predictive Modeling
B: RF Distance KNN Predictive Modeling

3. Variable List:
Taxon: Subclade used for KNN modeling
mean_Alignment_length: importance of the variable alignment length over 100 bootstrapped KNN models
mean_Missing_percent: importance of the variable percentage of missing data over 100 bootstrapped KNN models
mean_Parsimony_informative_sites: importance of the variable parsimony informative sites over 100 bootstrapped KNN models
mean_GC_content: importance of the variable GC content over 100 bootstrapped KNN models
mean_Freq.Gaps: importance of the variable frequency of gaps over 100 bootstrapped KNN models
mean_segSites.w.Gaps: importance of the variable segregating sites with gaps over 100 bootstrapped KNN models
mean_Number.of.Sequences: importance of the variable characterizing sample size of an alignment over 100 bootstrapped KNN models
mean_scaffolds: importance of the variable chromosome position of a locus over 100 bootstrapped KNN models
mean_r2: the mean coefficient of determination over 100 bootstrapped KNN models
sd_r2: the standard deviation of the coefficient of determination over 100 bootstrapped KNN models
mean_RMSE: the mean root-mean-square-error over 100 bootstrapped KNN models
sd_RMSE: the standard deviation of root-mean-square-error over 100 bootstrapped KNN models

4. Abbreviations used: mean_Freq.Gaps: mean frequency of gaps; mean_segSites.w.Gaps: mean number of segregating sites with gaps; mean_r2: mean coefficient of determination; sd_r2: the standard deviation of the coefficient of determination; mean_RMSE: the mean root-mean-square-error; sd_RMSE: the standard deviation of root-mean-square-error

DATA-SPECIFIC INFORMATION FOR DIRECTORY: au_tests

1. Number of files: 24

2. Description: Log files from IQTREE2 for AU-test

3. Number of Variables: 6

4.
Variables:
TreeID: Tree ID for the concatenated (1) and species (2) tree
AU: p-value for AU test
RSS: residual sum of squares
d: See Shimodaira (2002) for a description
c: See Shimodaira (2002) for a description
locus: Each gene tree topology tested

DATA-SPECIFIC INFORMATION FOR FILE: ~/tree_distances/combined.tree.distances.csv

1. Number of files: 1

2. Description: Pairwise matrix of normalized Robinson-Foulds distances among full and filtered concatenated and species trees

3. Number of Variables: 14

4. Variables:
CT.Full: full concatenated tree
CT.7: IQTREE tree estimated from the filtered > 7.43 dPIS concatenated alignment
CT.5: IQTREE tree estimated from the filtered > 5.30 dPIS concatenated alignment
CT.1: IQTREE tree estimated from the filtered > 1.31 dPIS concatenated alignment
CT.359: IQTREE tree estimated from the filtered > 358.6 missing loci concatenated alignment
CT.280: IQTREE tree estimated from the filtered > 280.4 missing loci concatenated alignment
CT.170: IQTREE tree estimated from the filtered > 169.8 missing loci concatenated alignment
ST-Full: ASTRAL tree estimated from the full dataset
ST-7: ASTRAL tree estimated from the filtered > 7.43 dPIS dataset
ST-5: ASTRAL tree estimated from the filtered > 5.30 dPIS dataset
ST-1: ASTRAL tree estimated from the filtered > 1.31 dPIS dataset
ST-359: ASTRAL tree estimated from the filtered > 358.6 missing loci dataset
ST-280: ASTRAL tree estimated from the filtered > 280.4 missing loci dataset
ST-170: ASTRAL tree estimated from the filtered > 169.8 missing loci dataset

5.
Abbreviations used:
CT.Full: full concatenated tree
CT.7: concatenated tree estimated from the filtered > 7.43 dPIS concatenated alignment
CT.5: concatenated tree estimated from the filtered > 5.30 dPIS concatenated alignment
CT.1: concatenated tree estimated from the filtered > 1.31 dPIS concatenated alignment
CT.359: concatenated tree estimated from the filtered > 358.6 missing loci concatenated alignment
CT.280: concatenated tree estimated from the filtered > 280.4 missing loci concatenated alignment
CT.170: concatenated tree estimated from the filtered > 169.8 missing loci concatenated alignment
ST-Full: species tree estimated from the full dataset
ST-7: species tree estimated from the filtered > 7.43 dPIS dataset
ST-5: species tree estimated from the filtered > 5.30 dPIS dataset
ST-1: species tree estimated from the filtered > 1.31 dPIS dataset
ST-359: species tree estimated from the filtered > 358.6 missing loci dataset
ST-280: species tree estimated from the filtered > 280.4 missing loci dataset
ST-170: species tree estimated from the filtered > 169.8 missing loci dataset

DATA-SPECIFIC INFORMATION FOR DIRECTORY: ~/knn_modeling/input

1. Number of files: 24

2. Description: CSV file for each subclade with the per locus summary stats for each subclade used for the KNN modeling

3.
Number of Variables: 65
filName: file name
taxName: subclade name
Number of Sequences: number of sequences in a locus alignment
Number of Site: number of sites in a locus alignment
BaseFreq_a: Frequency of A bases
BaseFreq_c: Frequency of C bases
BaseFreq_g: Frequency of G bases
BaseFreq_t: Frequency of T bases
Freq Gaps: Frequency of gaps
segSites w Gaps: Segregating sites with gaps
Num Sites 1 or more Subst: Number of sites with 1 or more substitutions
freq_sites_w_1_obs_bases: frequency of sites with 1 observed base
freq_sites_w_2_obs_bases: frequency of sites with 2 observed bases
freq_sites_w_3_obs_bases: frequency of sites with 3 observed bases
freq_sites_w_4_obs_bases: frequency of sites with 4 observed bases
concat: IQTREE gene likelihood for each locus assuming the concatenated tree topology
species: IQTREE gene likelihood for each locus assuming the species tree topology
No_of_taxa: Number of taxa
Alignment_length: Locus length
Total_matrix_cells: Total number of characters in the alignment
Undetermined_characters: Total number of undetermined characters
Missing_percent: Percentage of missing data
No_variable_sites: Number of variable sites
Proportion_variable_sites: Proportion of variable sites
Parsimony_informative_sites: Parsimony informative sites
Proportion_parsimony_informative: Proportion of parsimony informative sites
AT_content: Percentage of sites that are A or T
GC_content: Percentage of sites that are G or C
A: Number of A characters
C: Number of C characters
G: Number of G characters
T: Number of T characters
K: Number of K characters
M: Number of M characters
R: Number of R characters
Y: Number of Y characters
S: Number of S characters
W: Number of W characters
B: Number of B characters
V: Number of V characters
H: Number of H characters
D: Number of D characters
X: Number of X characters
N: Number of N characters
O: Number of O characters
X.: Number of - characters
X..1: Number of ? characters
X.x.x: Sample Order
ctgtRF: Robinson-Foulds distance between the concatenated and gene tree
V1.x.x: Data type
X.y.x: Sample Order
stftRF: Robinson-Foulds distance between the species and gene tree
V1.y.x: Data type
X.x.y: Sample Order
ctgtTD: Tree distance between the concatenated and gene tree
V1.x.y: Data type
X.y.y: Sample Order
stftTD: Tree distance between the species and gene tree
V1.y.y: Data type
tr1: Gene-wise likelihood score for locus X for the concatenated topology
tr2: Gene-wise likelihood score for locus X for the species tree topology
scaffold: Chromosome location of locus
start: Beginning of the locus position in the concatenated alignment
end: End of the locus position in the concatenated alignment
probenumb: UCE probe number for mapping UCEs to genome

DATA-SPECIFIC INFORMATION FOR FILE: ~/knn_modeling/output/delta-likelihood/combined_deltalk.csv

1. Number of files: 1

2. Description: Delta likelihood KNN model output for each subclade. Included are mean variable importance and metrics on model performance

3. Number of Variables: 13

4.
Variable List:
Taxon: Subclade used for KNN modeling
mean_Alignment_length: importance of the variable alignment length over 100 bootstrapped KNN models
mean_Missing_percent: importance of the variable percentage of missing data over 100 bootstrapped KNN models
mean_Parsimony_informative_sites: importance of the variable parsimony informative sites over 100 bootstrapped KNN models
mean_GC_content: importance of the variable GC content over 100 bootstrapped KNN models
mean_Freq.Gaps: importance of the variable frequency of gaps over 100 bootstrapped KNN models
mean_segSites.w.Gaps: importance of the variable segregating sites with gaps over 100 bootstrapped KNN models
mean_Number.of.Sequences: importance of the variable characterizing sample size of an alignment over 100 bootstrapped KNN models
mean_scaffolds: importance of the variable chromosome position of a locus over 100 bootstrapped KNN models
mean_r2: the mean coefficient of determination over 100 bootstrapped KNN models
sd_r2: the standard deviation of the coefficient of determination over 100 bootstrapped KNN models
mean_RMSE: the mean root-mean-square-error over 100 bootstrapped KNN models
sd_RMSE: the standard deviation of root-mean-square-error over 100 bootstrapped KNN models

5. Abbreviations used: mean_Freq.Gaps: mean frequency of gaps; mean_segSites.w.Gaps: mean number of segregating sites with gaps; mean_r2: mean coefficient of determination; sd_r2: the standard deviation of the coefficient of determination; mean_RMSE: the mean root-mean-square-error; sd_RMSE: the standard deviation of root-mean-square-error

DATA-SPECIFIC INFORMATION FOR FILE: ~/knn_modeling/output/RF-distance-gt-st/combined_sp_rf.csv

1. Number of files: 1

2. Description: Robinson-Foulds distance KNN model output for each subclade. Included are mean variable importance and metrics on model performance

3. Number of Variables: 13

4.
Variable List:
Taxon: Subclade used for KNN modeling
mean_Alignment_length: importance of the variable alignment length over 100 bootstrapped KNN models
mean_Missing_percent: importance of the variable percentage of missing data over 100 bootstrapped KNN models
mean_Parsimony_informative_sites: importance of the variable parsimony informative sites over 100 bootstrapped KNN models
mean_GC_content: importance of the variable GC content over 100 bootstrapped KNN models
mean_Freq.Gaps: importance of the variable frequency of gaps over 100 bootstrapped KNN models
mean_segSites.w.Gaps: importance of the variable segregating sites with gaps over 100 bootstrapped KNN models
mean_Number.of.Sequences: importance of the variable characterizing sample size of an alignment over 100 bootstrapped KNN models
mean_scaffolds: importance of the variable chromosome position of a locus over 100 bootstrapped KNN models
mean_r2: the mean coefficient of determination over 100 bootstrapped KNN models
sd_r2: the standard deviation of the coefficient of determination over 100 bootstrapped KNN models
mean_RMSE: the mean root-mean-square-error over 100 bootstrapped KNN models
sd_RMSE: the standard deviation of root-mean-square-error over 100 bootstrapped KNN models

5. Abbreviations used: mean_Freq.Gaps: mean frequency of gaps; mean_segSites.w.Gaps: mean number of segregating sites with gaps; mean_r2: mean coefficient of determination; sd_r2: the standard deviation of the coefficient of determination; mean_RMSE: the mean root-mean-square-error; sd_RMSE: the standard deviation of root-mean-square-error
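
Illustration (not part of the deposited scripts): the model-performance columns mean_r2, sd_r2, mean_RMSE and sd_RMSE summarize 100 bootstrapped KNN models per subclade. The Python sketch below shows how per-replicate scores could be collapsed into those four values. It assumes the sample standard deviation (n - 1 denominator), which is what R's sd() computes; this README does not state which estimator was used. The function name summarize_replicates and the example values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_replicates(
    r2: Sequence[float], rmse: Sequence[float]
) -> Dict[str, float]:
    """Collapse per-replicate model scores into mean and standard deviation.

    r2 and rmse hold one value per bootstrapped model, in the same order.
    """
    if len(r2) != len(rmse):
        raise ValueError("r2 and rmse must have one value per replicate")
    if len(r2) < 2:
        # statistics.stdev needs at least two data points.
        raise ValueError("at least two replicates are required")
    return {
        "mean_r2": mean(r2),
        # Assumption: sample standard deviation, as in R's sd().
        "sd_r2": stdev(r2),
        "mean_RMSE": mean(rmse),
        "sd_RMSE": stdev(rmse),
    }


if __name__ == "__main__":
    # Made-up scores for three replicates, for illustration only.
    print(summarize_replicates([0.2, 0.4, 0.6], [1.0, 2.0, 3.0]))
```

With 100 replicates per subclade the two lists would each hold 100 values; the three-value example is only there to keep the sketch short.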