Metadata for tables, supplementary files and figures, data files and scripts for "A two-tier bioinformatic pipeline to develop probes for target capture of nuclear loci with applications in Melastomataceae"

Table 1: Contains two lists of data sources.
Locus Sources - four methods used for selecting loci to sequence.
Genomic Sources for Template Sequences - six sources of template sequences from which probes were developed. Each locus was represented by between one and four template sequences for probe design.

Table 2: Contains summary statistics for select species of Memecylon and Tibouchina. Average, minimum and maximum values are presented when multiple individuals were sequenced per species.
Percent on-target reads - percent of reads that were mapped to template sequences, averaged for each species.
Number of total reads - number of total reads recovered, averaged for each species.
Mean locus length per species (excluding zeros) - an average of total exon sequence lengths for all loci after removing loci for which sequences were not recovered (zero lengths).
Maximum locus length per species - length of the longest locus (based on total exon lengths) for each species.
Number of templates with sequences (min-max) - number of templates, from which probes were designed, that had reads mapped, contigs assembled, and sequences recovered by HybPiper (no minimum length requirement), with minimum and maximum numbers, averaged per species when multiple individuals were sequenced; out of a maximum of 689.
Number of templates with sequences at 50% of reference length (min-max) - number of template sequences, from which probes were designed, that recovered sequences that were at least 50% of the template sequence length; out of a maximum of 689.
Number of loci with potential paralogs - number of loci (out of 384) that were flagged as having potential paralogs by HybPiper, due to the number and similarity of individual contigs; averaged by species when more than one sample was sequenced per species.

Table 3: Contains summary statistics for the population-level sampling of Memecylon, with statistics presented for each individual sampled.
Number of total reads - total reads sequenced.
Percent on-target reads - percent of total reads that mapped to template sequences.
Number of templates with sequences - out of 689, number of templates that had reads mapped, contigs assembled, and sequences recovered by HybPiper.
Number of templates with sequences at 50% of reference length - out of 689, number of templates that recovered sequences with a total exon length of at least 50% of the length of the template sequence.
Number of potential paralogs - number of loci that were flagged as having potential paralogs by HybPiper, due to the number and similarity of individual contigs.

Table 4: Contains summary statistics for target enrichment of Memecylon and Tibouchina, averaged for each clade. Statistics as described above for Table 2 are averaged for each clade rather than for each species. Additionally:
Mean read depth - calculated as (number of reads mapped × read length) / (sum of lengths of target loci).
Mean taxon count per locus (template) - mean number of taxa recovered (with sequences >0 bp for a locus or template).
Mean template count per species (min-max) - number of templates that successfully recovered sequences for each species, averaged for the clade.
Mean number of templates with sequences (min-max) - mean number of templates that recovered sequences, for each individual, averaged for the clade.
Mean total exon length (bp) - average of the sum of exon lengths for each locus (individual exons for a given locus are joined to get total exon length).
Mean total intron length (bp) - average of the sum of the intron lengths for each locus (individual introns for a given locus are joined to get total intron length).
Mean supercontig length (bp) - average of the total supercontig lengths (including exon and intron sequences).

Table 5: Contains sequencing statistics for target enrichment of Memecylon and Tibouchina, grouped by the genomic resource used to design probes. The same statistics are presented for two sample clades: Memecylon and Tibouchina. Five genome sources for probe design are presented: Memecylon genome skims, Miconia GenBank sequences, Angiosperm 353 Rosid sequences, Tibouchina genome skims, and Medinilla and Tetrazygia transcriptomes.
Number of taxa per template - for each genome source, the average number of taxa recovered per template.
Average (min-max) percent identity between templates and target sequences - percent identity between the template reference sequence and the assembled exon sequence, calculated including gaps and trailing ends of sequences.
Percent reads on-target - calculated separately for loci designed using MarkerMiner and for loci from Angiosperm 353.
NA - sequences/categories for which data were unavailable or not calculated.

Table 6: Variation statistics for a subset of loci sequenced for Memecylon. Samples are divided into two datasets: those for phylogenetic analysis at the interspecific level, and those for population-level analysis. Variation statistics are calculated separately for exon sequences, intron sequences, and supercontigs (both exons and introns), and separately for loci developed using MarkerMiner, for those from Angiosperm 353, and for all (subset) loci together.
Number of genes - number of genes included in the analysis, selected based on sample completeness, alignment quality, and locus length.
Alignment length - total length of alignment.
Parsimony-informative sites (%) - number and percent of sites (out of alignment length) that are parsimony-informative, as determined by IQ-TREE.
Constant sites - invariant sites as determined by IQ-TREE.
Missing data (%) - percent of alignment represented by missing data.

Appendix 1: Voucher information for samples sequenced using this probe set and NCBI SRA accessions for cleaned reads.

Appendix 2: Full version of Table 2 for all species of Memecylon and Tibouchina sequenced.

Supplementary Material S1: Detailed DNA extraction protocol for Melastomataceae including a pre-wash step.

Supplementary Table S2: Excel sheet of full locus and template statistics for the two clades. Both locus- and template-specific statistics are presented. For each locus, we present the locus source (method of selecting locus), the average total exon length for the locus when zero-length sequences are removed, the maximum length for the locus, and the number of taxa recovered. These statistics are presented for both Memecylon and Tibouchina. For each template, we present the template length and, for each clade, the number of taxa recovered, the average and min-max lengths of total exon sequences, total intron sequences, and supercontigs, as well as the average percent identity between the template sequence and the exon sequence, and the percent of template sequence recovered (based on exon sequence length). Loci recovered from more than 50% of species from each clade are highlighted in blue. Templates that captured sequences more than 500 bp long for both clades are highlighted in green. NAs represent zero lengths (locus-based statistics) or no data/not calculated (template-based statistics).

Supplementary Figure S1: Comparison of amino acid and nucleotide sequence assembly success. Assembly was conducted for each set of template sequences using the cleaned reads recovered using the probe set designed from the nucleotide-based templates.
Amino acid sequences are from the SWGX (Tetrazygia) and WWQZ (Medinilla) sequences retrieved from the 1KP capstone paper alignments (https://doi.org/10.1038/s41586-019-1693-2). WWQZ and SWGX nucleotide sequences and rosid nucleotide sequences are from the Angiosperm 353 probe GitHub page (https://github.com/mossmatters/Angiosperms353). Memecylon and Tibouchina nucleotide sequences are from genome skimming data. The y-axis is the log(sequence length). Numbers above boxplots are the number of sequences represented in each boxplot (across all loci and all samples).

Supplementary Figure S2: Comparison of number of template sequences per locus and the log(read depth). Read depth was calculated using the HybPiper script depth_calculator.py and all template sequences together.
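
Worked example for the table statistics: the sketch below illustrates three of the calculations described above, namely the mean read depth formula given for Table 4, the "mean locus length (excluding zeros)" of Table 2, and the percent on-target reads. It is only an illustration under stated assumptions and is not the script used to produce the tables, nor the HybPiper depth_calculator.py script. The function names and the example numbers are hypothetical, and a single uniform read length is assumed.

```python
from statistics import mean


def mean_read_depth(reads_mapped, read_length, target_lengths):
    """(number of reads mapped x read length) / (sum of lengths of target loci).

    Assumes every read has the same length (an assumption of this sketch).
    """
    total_target = sum(target_lengths)
    if total_target <= 0:
        raise ValueError("sum of target locus lengths must be positive")
    return reads_mapped * read_length / total_target


def mean_nonzero_length(lengths):
    """Mean of total exon lengths after dropping loci with zero length.

    Returns None when no locus was recovered (reported as NA in the tables).
    """
    recovered = [n for n in lengths if n > 0]
    return mean(recovered) if recovered else None


def percent_on_target(reads_mapped, total_reads):
    """Percent of total reads that mapped to template sequences."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100 * reads_mapped / total_reads


# Hypothetical values, for illustration only (not taken from the dataset).
depth = mean_read_depth(200_000, 150, [1200, 800, 1000])
locus_mean = mean_nonzero_length([900, 0, 1100, 0])
on_target = percent_on_target(250_000, 1_000_000)
print(depth, locus_mean, on_target)
```

Dropping zero-length loci before averaging, as in mean_nonzero_length, keeps unrecovered loci from pulling the mean length toward zero; recovery success is reported separately through the template and taxon counts.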