A rapid shift from traditional Sanger sequencing-based molecular methods to phylogenomic approaches using large numbers of loci is underway. Among phylogenomic methods, RAD (restriction site associated DNA) sequencing approaches have gained much attention because they enable rapid generation of up to thousands of loci randomly scattered across the genome and are suitable for non-model species. RAD data sets, however, suffer from large amounts of missing data and from rapid locus dropout as relatedness among taxa decreases. The relationship between locus dropout and the amount of phylogenetic information retained in the data has remained largely uninvestigated. Similarly, phylogenetic hypotheses based on RAD have rarely been compared with those based on multilocus Sanger sequencing, and even more rarely using exactly the same species and specimens. We compared the Sanger-based phylogenetic hypothesis (8 loci; 6,172 bp) of 32 species of the diverse moth genus Eupithecia (Lepidoptera, Geometridae) to that based on double-digest RAD sequencing (3,256 loci; 726,658 bp). The topologies were largely congruent, with some notable exceptions that we discuss. The locus dropout effect was strong. We demonstrate that the number of loci is not a precise measure of phylogenetic information, since the number of single-nucleotide polymorphisms (SNPs) may remain low at very shallow phylogenetic levels despite large numbers of loci. As we hypothesized, the numbers of SNPs and parsimony-informative SNPs (PIS) are low at shallow phylogenetic levels, peak at intermediate levels and thereafter decline again at the deepest levels as a result of the decay of available loci. Similarly, we demonstrate with empirical data that locus dropout affects the type of loci retained: loci found in many species tend to show lower interspecific distances than those shared among fewer species.
We also examined the effects of the numbers of loci, SNPs and PIS on nodal bootstrap support, but could not demonstrate with our data the positive correlation we expected. We conclude that RAD methods provide a powerful tool for phylogenomics at an intermediate phylogenetic level, as indicated by their broad congruence with an eight-gene Sanger data set in a genus of moths. When assessing the quality of the data for phylogenetic inference, the focus should be on the distribution and number of SNPs and PIS rather than on loci.
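The distinction the abstract draws between variable sites (SNPs) and parsimony-informative SNPs (PIS) can be illustrated with a minimal Python sketch. It is not taken from the deposited R scripts: the function name `classify_sites`, the set of missing-data codes and the toy alignment are assumptions made for illustration only. A site is counted as variable if at least two states are observed, and as parsimony-informative if at least two states each occur in at least two sequences.

```python
from collections import Counter

# Assumed codes for missing data and gaps; adjust to the alignment at hand.
MISSING = {"N", "-", "?"}


def classify_sites(alignment):
    """Return (variable sites, parsimony-informative sites) for one aligned locus.

    `alignment` maps a sample name to its aligned sequence (hypothetical layout).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")

    n_variable = 0
    n_informative = 0
    for column in zip(*alignment.values()):
        # Count observed states at this site, ignoring missing data.
        counts = Counter(b.upper() for b in column if b.upper() not in MISSING)
        if len(counts) < 2:
            continue  # constant (or entirely missing) site
        n_variable += 1
        # Informative: at least two states, each seen in at least two sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n_informative += 1
    return n_variable, n_informative


# Toy locus with invented sequences.
toy_locus = {
    "sp_A": "ACGTAC",
    "sp_B": "ACGTAT",
    "sp_C": "ATGTNT",
    "sp_D": "ATGAAC",
}
print(classify_sites(toy_locus))
```

In the toy locus, a site where only one sample differs is variable but not informative, which is why a data set can contain many loci and SNPs yet little signal for resolving relationships.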

FIGURE S1. Ultrametric trees

Ultrametric trees based on a) ddRAD-c80m6, b) ddRAD-c85m6, c) ddRAD-c90m6, d) combined nuclear, and e) mitochondrial COI data.

Fig.S1. ultrametric trees.pdf

FIGURE S2. Non-linear regression (through the origin) of bootstrap value on branch length

Non-linear regression (through the origin) of bootstrap value on branch length. The observations were weighted by the number of SNPs (a) and loci (b). See text for further details. The weights are indicated by the darkness of the points, darker color meaning higher weight.

Fig.S2. bootstrap_branch length.pdf

FIGURE S3. Phylogenetic trees inferred from the ddRAD data matrices

Phylogenetic trees inferred from the ddRAD data matrices built with different clustering thresholds (c) and minimum numbers of individuals per locus (m), and from mitochondrial COI. Trees were inferred with RAxML with 500 bootstrap replicates. Bootstrap values are indicated near branches. Scale bars indicate substitutions per site.

Fig.S3. individual ddRAD and COI phylogenies.pdf

FIGURE S4. Box plot showing pairwise distance between species

Box plot showing pairwise distance between species for data matrices built with different values of the minimum number of individuals per locus (m) and the clustering threshold (c). Each box represents the interquartile range of values and is split at the median.

Fig.S4. box plots of p-distance (m, c value).pdf
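For readers unfamiliar with the p-distance named in the file above, the following Python sketch shows one common way to compute it: the proportion of differing sites among the sites where both sequences have data. The helper names and the treatment of missing data (pairwise deletion) are assumptions, not a description of how the deposited figure was produced.

```python
from itertools import combinations

# Assumed codes for missing data and gaps.
MISSING = {"N", "-", "?"}


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among sites scored in both sequences.

    Returns None when the two sequences share no scored site.
    """
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue  # pairwise deletion of missing data (an assumption)
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else None


def pairwise_p_distances(alignment):
    """Map each unordered pair of sample names to its p-distance."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }
```

With many missing sites, two samples may share few or no scored positions, so the sketch returns `None` rather than a misleading zero in that case.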

FIGURE S5. Venn diagrams show the number of shared loci among species

Venn diagrams showing the number of loci shared among species. The number in each field is the number of loci shared by the corresponding species.

Fig.S5. venn diagram of shared loci among species.pdf

FIGURE S6. Proportion of loci shared among individuals of Eupithecia

Proportion of loci shared among individuals of Eupithecia in the ddRAD-c85m6 data matrix. Black-filled circles represent the proportion of the total number of loci shared among individuals. Red-filled circles represent the proportion of the total number of loci present in each individual. Circle scale shows the number of loci represented by 1.0 and 0.5 circle sizes. Black vertical bars represent the average proportion of loci shared by each individual.

Fig.S6. shared loci across samples.pdf

FIGURE S7. Phylogenetic trees of Eupithecia based on the data excluded six most poor-quality samples

Phylogenetic trees of Eupithecia based on (a) the ddRAD-c85m6 data set and (b) the ADO (allelic dropout) data set, from which poor-quality samples that recovered fewer than 200 loci were eliminated. The removed samples are coloured grey in the tree. The trees were inferred with RAxML with 500 bootstrap replicates. Bootstrap values are indicated near branches.

Fig.S7. ADO (drop out bad quality samples).pdf

FIGURE S8. ML tree inferred from the reference assembly data matrix

ML tree inferred from the reference assembly data matrix. The phylogenetic tree was inferred with RAxML with 500 bootstrap replicates. Bootstrap values are indicated near branches. Scale bars indicate substitutions per site.

Fig.S8. rad_reference_assembly.pdf

Supplementary materials_Tables_S1

Collection information for the specimens used in this study

Supplementary materials_Tables_S2

Length in base pairs of the sequence data collected from one mitochondrial and seven nuclear regions for each analyzed individual

FIGURE S9. Number of loci recovered at different categories defined by the number of individuals sharing a locus

Number of loci recovered at different categories defined by the number of individuals sharing a locus

Fig.S9. loci vs. inds.pdf

FIGURE S10. Plot SNP missing data heatmap plot based on ddRAD data

Heatmap of missing SNP data based on the ddRAD data. The plot shows the level of missing data. Colors represent the degree of missing data; completely missing data are shown in opaque color. Missing data were calculated by locus, giving a percentage of missing data for each cell.

Fig.S10. SNP heatmap.pdf
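The per-cell percentages behind such a heatmap can be sketched in Python as follows. This is an illustrative assumption, not the function used to draw the deposited figure: it assumes one cell per sample and locus, a nested-dictionary input layout, and that a locus absent from a sample counts as 100% missing.

```python
# Assumed codes for missing SNP calls.
MISSING = {"N", "-", "?"}


def missing_percentages(snp_calls):
    """Percentage of missing SNP calls per sample and locus.

    `snp_calls` has the hypothetical layout {sample: {locus: "string of calls"}}.
    Returns {sample: {locus: percent missing}} covering every locus seen in any sample.
    """
    loci = sorted({locus for calls in snp_calls.values() for locus in calls})
    table = {}
    for sample, calls in snp_calls.items():
        row = {}
        for locus in loci:
            genotype = calls.get(locus, "")
            if not genotype:
                # Locus not recovered in this sample (locus dropout).
                row[locus] = 100.0
            else:
                n_missing = sum(1 for b in genotype.upper() if b in MISSING)
                row[locus] = 100.0 * n_missing / len(genotype)
        table[sample] = row
    return table


# Toy input with invented sample and locus names.
toy_calls = {
    "s1": {"L1": "ACN-", "L2": "AC"},
    "s2": {"L1": "ACGT"},
}
toy_table = missing_percentages(toy_calls)
```

The resulting table can be passed to any plotting library, with rows as samples and columns as loci.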

Supplementary materials_Tables_S3

Basic statistics for the de novo assembly based on the ddRAD-c85m6 data matrix and for the reference assembly against 26 Lepidoptera genomes. Samples eliminated because of poorly mapped reads in the reference assembly are marked with an asterisk.

Supplementary materials_Tables_S4

The number of loci recovered for each value of the minimum number of individuals per locus (m).
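How the number of recovered loci depends on m can be illustrated with a short Python sketch. It assumes a hypothetical mapping from each locus to the samples in which it was recovered; the function names and the toy data are invented for illustration and are not part of the deposited files.

```python
from collections import Counter


def loci_per_sharing_class(locus_to_samples):
    """Count loci by the number of individuals that share them."""
    return Counter(len(set(samples)) for samples in locus_to_samples.values())


def loci_retained_by_m(locus_to_samples, max_m=None):
    """Number of loci kept when a locus must be present in at least m individuals."""
    classes = loci_per_sharing_class(locus_to_samples)
    if max_m is None:
        max_m = max(classes, default=0)
    return {
        m: sum(n for shared_by, n in classes.items() if shared_by >= m)
        for m in range(1, max_m + 1)
    }


# Toy data: five loci recovered in different subsets of four samples.
toy_loci = {
    "locus1": ["A", "B", "C", "D"],
    "locus2": ["A", "B"],
    "locus3": ["A", "B", "C"],
    "locus4": ["D"],
    "locus5": ["A", "B"],
}
print(loci_retained_by_m(toy_loci))
```

Raising m trades the number of loci for matrix completeness, so the retained count can only stay the same or fall as m increases.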

Supplementary materials_Tables_S5

Regression coefficients for the number of SNPs/locus in relation to the number of individuals/locus. The coefficients are on the scale of the linear predictor of a generalized linear model with negative binomial error distribution and a logarithmic link function.
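Because the coefficients are reported on the scale of the linear predictor, they need to be exponentiated to be read as counts. The Python sketch below shows this back-transformation for a log link; the function names are hypothetical and the numeric values in the example are invented, not taken from Table S5.

```python
import math


def expected_snps_per_locus(intercept, slope, n_individuals):
    """Expected count under a log link: exp(intercept + slope * n_individuals)."""
    return math.exp(intercept + slope * n_individuals)


def rate_ratio(slope, delta=1):
    """Multiplicative change in the expected count when the predictor rises by delta."""
    return math.exp(slope * delta)


# Invented coefficients, for illustration only.
b0, b1 = 1.2, -0.05
low = expected_snps_per_locus(b0, b1, 10)
high = expected_snps_per_locus(b0, b1, 11)
print(high / low, rate_ratio(b1))
```

A negative slope therefore corresponds to a rate ratio below 1, meaning fewer expected SNPs per locus for each additional individual sharing the locus.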

Supplementary materials_Tables_S6

Coefficients of regression models explaining variation in bootstrap residuals (i.e. the effect of branch length on bootstrap values is removed).

Supplementary materials_Tables_S7

The number and proportion of clusters by sequencing depth (d) in the final data sets (d ≥ 3), together with those of the observed clusters with depth less than 3.

bootstrap_DRYAD_Rscript

R script to analyse variation in the bootstrap residuals with linear models

bootstrap_DRYAD.R

dropout_DRYAD_Rscript

R script to analyse locus and SNP dropout patterns at different phylogenetic depths

dropout_DRYAD.R

locus conservativeness_DRYAD_Rscript

R script to analyse locus conservativeness in relation to SNPs, and the effects of locus and SNP dropout patterns on nodal confidence

locus conservativeness_DRYAD.R

Data from: Information dropout patterns in restriction site associated DNA phylogenomics and a comparison with multilocus Sanger data in a species-rich moth genus
