Data from: A real data-driven simulation strategy to select an imputation method for mixed-type trait data

May, Jacqueline A.; Feng, Zeny; Adamowicz, Sarah J.

Published Feb 15, 2023 on Dryad. https://doi.org/10.5061/dryad.crjdfn37m

Data files

Feb 15, 2023 version files (235.82 KB total):

README.md (4.46 KB)
Squamata_CMOS_Gene_FinalImp.tre (47.43 KB)
Squamata_CMOS_Gene.tre (10.43 KB)
Squamata_COI_Alignment.fasta (142.21 KB)
Squamata_COI_Gene.tre (10.43 KB)
Squamata_Multigene.tre (10.43 KB)
Squamata_RAG1_Gene.tre (10.43 KB)

Abstract

Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly complete information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single-gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performance of each method was evaluated under each missingness mechanism using the mean squared error for numerical traits and the proportion falsely classified for categorical traits. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data, as phylogeny did not improve performance for every trait or in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.
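The simulation loop described in the abstract (remove known values, impute them, score the result) can be illustrated with a minimal Python sketch. This is not the authors' code: the trait names `svl` and `mass`, the toy values, the 20% missingness rate, and the deterministic "remove the largest values" rule used for MAR and MNAR are all assumptions made for illustration, and mean imputation stands in for the candidate methods.

```python
import random

def induce_missing(rows, target, driver, mechanism, rate, rng):
    """Return a copy of rows with `target` set to None in about `rate` of them.

    MCAR: every row is equally likely to lose its value.
    MAR:  rows with the largest values of another, observed trait (`driver`) lose it.
    MNAR: rows with the largest values of the target trait itself lose it.
    The MAR/MNAR rules here are a simplified, deterministic assumption.
    """
    n_remove = round(rate * len(rows))
    if mechanism == "MCAR":
        chosen = rng.sample(range(len(rows)), n_remove)
    else:
        key = driver if mechanism == "MAR" else target
        ranked = sorted(range(len(rows)), key=lambda i: rows[i][key], reverse=True)
        chosen = ranked[:n_remove]
    out = [dict(r) for r in rows]
    for i in chosen:
        out[i][target] = None
    return out

def mean_impute(rows, target):
    """Fill missing values of `target` with the mean of the observed values."""
    observed = [r[target] for r in rows if r[target] is not None]
    fill = sum(observed) / len(observed)
    return [dict(r, **{target: fill}) if r[target] is None else dict(r) for r in rows]

def mean_squared_error(truth, imputed):
    """Error measure for numerical traits."""
    return sum((t - p) ** 2 for t, p in zip(truth, imputed)) / len(truth)

def proportion_falsely_classified(truth, imputed):
    """Error measure for categorical traits: share of wrong labels."""
    return sum(t != p for t, p in zip(truth, imputed)) / len(truth)

# Toy complete-case data (hypothetical values, not taken from the dataset).
complete = [{"svl": float(10 * i), "mass": float(i)} for i in range(1, 11)]
rng = random.Random(1)

for mechanism in ("MCAR", "MAR", "MNAR"):
    holey = induce_missing(complete, "mass", "svl", mechanism, 0.2, rng)
    filled = mean_impute(holey, "mass")
    removed = [i for i, r in enumerate(holey) if r["mass"] is None]
    err = mean_squared_error(
        [complete[i]["mass"] for i in removed],
        [filled[i]["mass"] for i in removed],
    )
    print(mechanism, "MSE on removed values:", round(err, 3))
```

Because the true values are known for every removed entry, the same loop can be repeated for each candidate method and each mechanism, and the method with the lowest error on the complete-case data can then be applied to the original dataset.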

Methods

Cytochrome c oxidase subunit I (COI) sequence records were originally downloaded from the Barcode of Life Data System (BOLD) (Ratnasingham & Hebert, 2007) (sequence data available at dx.doi.org/10.5883/DS-IMPMIX2). Records were retained only if they had been identified to the species level. Additional quality control checks on the sequence data included trimming N and gap content from sequence ends and removing sequences with more than 1% internal N and/or gap content across their entire sequence length. The AlignTranslation function from the R package “DECIPHER” v. 2.18.1 (Wright, 2015, 2020) was used to perform a multiple sequence alignment on the COI sequences.
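The two quality control steps described above (end trimming, then a 1% threshold on internal N/gap content) can be sketched in Python as follows. This is an illustrative approximation rather than the original pipeline, which was run in R. The record names and sequences are invented, and the sketch assumes that the 1% is measured against the length of the sequence after end trimming.

```python
AMBIGUOUS = set("Nn-")

def trim_ends(seq):
    """Strip N and gap characters from both ends of a sequence."""
    return seq.strip("Nn-")

def internal_ambiguity(seq):
    """Fraction of N/gap characters left once the ends are trimmed.

    Assumption: the fraction is taken over the trimmed length.
    """
    trimmed = trim_ends(seq)
    if not trimmed:
        return 1.0  # nothing but N/gap characters
    return sum(ch in AMBIGUOUS for ch in trimmed) / len(trimmed)

def filter_sequences(records, max_fraction=0.01):
    """Keep trimmed sequences whose internal N/gap fraction is at most max_fraction."""
    kept = {}
    for name, seq in records.items():
        trimmed = trim_ends(seq)
        if trimmed and internal_ambiguity(trimmed) <= max_fraction:
            kept[name] = trimmed
    return kept

# Hypothetical records for illustration only.
records = {
    "sp_A": "NN" + "ACGT" * 50 + "--",              # ambiguity only at the ends
    "sp_B": "ACGT" * 25 + "NNNN" + "ACGT" * 25,     # about 2% internal N
    "sp_C": "ACGT" * 25 + "N" + "ACGT" * 25,        # about 0.5% internal N
}

kept = filter_sequences(records)
for name, seq in kept.items():
    print(name, len(seq))
```

Trimming before measuring means that terminal runs of N or gaps, which are common in barcode records, do not count against a sequence; only ambiguity inside the retained region does.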

Phylogenetic trees were built using RAxML v. 8 (Stamatakis, 2014). The model GTRGAMMAI was specified (option -m), and the alignments were partitioned based on codon position (option -q). Nuclear sequence data used for building the c-mos, RAG1 and multigene trees were obtained from a multigene alignment published in Pyron et al. (2013), available at https://doi.org/10.5061/dryad.82h0m. GenBank accession numbers for the sequence records used to build the tree are provided in S2 File.
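As a rough illustration of the -m and -q options mentioned above, the Python sketch below builds a codon-position partition file and a matching RAxML command line for a single in-frame gene. Only the model name GTRGAMMAI and the use of -m and -q come from the text. The executable name `raxmlHPC`, the file names, the run name, the random seed passed to -p, and the alignment length of 657 sites are assumptions, and a multigene alignment would need one set of partitions per gene.

```python
import shlex

def codon_partitions(n_sites, prefix="codon"):
    """Return RAxML partition-file lines for codon positions 1, 2 and 3.

    Assumes the alignment is in frame, i.e. column 1 is a first codon position.
    """
    return [f"DNA, {prefix}{pos} = {pos}-{n_sites}\\3" for pos in (1, 2, 3)]

def raxml_command(alignment, partitions, run_name, seed=12345, exe="raxmlHPC"):
    """Assemble the argument list for a partitioned GTRGAMMAI tree search."""
    return [
        exe,
        "-s", alignment,      # input alignment
        "-n", run_name,       # suffix for output files
        "-m", "GTRGAMMAI",    # substitution model named in the text
        "-q", partitions,     # partition file (codon positions)
        "-p", str(seed),      # parsimony random seed (hypothetical value)
    ]

# Hypothetical alignment length and file names.
lines = codon_partitions(657)
print("\n".join(lines))

cmd = raxml_command("alignment.phy", "partitions.txt", "squamata_coi")
print(shlex.join(cmd))
```

To run the analysis, the partition lines would be written to the file passed to -q and the argument list handed to `subprocess.run`, provided RAxML is installed under the assumed executable name.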

Citations/data sources:

Pyron RA, Burbrink FT, Wiens JJ. A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. BMC Evol Biol. 2013 Apr 29;13(1):93.
Pyron RA, Burbrink FT, Wiens JJ. Data from: A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. Dryad Dataset [Internet]. 2013; Available from: https://doi.org/10.5061/dryad.82h0m
Ratnasingham S, Hebert PDN. bold: The Barcode of Life Data System (http://www.barcodinglife.org). Mol Ecol Notes. 2007 May 1;7(3):355–64.
Data from: Barcode of Life Data System: DS-IMPMIX2: Squamata cytochrome c oxidase subunit I (COI) dataset. [Internet]. 2020. Available from: dx.doi.org/10.5883/DS-IMPMIX2
Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Research. 2020;48:D84–D86.
Stamatakis A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3.
Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics. 2015 Oct 6;16(1):322.
Wright ES. RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency. RNA. 2020 May 1;26(5):531–40.
