Fragmentary gene sequences negatively impact gene tree and species tree reconstruction

Mirarab, Siavash 1 ; Sayyari, Erfan1; Whitfield, James B.2

Published Jun 07, 2023 on Dryad. https://doi.org/10.6076/D14599

Data files

Jun 07, 2023 version files 3.29 GB

insects-biological-species_trees-gene_trees.tar.gz

48.22 MB
insects-biological-species_trees-gene_trees.tar.list.txt

3.71 KB
insects-simulated.tar.gz

3.02 GB
insects-simulated.tar.list.txt

1.10 MB
README.md

2.33 KB
sequences-insects-biological.tar

214.53 MB
sequences-insects-biological.tar.list.txt

3.91 MB

Abstract

Species tree reconstruction from genome-wide data is increasingly being attempted, in most cases using a two-step approach of first estimating individual gene trees and then summarizing them to obtain a species tree. The accuracy of this approach, which promises to account for gene tree discordance, depends on the quality of the inferred gene trees. At the same time, phylogenomic and phylotranscriptomic analyses typically use involved bioinformatics pipelines for data preparation. Errors and shortcomings resulting from these preprocessing steps may impact the species tree analyses at the other end of the pipeline. In this article, we first show that the presence of fragmentary data for some species in a gene alignment, as often seen on real data, can result in substantial deterioration of gene trees, and as a result, the species tree. We then investigate a simple filtering strategy where individual fragmentary sequences are removed from individual genes but the rest of the gene is retained. Both in simulations and by reanalyzing a large insect phylotranscriptomic data set, we show the effectiveness of this simple filtering strategy.

Analysis Pipeline for Insect Data Set

We used the amino acid sequence data provided by Misof et al. (2014) as “supplementary 7.”

Filtering Strategy

Before identifying fragmentary sequences, we first remove extremely gappy sites, defined as those with >90% gaps. We then remove species that have <20% (1/5), 25% (1/4), 33% (1/3), 50% (1/2), 66% (2/3), 75% (3/4), or 80% (4/5) amino-acids (i.e., characters other than gaps). We use a tool called seqtools, implemented as part of the PASTA. Next, we reestimate gene trees. In order to track the occupancy and bootstrap support, we use in-house scripts, available online https://github.com/esayyari/discoVista.

Gene Trees and Species Trees

Gene trees are estimated using FastTree2 (Price et al. 2010) using its default amino acid substitution model JTT (Jones et al. 1992) or RAxML (Stamatakis 2014) with the automatic amino acid model selection. We use RAxML (Stamatakis 2014), version 8.2.9 with ten runs of inference using different starting trees. We used RAxML’s automatic model selection approach. When several species have identical sequences for a gene, we keep only one of them (i.e., remove redundant ones) in our RAxML runs and add the removed species back to the final inferred gene tree as a polytomy.

For performing gene tree bootstrapping using FastTree, we first generate bootstrap sequences using RAxML and then run FastTree on those to estimate the bootstrapped gene trees. We then draw those bootstrap gene trees on ML gene tree branches using the Newick utility (Junier and Zdobnov 2010). For RAxML gene trees, we use the rapid bootstrapping option on reduced sequences (after removing identical sequences). After gene tree estimations, we add back the identical species and draw these bootstrap gene trees on the best ML gene trees (RAxML) following the same procedure using the Newick utility.

We use ASTRAL-II to estimate the species trees summarizing gene trees with at least four taxa left after filtering.

Simulation Procedure

We use one model condition of a previously simulated data set from Mirarab and Warnow (2015) ASTRAL-II paper with 100 ingroup taxa and one outgroup. For each of the 50 replicates in this data set, Simphy (Mallo et al. 2016) was used to simulate a species tree according to the Yule model, and then 1,000 gene trees were simulated using the MSC model which captures ILS. The data set has moderate levels of ILS; the average distance between true gene trees and true species trees is 0.33. We subsampled genes to create three different data sets with 50, 200, or 1,000 genes. DNA sequences of varying lengths were simulated down the gene trees using Indelible (Fletcher and Yang 2009) with GTR parameters and stationary distributions estimated from published biological data sets, as detailed by Mirarab and Warnow (2015). Note that simulated sequences did not include any indels and thus were already aligned. Mirarab and Warnow (2015) suggested removing two replicates that include almost no phylogenetic signal, and we use the same strategy, leaving us with 48 replicates. This creates our unfiltered base data set.

We add fragmentation to our complete simulated data set using a procedure that seeks to emulate patterns of fragmentation in the insect biological data set. 1) For each replicate, we order species in the biological data set and the simulated data set with respect to the tip-to-root distances. 2) We randomly select 100 of the biological species and map them to the simulated species with the same position in the order. The main outgroup (Ixodes scapularis) in the biological and simulated data sets always map to each other. 3) For each replicate in the simulated data set, we randomly sample (with replacement) 1,000 genes in the insect data sets that have at least 101 species, including the main outgroup. 4) For each species in each simulated gene, we compute the portion of gap sites in the corresponding gene alignment for the corresponding species in the biological data and remove the same portion of sites in the simulated data set at random positions. When a species is missing from a gene in the biological data set, we use the same species from another randomly chosen gene.

Filtering Fragments and Gene Trees and Species Trees

We first remove sites with >90% gaps, removing between 0.0% and 2.0% of the total number of characters in all sequences. We then remove from each gene any species that has less than a certain fraction (e.g., 10–80%) of the full gene. For example, at 10%, we remove only sequences that have 90% or more gaps. For each threshold, we then estimate gene trees using both RAxML (Stamatakis 2014) version 8.2.9 with two starting trees and FastTree (Price et al. 2010) version 2.1.9 Double precision using the GTR + Γ model of sequence evolution (Tavaré 1986). We infer the species tree using ASTRAL-II (Mirarab and Warnow 2015) version 4.11.1. We build species trees using all 1,000 genes or using randomly chosen subsets of 200 or 50 genes.