Data from: Assexon: assembling exon using gene capture data
Yuan, Hao; Atta, Calder; Tornabene, Luke; Li, Chenhong (2019), Data from: Assexon: assembling exon using gene capture data, Dryad, Dataset, https://doi.org/10.5061/dryad.b0r170c
Exon capture across species has been one of the most broadly applied approaches to acquire multi-locus data in phylogenomic studies of non-model organisms. Methods for assembling loci from short-read sequences (e.g., Illumina platforms) that rely on mapping reads to a reference genome may not be suitable for studies comprising species across a wide phylogenetic spectrum; thus, de novo assembly methods are more generally applied. Current approaches for assembling targeted exons from short reads are not well optimized, as they cannot (1) assemble loci with low read depth, (2) handle large files efficiently, and (3) reliably address issues with paralogs. Thus, we present Assexon: a streamlined pipeline that de novo assembles targeted exons and their flanking sequences from raw reads. We tested our method using reads from Lepisosteus osseus (4.37 Gb) and Boleophthalmus pectinirostris (2.43 Gb), which were captured using baits designed from the genome sequences of Lepisosteus oculatus and Oreochromis niloticus, respectively. We compared the performance of Assexon to that of PHYLUCE and HybPiper, which are commonly used pipelines for assembling ultra-conserved element (UCE) and Hyb-seq data. A custom exon capture analysis pipeline (CP) developed by Yuan et al. was compared as well. Assexon accurately assembled 3400 to 3800 (20%-28%) more loci than PHYLUCE and 1900 to 2300 (8%-14%) more loci than HybPiper across different levels of phylogenetic divergence. Assexon ran at least twice as fast as PHYLUCE and HybPiper. The number of loci assembled using CP was comparable to that of Assexon in both tests, but Assexon ran at least 7 times faster than CP. In addition, some steps of CP require the user’s interaction and are not fully automated, and this user time was not counted in our calculation. Both Assexon and CP retrieved no paralogs in the testing runs, but PHYLUCE and HybPiper did.
In conclusion, Assexon is a tool for accurate and efficient assembly of large read sets from exon capture experiments. Furthermore, Assexon includes scripts to filter poorly aligned coding regions and flanking regions, calculate summary statistics of loci, and select loci with reliable phylogenetic signal. Assexon is available at https://github.com/yhadevol/Assexon.