Data from: An evaluation of transcriptome-based exon capture for frog phylogenomics across multiple scales of divergence (Class: Amphibia, Order: Anura)

Portik, Daniel M.; Smith, Lydia L.; Bi, Ke

Published May 27, 2016 on Dryad. https://doi.org/10.5061/dryad.pr3pr

Data files

May 27, 2016 version files: 67.42 MB

Abstract

Custom sequence capture experiments are becoming an efficient approach for gathering large sets of orthologous markers in nonmodel organisms. Transcriptome-based exon capture utilizes transcript sequences to design capture probes, typically using a reference genome to identify intron–exon boundaries to exclude shorter exons (<200 bp). Here, we test directly using transcript sequences for probe design, which are often composed of multiple exons of varying lengths. Using 1260 orthologous transcripts, we conducted sequence captures across multiple phylogenetic scales for frogs, including outgroups ~100 Myr divergent from the ingroup. We recovered a large phylogenomic data set consisting of sequence alignments for 1047 of the 1260 transcriptome-based loci (~561 000 bp) and a large quantity of highly variable regions flanking the exons in transcripts (~70 000 bp), the latter improving substantially by only including ingroup species (~797 000 bp). We recovered both shorter (<100 bp) and longer exons (>200 bp), with no major reduction in coverage towards the ends of exons. We observed significant differences in the performance of blocking oligos for target enrichment and nontarget depletion during captures, and differences in PCR duplication rates resulting from the number of individuals pooled for capture reactions. We explicitly tested the effects of phylogenetic distance on capture sensitivity, specificity, and missing data, and provide a baseline estimate of expectations for these metrics based on a priori knowledge of nuclear pairwise differences among samples. We provide recommendations for transcriptome-based exon capture design based on our results, cost estimates and offer multiple pipelines for data assembly and analysis.

Afrixalus paradorsalis annotated transcriptome

Whole RNA was extracted from a portion of a liver sample preserved in RNAlater using the RNeasy Protect Mini Kit (Qiagen). Sequencing libraries were prepared using half reactions of the TruSeq RNA Library Preparation Kit V2 (Illumina), beginning with Poly-A selection for samples with high RIN scores (> 7.0) and Ribo-Zero Magnetic Gold (Epicentre) ribosomal RNA removal for samples with low RIN scores (< 7.0). Libraries were sequenced on an Illumina HiSeq 2500 with 100 bp paired-end reads. Transcriptomic data were cleaned following Singhal (2013). Cleaned data were assembled using TRINITY (Grabherr et al. 2011) and annotated against the Xenopus tropicalis reference genome (Ensembl) using reciprocal BLASTX (Altschul et al. 1997) and EXONERATE (Slater & Birney 2005).

Afr_paradorsalis.fasta
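
The reciprocal BLASTX annotation step can be illustrated with a generic reciprocal-best-hit filter. This is only a sketch of the general idea, not the authors' pipeline: the function names and the (query, subject, bitscore) tuple layout are assumptions made for the example, and the exact hit criteria used for these transcriptomes are not described here.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples (assumed layout).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Keep query/subject pairs that are each other's best hit.

    forward: hits from transcripts searched against reference proteins.
    reverse: hits from reference proteins searched against transcripts.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}
```

A transcript is retained only when its best reference match points back to the same transcript, which removes one-directional matches.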

Hyperolius balfouri annotated transcriptome

Hyp_balfouri.fasta

Hyperolius riggenbachi annotated transcriptome

Hyp_riggenbachi.fasta

Kassina decorata annotated transcriptome

Kass_decorata.fasta

Hyperoliid Orthologous Transcript Set

Marker set consisting of 1,265 orthologous transcripts (trimmed to 500-850 bp) from four species of hyperoliid frogs (5,060 total sequences). We compared annotated transcripts from the four species to search for orthologs via BLAST (Altschul et al. 1990). We removed mitochondrial loci from the transcripts. We kept only transcripts with GC content between 40% and 70%, because extreme GC content reduces capture efficiency for the targets (Bi et al. 2012). Orthologous transcripts with a minimum length of 500 base pairs (bp) were identified across all four samples, resulting in the identification of 2,444 shared transcripts. Transcripts exceeding 850 bp were arbitrarily trimmed to this length for probe design, reflecting a trade-off decision between locus length and the total number of loci included in the experiment. The orthologous transcripts were subjected to additional filtering steps before a final gene set was chosen. The initial filtering step applied upper and lower limits on average transcript divergence, eliminating loci with low variation (< 5.0% average divergence) and exceptionally high variation (> 15.0% average divergence), resulting in the removal of 266 genes. The remaining 2,178 genes were examined for repetitive elements, short repeats, and low complexity regions, which are problematic for probe design and capture. The four sets of transcripts per gene (totaling 8,712 sequences) were screened using the REPEATMASKER Web Server (Smit et al. 2015). This step resulted in the masking of repetitive elements or low complexity regions in 929 sequences, with 7,783 sequences passing the filters. To be conservative, if any of the four transcripts for a gene contained masked sites, that gene was removed from the final marker set, which resulted in the removal of an additional 468 markers.
From this reduced set of 1,710 markers, 400 markers with the highest divergence were selected (average divergence ranging from 10.4% to 14.9%) followed by 860 randomly drawn markers from the remaining subset. This marker set was supplemented with five positive controls, which consisted of nuclear sequence data generated using Sanger sequencing for five loci: POMC (624 bp), RAG-1 (777 bp), TYR (573 bp), FICD (524 bp), and KIAA2013 (540 bp). The final marker set selected for probe design included 1,265 genes from four species and 5,060 individual sequences.

Hyperoliid_Probe_Set.fasta
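
The length, GC, and divergence filters described for this marker set can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the data set: the function names are hypothetical, the average divergence is assumed to be precomputed and passed in as a fraction, thresholds are treated as inclusive, and trimming is assumed to keep the first 850 bp (the description does not say which end was trimmed). The last lines simply re-derive the counts reported in the description.

```python
MIN_LEN = 500                 # minimum transcript length (bp)
MAX_LEN = 850                 # transcripts longer than this are trimmed
GC_RANGE = (0.40, 0.70)       # accepted GC content
DIV_RANGE = (0.05, 0.15)      # accepted average divergence


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_filters(seqs, avg_divergence):
    """seqs: the orthologous transcripts for one gene (one per species)."""
    if any(len(s) < MIN_LEN for s in seqs):
        return False
    if any(not GC_RANGE[0] <= gc_fraction(s) <= GC_RANGE[1] for s in seqs):
        return False
    return DIV_RANGE[0] <= avg_divergence <= DIV_RANGE[1]


def trim_for_probes(seq):
    """ASSUMPTION: keep the first MAX_LEN bases of a long transcript."""
    return seq[:MAX_LEN]


# Bookkeeping of the counts reported in the description.
shared = 2444
after_divergence = shared - 266               # divergence filter
after_repeat_screen = after_divergence - 468  # repeat/low-complexity screen
final_markers = 400 + 860 + 5                 # high-divergence + random + controls
print(after_divergence, after_repeat_screen, final_markers, final_markers * 4)
# prints: 2178 1710 1265 5060
```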

Hyperoliid MYbaits-3 custom probe set

This file contains the MYcroarray MYbaits-3 custom bait library design: 60,179 120-mer baits, which provide 2x tiling (a new bait every 60 bp) of the 5,060 sequences. Because the kit allows 60,060 probes, 119 probes were randomly dropped for the final kit design.

bait-120-60.fas
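
The 2x tiling scheme (120-mer baits starting every 60 bp) and the random reduction to kit capacity can be sketched as below. This is an illustration only: the function names are hypothetical, and the handling of trailing bases that do not fill a complete bait is an assumption (they are simply ignored here), so bait counts from this sketch may differ slightly from the actual design. For scale, 60,179 baits over 5,060 sequences is about 11.9 baits per sequence.

```python
import random


def tile_baits(seq, bait_len=120, step=60):
    """Return full-length baits starting every `step` bases.

    ASSUMPTION: a trailing partial window is skipped, not padded or shifted.
    """
    if len(seq) < bait_len:
        return []
    return [seq[i:i + bait_len]
            for i in range(0, len(seq) - bait_len + 1, step)]


def drop_to_capacity(baits, capacity, seed=0):
    """Randomly discard baits so that at most `capacity` remain, keeping order."""
    if len(baits) <= capacity:
        return list(baits)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(baits)), capacity))
    return [baits[i] for i in keep]
```

Under these assumptions an 850 bp target yields 13 baits and a 500 bp target yields 7.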

Captured Exon Alignments

A fasta file of combined aligned sets of captured transcripts (exons only) for the four hyperoliid species used for transcriptome sequencing and probe design. Only markers with all four species present were kept, resulting in 999 transcripts. Names of transcripts correspond to those used in the annotated transcriptomes, orthologous transcript set, and probe set for cross-referencing. The concatenated alignment length of the exon sequences is 592,651 base pairs.

Combined_Target_Output.fasta

Captured Flanking Region Alignments

A fasta file of combined aligned sets of captured flanking regions for the four hyperoliid species used for transcriptome sequencing and probe design. Only markers with all four species present were kept, resulting in 1,071 flanking markers. Names of markers associated with flanking regions correspond to those used in the annotated transcriptomes, orthologous transcript set, and probe set for cross-referencing. The concatenated alignment length of flanking regions is 797,016 base pairs.

Combined_Flanking_Output.fasta
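
The "all four species present" rule and the concatenated alignment length reported for the combined alignment files can be checked with a short script along these lines. This is a sketch under a loud assumption: the header layout of the combined files is not documented here, so `split_header` uses a HYPOTHETICAL "marker|species" layout and must be adapted to the real headers before use.

```python
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """HYPOTHETICAL header layout 'marker|species'; adjust to the real files."""
    marker, species = header.split("|", 1)
    return marker, species


def summarize(handle, n_species=4):
    """Return (number of complete markers, concatenated alignment length)."""
    by_marker = defaultdict(dict)
    for header, seq in read_fasta(handle):
        marker, species = split_header(header)
        by_marker[marker][species] = seq
    complete = {m: s for m, s in by_marker.items() if len(s) == n_species}
    # Aligned sequences of one marker should share a length; use the longest.
    total = sum(max(len(x) for x in seqs.values()) for seqs in complete.values())
    return len(complete), total
```

Opening one of the combined files in text mode and passing the handle to `summarize` would return the marker count and total alignment length for comparison with the figures above, provided the header parsing matches the files.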