Target enrichment of long open reading frames and ultraconserved elements to link microevolution and macroevolution in non-model organisms

Ortiz-Sepulveda, Claudia1; Genete, Mathieu2; Blassiau, Christelle1; Godé, Cécile2; Albrecht, Christian3; Vekemans, Xavier1; Van Bocxlaer, Bert2

Published Nov 10, 2022 on Dryad. https://doi.org/10.5061/dryad.4j0zpc8gc

Abstract

Despite the increasing accessibility of high-throughput sequencing, obtaining high-quality genomic data on non-model organisms without proximate well-assembled and annotated genomes remains challenging. Here we describe a workflow that takes advantage of distant genomic resources and ingroup transcriptomes to select and jointly enrich long open reading frames (ORFs) and ultraconserved elements (UCEs) from genomic samples for integrative studies of microevolutionary and macroevolutionary dynamics. This workflow is applied to samples of the African unionid bivalve tribe Coelaturini (Parreysiinae) at basin and continent-wide scales. Our results indicate that ORFs are efficiently captured without prior identification of intron-exon boundaries. The enrichment of UCEs was less successful but nevertheless produced substantial datasets. Exploratory continent-wide phylogenetic analyses with ORF supercontigs (> 515,000 parsimony informative sites) resulted in a fully resolved phylogeny, the backbone of which was also retrieved with UCEs (> 11,000 informative sites). Variant calling on ORFs and UCEs of Coelaturini from the Malawi Basin produced ~2,000 SNPs per population pair. Estimates of nucleotide diversity and population differentiation were similar for ORFs and UCEs. They were low compared to previous estimates in mollusks, but comparable to those in recently diversifying Malawi cichlids and other taxa at an early stage of speciation. Skimming off-target sequence data from the same enriched libraries of Coelaturini from the Malawi Basin, we reconstructed the maternally-inherited mitogenome, which displays the gene order inferred for the most recent common ancestor of Unionidae. Overall, our workflow and results provide exciting perspectives for integrative genomic studies of microevolutionary and macroevolutionary dynamics in non-model organisms.

Here we briefly describe the provided files. Transcriptome sequencing was performed on 12 Coelaturini specimens, which are indicated in Supplementary Table 1 of Ortiz-Sepulveda et al. (2022). ORF and UCE targets were enriched from genomic DNA libraries of 96 individuals (95 Coelaturini and one iridinid bivalve), which are listed in Supplementary Table 3 of Ortiz-Sepulveda et al. (2022). Of these, 48 individuals were included for phylogenetic analysis at a continental scale, and 48 specimens of Coelaturini from the Malawi Basin were included for population genomics.

Bioinformatic scripts are available via www.github.com/bertvanbocxlaer/target_enrichment_ORF_UCE. Raw sequencing data are available at NCBI under BioProject PRJNA893605. BioSamples SAMN31437443-SAMN31437454 contain raw sequencing reads from RNA-seq, whereas raw sequencing reads from target enrichment are available under SAMN31439307-SAMN31439402. Below is information on each file uploaded to Dryad:

Ortiz-Sepulveda_et_al-CoelaturiniSupertranscriptome_95p_ORFs_95p.fa

This file contains all open reading frames (ORFs) that were considered for our orthology assessment. More precisely, we first clustered our 12 de novo assembled transcriptomes into a supertranscriptome by grouping all contigs with >95% similarity and representing each group by the longest contig. Subsequently, we predicted ORFs on these contigs and selected the best-supported ORF per contig. This pool of ORFs was then clustered again to retain homologous ORF clusters, each of which contains all ORFs with >95% similarity. This resulting library of ORF homologs was then used for comparisons to the BUSCO Metazoa_odb9 database (Simão et al., 2015; Waterhouse et al., 2017) and the Unioverse probe set (Pfeiffer et al., 2019).
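
To illustrate the ORF prediction step described above, the following is a minimal sketch of ORF detection on a single contig. It is an assumption-laden stand-in and not the procedure used to build this file: it picks the longest ATG-to-stop ORF across six reading frames, whereas the dataset retained the best-supported ORF per contig. The function names and the demo contig are hypothetical.

```python
# Minimal sketch: longest ORF (ATG ... stop codon) over six reading frames.
# Assumption: "longest" is used here only as a simple proxy; the dataset
# itself retained the best-supported ORF per contig.

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield every ORF (start codon through stop codon) in one forward frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield seq[start:i + 3]
            start = None


def longest_orf(contig):
    """Return the longest complete ORF on either strand, or '' if none exists."""
    seq = contig.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for orf in orfs_in_frame(strand, frame):
                if len(orf) > len(best):
                    best = orf
    return best


demo_contig = "CCATGAAATTTGGGTAACC"  # hypothetical contig
demo_orf = longest_orf(demo_contig)
print(demo_orf)
```

ORFs that run off the end of a contig without a stop codon are ignored in this sketch; dedicated ORF predictors typically handle such partial ORFs explicitly.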

Ortiz-Sepulveda_et_al-FinalTargets_ORFs_UCEs.fasta

This file contains the final ORF and UCE targets considered for probe design with a MyBaits custom kit for target enrichment. Each .fasta sequence represents a complete target. Targets that were selected for probe design but for which no probes could be designed are still present in this file. This concerns three UCE targets: >slice-1603-chr13-pos-2343997-2344108-6-UCE-str, >slice-1421-chr9-pos-9900750-9900854-6-UCE-str and >slice-3362-chr5-pos-36441263-36441395-5-UCE-opt.
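
Users who only want the targets for which probes exist may wish to drop the three UCE targets named above. The following is a minimal sketch of one way to do so using only the Python standard library. It assumes that each record identifier is the header text up to the first whitespace and that it matches the names above exactly; the function names and the demo records are hypothetical.

```python
import io

# The three UCE targets listed in this README as having no designed probes.
NO_PROBE_TARGETS = {
    "slice-1603-chr13-pos-2343997-2344108-6-UCE-str",
    "slice-1421-chr9-pos-9900750-9900854-6-UCE-str",
    "slice-3362-chr5-pos-36441263-36441395-5-UCE-opt",
}


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record identifier to sequence."""
    chunks = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the header up to the first whitespace.
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def targets_with_probes(records):
    """Return only the records that are not in the no-probe list."""
    return {n: s for n, s in records.items() if n not in NO_PROBE_TARGETS}


# Hypothetical two-record example; real use would pass an open file handle,
# e.g. open("Ortiz-Sepulveda_et_al-FinalTargets_ORFs_UCEs.fasta").
demo = io.StringIO(
    ">slice-1603-chr13-pos-2343997-2344108-6-UCE-str\nACGT\nAC\n"
    ">hypothetical_ORF_target\nATGAAATAA\n"
)
records = read_fasta(demo)
kept = targets_with_probes(records)
print(sorted(kept))
```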

Ortiz-Sepulveda_et_al-Additional_Data_Tables.xlsx

This file contains four additional data tables. Additional tables 1 and 2 contain data on specimen representation, alignment length, and the proportion of informative and missing sites for each ORF and UCE locus, respectively. Additional tables 3 and 4 contain the results of the partition analysis using Modeltest in IQTREE for ORFs and UCEs, respectively. For ORFs we obtained a 402-partition scheme with 45 unique substitution models, whereas for UCEs we obtained a 155-partition scheme with 51 unique substitution models.

Ortiz-Sepulveda_et_al-Coelaturini_Mitogenome_FINAL.fasta

This file contains the entire female-inherited mitogenome obtained for Coelaturini from the Malawi Basin through iterative rounds of genome skimming.

Ortiz-Sepulveda_et_al-Coelaturini_Mitogenes_FINAL.fasta

This file contains the sequences of the genes of the female-inherited mitogenome of Coelaturini from the Malawi Basin.

Ortiz-Sepulveda_et_al-Mitogenome_annotations_FINAL.gff

This file contains the annotation of the entire female-inherited mitogenome obtained for Coelaturini from the Malawi Basin.

no_SRR_0.99_lognorms-cutoff-no_ref_mafft-clean-50p-concat_trimmed.fasta-out.fas

This file contains the final concatenated alignment of our open reading frames (after all cleaning procedures) as used for phylogenetic reconstruction.

0.99_lognorms-cutoff-mafft-fasta-notrim-clean-50p-concat-trim.fasta

This file contains the final concatenated alignment of our ultraconserved elements (after all cleaning procedures) as used for phylogenetic reconstruction.

Capture_ORF_45_filtered_08.vcf.gz

This file contains the final variant calling information for our open reading frames (exons only, without intronic/intergenic flanking regions) that is used as input for Pixy to calculate nucleotide diversity (π), the absolute nucleotide divergence between population pairs (D_XY) and population differentiation (F_ST).

Capture_UCE_45ind_filtered_08.vcf.gz

This file contains the final variant calling information for our ultraconserved elements (with flanking regions) that is used as input for Pixy to calculate nucleotide diversity (π), the absolute nucleotide divergence between population pairs (D_XY) and population differentiation (F_ST).
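
For readers unfamiliar with the statistics named above, the following is a minimal sketch of the general definitions of π and D_XY as average pairwise differences per compared site. It is not Pixy's implementation and does not read VCF files. It assumes haplotypes are given as equal-length lists of allele codes that include invariant sites, with None marking missing data; the function names and the toy populations are hypothetical. F_ST is omitted because it depends on the choice of estimator.

```python
from itertools import combinations, product


def _pairwise_rate(pairs):
    """Differences divided by comparisons over all haplotype pairs and sites."""
    differences = 0
    comparisons = 0
    for hap_a, hap_b in pairs:
        for allele_a, allele_b in zip(hap_a, hap_b):
            if allele_a is None or allele_b is None:
                continue  # skip sites with missing data in either haplotype
            comparisons += 1
            differences += allele_a != allele_b
    return differences / comparisons if comparisons else float("nan")


def nucleotide_diversity(haplotypes):
    """pi: average pairwise differences per site within one population."""
    return _pairwise_rate(combinations(haplotypes, 2))


def absolute_divergence(pop_x, pop_y):
    """D_XY: average pairwise differences per site between two populations."""
    return _pairwise_rate(product(pop_x, pop_y))


# Toy data (hypothetical): four sites, 0/1 allele codes, None = missing.
pop_a = [[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, None]]
pop_b = [[1, 0, 0, 0], [1, 0, 0, 0]]

pi_a = nucleotide_diversity(pop_a)
dxy_ab = absolute_divergence(pop_a, pop_b)
print(pi_a, dxy_ab)
```

Counting comparisons per site, rather than dividing by a fixed sequence length, keeps missing genotypes from biasing the estimates downward; this is why invariant sites need to be represented in the input.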

clean_80_mac3_45_m95_maf_005_8.recode.vcf

This file contains the final variant calling information for our open reading frames (without intronic/intergenic flanking regions) that is used to examine genetic structure with principal component analysis and fastSTRUCTURE.

References

Pfeiffer, J. M., Breinholt, J. W., & Page, L. M. (2019). Unioverse: a phylogenomic resource for reconstructing the evolution of freshwater mussels (Bivalvia, Unionoida). Molecular Phylogenetics and Evolution, 137, 114-126.

Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19), 3210-3212.

Waterhouse, R. M., Seppey, M., Simão, F. A., Manni, M., Ioannidis, P., Klioutchnikov, G., . . . Zdobnov, E. M. (2017). BUSCO applications from quality assessments to gene prediction and phylogenomics. Molecular Biology and Evolution, 35(3), 543-548.
