Alignments for probes, raw WGS reads, and WGS assemblies

Citation

Nydam, Marie et al. (2021), Alignments for probes, raw WGS reads, and WGS assemblies, Dryad, Dataset, https://doi.org/10.5061/dryad.3r2280gf7

Abstract

Ascidians (Phylum Chordata, Class Ascidiacea) are a large group of invertebrates that occupy a central role in the ecology of marine benthic communities. Many ascidian species have been successfully introduced around the world via anthropogenic vectors. The botryllid ascidians (Order Stolidobranchia, Family Styelidae) are a group of 53 colonial species, several of which are widespread throughout temperate or tropical and subtropical waters. However, the systematics and biology of this group of ascidians are not well understood. To provide a systematic framework for this group, we have constructed a well-resolved phylogenomic tree using 200 novel loci and 55 specimens. A Principal Components Analysis of all species described in the literature using 31 taxonomic characteristics revealed that some species occupy a unique morphological space and can be easily identified using characteristics of adult colonies. For other species, additional information such as larval or life history characteristics may be required for taxonomic discrimination. Molecular barcodes are critical for guiding the delineation of morphologically similar species in this group.

Methods

Anchored Hybrid Enrichment (AHE) Locus Identification and Probe Design

Our aim was to develop a resource for collecting hundreds of orthologous loci across the botryllid ascidians using Anchored Hybrid Enrichment (AHE) [70]. The pre-existing genomic resources included an assembled genome of Botryllus schlosseri [71] and two assembled transcriptomes: Botryllus schlosseri [72] and Botrylloides leachii [73], recently re-assigned to Botrylloides diegensis in [23]. To better represent the high diversity of the botryllid group, we collected low-coverage whole-genome sequence data for seven additional species (details are given in Supplementary Table S2). DNA extracts for these seven species were sent to the Center for Anchored Phylogenomics (http://www.anchoredphylogeny.com) for processing. In brief, after the quality and quantity of DNA were assessed using Qubit, Illumina libraries with single 8 bp indexes were prepared following [74], with modifications described in [75]. Libraries were pooled and sequenced on two Illumina HiSeq2500 lanes with a paired-end 150 bp protocol. A total of 125 Gb of data was collected, yielding 25-65x coverage per species. Reads were filtered for quality using the CASAVA high chastity filter, demultiplexed with no mismatches tolerated, and merged to remove sequence adapters [76] prior to downstream processing.

To identify suitable conserved targets for AHE, we performed a reciprocal blast (blastn) on local machines at the Center for Anchored Phylogenomics using the two assembled transcriptomes. Using the results from the blast searches, we identified 482 preliminary targets with matching transcripts, which we aligned using MAFFT v7.023b [77]. Alignments were manually inspected in Geneious (vR9, Biomatters Ltd., Kearse et al. 2012), then trimmed to regions that were well-aligned. For the remainder of the locus development and identification, we followed the protocol outlined in [78]. More specifically, we isolated the Botryllus schlosseri (transcriptome) sequences from the aforementioned alignments and, using those as references, scanned the Botryllus schlosseri genome for the AHE regions. Regions of 10,000 bp containing a 17 of 20 initial spaced k-mer match, followed by a 55 of 100 confirmation match to one of the references, were kept. The k-mers of a sequence are all of its contiguous subsequences of length k. For example, the sequence GCTA contains the 1-mers G, C, T, and A; the 2-mers GC, CT, and TA; the 3-mers GCT and CTA; and the single 4-mer GCTA. K-mers from the Botryllus schlosseri transcriptome were used to search the Botryllus schlosseri genome for AHE regions, and matches were based on spaced seeds as described in [79]. We then aligned (using MAFFT) the best matching genome sequence for each locus to the two transcriptome-derived sequences for that locus. Using Geneious (vR9, Biomatters Ltd.), we identified well-aligned regions of each three-sequence alignment and trimmed the alignments accordingly. Each three-sequence alignment contained only two species: Botryllus schlosseri and Botrylloides leachii (recently re-assigned as Botrylloides diegensis) [23].
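As an illustration of the k-mer and spaced-seed ideas described above, the following Python sketch enumerates k-mers and screens two sequences with a spaced seed followed by a confirmation step. It is not the pipeline from [78] or [79]. The mask MASK, the function names, the reading of "17 of 20" as 17 sampled positions within a 20 bp window, and the reading of "55 of 100" as at least 55 identical positions within a 100 bp window are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical spaced-seed mask: 20 positions, 17 of which ('1') must match.
MASK = "11101111011111101111"


def kmers(seq, k):
    """Return every contiguous length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def masked_key(window, mask):
    """Keep only the characters of window at positions where mask is '1'."""
    return "".join(c for c, m in zip(window, mask) if m == "1")


def seed_hits(ref, target, mask=MASK):
    """Return (ref_pos, target_pos) pairs whose windows agree at all sampled positions."""
    width = len(mask)
    index = defaultdict(list)
    for i in range(len(ref) - width + 1):
        index[masked_key(ref[i:i + width], mask)].append(i)
    hits = []
    for j in range(len(target) - width + 1):
        key = masked_key(target[j:j + width], mask)
        for i in index.get(key, []):
            hits.append((i, j))
    return hits


def confirm(ref, target, i, j, span=100, min_identical=55):
    """Assumed confirmation rule: at least min_identical matches in a span-bp window."""
    a, b = ref[i:i + span], target[j:j + span]
    if len(a) < span or len(b) < span:
        return False
    return sum(x == y for x, y in zip(a, b)) >= min_identical
```

For instance, kmers("GCTA", 2) returns ['GC', 'CT', 'TA']. Because a spaced seed ignores the unsampled positions, two windows that differ only at a masked-out site still produce a hit, which is what makes this kind of seed tolerant of divergence between the reference and the target.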

To incorporate whole genome sequencing (WGS) data from the seven additional species, we used the Botrylloides leachii and Botryllus schlosseri sequences in the alignments as references. Each WGS read was checked against the reference database, and reads with a preliminary 17 of 20 initial spaced k-mer match, followed by a final 55 of 100 bp consecutive match, were retained, then aligned by locus to form seeds for an extension assembly that allowed flanking regions to be recovered (see [78] for details and scripts). To construct the final alignments, the (up to) 10 sequences for each locus were aligned in MAFFT, then trimmed to well-aligned regions after inspection in Geneious (vR9, Biomatters Ltd.). To avoid problems associated with missing data in downstream projects [80], loci represented by fewer than 50% of the sequences in the alignment were removed from downstream analysis. When alignments from two loci were found to be overlapping (i.e. containing some of the same 20-mers), one locus was removed to ensure that each locus was a unique target. Lastly, we checked for repetitive elements by profiling the k-mers found in the alignments with respect to their occurrence in the WGS reads. Regions with substantially elevated k-mer coverage were masked. A total of 200 AHE targets resulted from the process. Supplementary Table S3 contains the size of each locus and its genomic position in the Botryllus schlosseri genome, as determined from the best blastn match to the Botryllus schlosseri genome assembly using the locus sequence as a query. Finally, in-silico probes were tiled uniformly across the 10 sequences for each locus at 3.5x coverage depth. A total of 54,350 probes covered the 200 AHE targets (total target size ~139 kb). These loci were successfully amplified in Symplegma brakenhielmi to provide an outgroup for the phylogenomic tree. These loci will therefore be useful for Symplegma, which is the sister group to the Botrylloides/Botryllus clade [81]. The utility of these loci beyond the genera Botrylloides, Botryllus and Symplegma has not been investigated.
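The two locus-level filters described above (removing poorly represented loci and removing one of any pair of loci that share 20-mers) can be sketched in Python as follows. This is a minimal illustration, not the scripts from [78]. The data layout (a dictionary of locus name to a dictionary of taxon to sequence), the function names, and the choice to keep whichever overlapping locus is encountered first are assumptions, since the text does not state which locus of an overlapping pair was dropped.

```python
def kmer_set(seq, k=20):
    """Set of all k-mers in seq after stripping alignment gaps."""
    s = seq.replace("-", "").upper()
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def filter_loci(loci, n_taxa, min_fraction=0.5, k=20):
    """Drop loci with too few sequences or with k-mers shared with a kept locus.

    loci: dict mapping locus name -> dict mapping taxon -> aligned sequence.
    n_taxa: number of sequences expected per locus (up to 10 in the text).
    """
    kept = {}
    seen = set()  # k-mers from every locus kept so far
    for name, alignment in loci.items():
        # A taxon counts as present only if it has at least one non-gap base.
        present = [s for s in alignment.values() if s.replace("-", "")]
        if len(present) / n_taxa < min_fraction:
            continue  # represented by fewer than 50% of the sequences
        locus_kmers = set().union(*(kmer_set(s, k) for s in present))
        if locus_kmers & seen:
            continue  # shares a 20-mer with an earlier locus (assumed tie-break)
        seen |= locus_kmers
        kept[name] = alignment
    return kept
```

Tracking one running set of k-mers keeps the overlap check to a single pass over the loci, at the cost of making the result depend on iteration order.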

Raw Read Alignment

Sequence reads were demultiplexed with no mismatches tolerated and filtered for quality using the Illumina CASAVA pipeline with a high chastity setting. Overlapping reads were identified and merged using the approach described by [76]. This process removes sequence adapters and corrects sequencing errors in overlapping regions. Reads were then assembled using the quasi-de novo approach described by [78]. This assembly approach uses divergent references to identify reads from conserved regions, to which further reads can be mapped. The mapped reads are in turn used as references when the assembly is extended into less conserved regions (see [78] for details). Probe region sequences from eight of the nine species used in the probe design were used as references for the initial mapping, while sequences from the Botryllus schlosseri genome (the ninth species) served as the primary reference. Consensus sequences were constructed from assembly clusters containing an average of more than 250 reads. Ambiguity codes were employed for sites at which base frequencies could not be explained by a 1% sequencing error rate.
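One way to picture the ambiguity-code rule is the Python sketch below, which calls a consensus base per site from read counts. The statistical test is an assumption: the text only says that base frequencies had to be inconsistent with a 1% sequencing error, so this example uses a one-sided binomial tail with a hypothetical significance threshold alpha = 0.05. The function names are likewise illustrative; only the IUPAC codes themselves are standard.

```python
from math import exp, lgamma, log

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases called.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed in log space to avoid overflow."""
    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    return sum(exp(log_pmf(i)) for i in range(k, n + 1))


def call_site(counts, error_rate=0.01, alpha=0.05):
    """Call one site from a dict of base -> read count.

    A minor base is added to the call when seeing that many copies would be
    unlikely (tail probability < alpha, an assumed threshold) if every copy
    were a sequencing error at the given rate.
    """
    depth = sum(counts.values())
    if depth == 0:
        return "N"
    major = max(counts, key=counts.get)
    called = {major}
    for base, count in counts.items():
        if base != major and count > 0 and binom_tail(depth, count, error_rate) < alpha:
            called.add(base)
    return IUPAC[frozenset(called)]


def consensus(columns):
    """Join per-site calls into a consensus string."""
    return "".join(call_site(c) for c in columns)
```

Under these assumptions, a site with 299 A reads and one C read is called A, because a single discordant read in 300 is well within what a 1% error rate predicts, whereas an even split between A and G is called R.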

Funding

National Science Foundation, Award: EPSCoR Research Enhancement Grant

National Science Foundation, Award: FSML Grant 0435033

National Science Foundation, Award: DBI 1257630

California Sea Grant, University of California, San Diego

National Science Foundation, Award: Biotic Survey VIP Expedition