Supporting data for: The de novo genome of the Black-necked Snakefly (Venustoraphidia nigricollis Albarda, 1891): A resource to study the evolution of living fossils
Data files
Dec 05, 2023 version files 1.35 GB
Abstract
Snakeflies (Raphidioptera) are the smallest order of holometabolous insects. They have kept their distinct and name-giving appearance since the Mesozoic, probably since the Jurassic, and possibly even since their emergence in the Carboniferous, more than 300 million years ago. Despite their interesting nature and numerous publications on their morphology, taxonomy, systematics, and biogeography, snakeflies have never received much attention from the general public, and only a few studies have been devoted to their molecular biology. Because of this lack of molecular data, it remains unknown whether the conserved morphology of these living fossils translates to conserved genomic structures. Here, we present the first genome of the species and of the entire order Raphidioptera. The final genome assembly has a total length of 669 Mbp and reached a high continuity with an N50 of 5.07 Mbp. Further quality controls also indicate high completeness and no meaningful contamination. The newly generated data were used in a large-scale phylogenetic analysis of snakeflies based on shared orthologous sequences. Quartet score and gene-concordance analyses revealed high amounts of conflicting signal within this group, which may indicate substantial incomplete lineage sorting and introgression following their presumed re-radiation after the asteroid impact 66 million years ago. Overall, this reference genome will be a door-opening dataset for many future research applications, and we demonstrate its utility in a phylogenetic analysis that provides new insights into the evolution of this group of living fossils.
README: Supporting Data for: The de novo genome of the Black-necked Snakefly (Venustoraphidia nigricollis, Albarda, 1891): A resource to study the evolution of living fossils.
Summary:
This dataset contains a de novo genome assembly and a detailed annotation of the Black-necked Snakefly (Venustoraphidia nigricollis). The genome was assembled from PacBio HiFi reads with the software hifiasm. Repeats were first modeled de novo using RepeatModeler. The resulting models were combined with the Hexapoda repeat library of RepBase, and both were used in a RepeatMasker run to mask the respective repetitive elements in the de novo assembly. The annotation was done using the BRAKER pipeline with transcriptome and protein data. A functional annotation generated with InterProScan is also included. Furthermore, this dataset includes a collection of high-quality orthologous sequences shared among all Raphidiidae, collected using the BUSCO-to-Phylogeny pipeline and publicly available transcriptome data. These orthologs are provided as amino acid fasta sequences together with a collection of gene trees constructed using IQ-TREE and a maximum likelihood approach with 1000 bootstrap replications. The orthologs were used for a large-scale phylogenetic analysis of conflicting signals within the Raphidioptera.
File content and usage:
Vnig_TBG3334_EichkAsp_Assembly.fasta.gz: A zipped fasta file containing the original de novo assembly, without the changes made by NCBI, as used for all downstream analyses. Use gunzip Vnig_TBG3334_EichkAsp_Assembly.fasta.gz to extract the original file.
Vnig_TBG3334_EichkAsp_REPEAT.cat.gz: A zipped file containing the extensive output of RepeatMasker run on the original de novo assembly, using repeat models created with RepeatModeler together with Hexapoda repeat sequences from RepBase release 27.06. Use gunzip Vnig_TBG3334_EichkAsp_REPEAT.cat.gz to extract the original file.
Vnig_TBG3334_EichkAsp_REPEAT.out.gz: A zipped file containing the RepeatMasker output in table format. Use gunzip Vnig_TBG3334_EichkAsp_REPEAT.out.gz to extract the original file.
Vnig_TBG3334_EichkAsp_REPEAT.tbl.gz: A zipped table containing the summary information of RepeatMasker run on the original de novo assembly. Use gunzip Vnig_TBG3334_EichkAsp_REPEAT.tbl.gz to extract the original file.
Vnig_TBG3334_EichkAsp_Anno-AA.fasta.gz: A zipped fasta file containing the amino acid sequences annotated using the BRAKER pipeline (Gabriel et al., 2023), featuring transcriptome data from (Vasilikopoulos et al., 2020) and protein data from OrthoDB (Kriventseva et al., 2019). Use gunzip Vnig_TBG3334_EichkAsp_Anno-AA.fasta.gz to extract the original file.
Vnig_TBG3334_EichkAsp_Anno-CDS.fasta.gz: A zipped fasta file containing the coding sequences annotated using the BRAKER pipeline (Gabriel et al., 2023), featuring transcriptome data from (Vasilikopoulos et al., 2020) and protein data from OrthoDB (Kriventseva et al., 2019). Use gunzip Vnig_TBG3334_EichkAsp_Anno-CDS.fasta.gz to extract the original file.
Vnig_TBG3334_EichkAsp_Anno-MODELS.gtf.gz: A zipped gtf file containing positional numbers and length data for the coding sequences annotated using the BRAKER pipeline (Gabriel et al., 2023), featuring transcriptome data from (Vasilikopoulos et al., 2020) and protein data from OrthoDB (Kriventseva et al., 2019). Use gunzip Vnig_TBG3334_EichkAsp_Anno-MODELS.gtf.gz to extract the original file.
Vnig_TBG3334_EichkAsp_Anno-MODELS.gff3.gz: A zipped gff3 file containing positional numbers and length data for the coding sequences annotated using the BRAKER pipeline (Gabriel et al., 2023), featuring transcriptome data from (Vasilikopoulos et al., 2020) and protein data from OrthoDB (Kriventseva et al., 2019). Use gunzip Vnig_TBG3334_EichkAsp_Anno-MODELS.gff3.gz to extract the original file.
Vnig_TBG3334_EichkAsp_Anno-FUNCTIONS.gff3.gz: A zipped gff3 file containing the functional annotations, generated with InterProScan (Jones et al., 2014), for the genes annotated with the BRAKER pipeline. Use gunzip Vnig_TBG3334_EichkAsp_Anno-FUNCTIONS.gff3.gz to extract the original file.
Vnig_TBG3334_EichkAsp_Phylo-Orthologs.tar.gz: A compressed tarball containing 381 ortholog sequences in fasta format that were identified using the BUSCO-to-Phylogeny pipeline (Schneider et al., 2021) from 15 Raphidioptera transcriptomes and the genome assembled in this project. Use tar -xzvf Vnig_TBG3334_EichkAsp_Phylo-Orthologs.tar.gz Vnig_TBG3334_EichkAsp_Phylo-Orthologs to extract the tarball.
Vnig_TBG3334_EichkAsp_Phylo-Genetrees.tar.gz: A compressed tarball containing 381 gene trees in newick format (a simple text-based tree format) that were constructed using the BUSCO-to-Phylogeny pipeline (Schneider et al., 2021) from 15 Raphidioptera transcriptomes and the genome assembled in this project. Use tar -xzvf Vnig_TBG3334_EichkAsp_Phylo-Genetrees.tar.gz Vnig_TBG3334_EichkAsp_Phylo-Genetrees to extract the tarball, and a tree visualization program (e.g. iTOL) to view the individual gene trees.
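The zipped assembly listed above can also be inspected without extracting it first. The following is a minimal Python sketch, not part of the original workflow, that reads a gzip-compressed fasta file and computes the N50 of its sequences. The function names contig_lengths and n50 are illustrative, and the N50 definition used here (the length of the shortest contig among the longest contigs that together cover at least half of the total length) is the common one and is assumed to match the one used for the reported value.

```python
import gzip


def contig_lengths(fasta_gz_path):
    """Return the length of every sequence in a gzip-compressed fasta file."""
    lengths = []
    current = 0
    seen_header = False
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            elif seen_header:
                current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the cumulative sum of the
    length-sorted contigs first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For example, n50(contig_lengths("Vnig_TBG3334_EichkAsp_Assembly.fasta.gz")) would return the N50 in base pairs for the assembly file of this dataset.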
Corresponding author:
Magnus Wolf, magnus.wolf@senckenberg.de
Methods
The de novo reference genome was sequenced with PacBio HiFi reads. All HiFi reads were assembled using hifiasm 0.16.1 (Cheng et al., 2021; Cheng et al., 2022). Raw primary contigs were filtered for contamination using blobtools 1.1.1 (Laetsch & Blaxter, 2017). The filtered contigs were then polished using all HiFi reads. For this, the HiFi reads were first mapped to the filtered contigs using minimap2 2.24 (Li, 2018) with the options "-a -x map-hifi". The mapping results were sorted by coordinates using samtools 1.15 with the options "-l 9 -O BAM". Duplicates were removed using picard 2.26.10 MarkDuplicates (https://github.com/broadinstitute/picard) with the option "--REMOVE_DUPLICATES". The assembly fasta file and the duplicate-filtered bam file were indexed with samtools faidx and samtools index, respectively. Variants were identified using DeepVariant 1.2 (https://github.com/google/deepvariant) with the option "--model_type=PACBIO". Heterozygous variants were then filtered out, retaining only passing homozygous alternative calls, with bcftools 1.15 (Danecek et al., 2021) using the command "view" with the options "-f 'PASS' -i 'GT="1/1"' --no-version -Oz". The compressed vcf file was indexed using tabix from HTSlib 1.15 (Bonfield et al., 2021). Finally, bcftools consensus was used to generate the polished contigs from the filtered hifiasm contigs and the filtered variant set.
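To illustrate the variant filter used during polishing, the following Python sketch mimics the selection logic on plain vcf text lines: a record is kept only if its FILTER column is PASS and its genotype is 1/1. This is a simplified illustration rather than a replacement for bcftools. The function names are hypothetical, and the sketch assumes a single-sample vcf with unphased genotypes.

```python
def is_homozygous_alt_pass(vcf_line):
    """True for a vcf data line with FILTER == PASS and genotype 1/1.

    Assumes a single-sample vcf: column 9 is FORMAT, column 10 the sample.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False
    filter_field = fields[6]
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    if filter_field != "PASS" or "GT" not in format_keys:
        return False
    return sample_values[format_keys.index("GT")] == "1/1"


def filter_vcf_lines(lines):
    """Keep all header lines plus the data lines that pass the filter."""
    return [
        line for line in lines
        if line.startswith("#") or is_homozygous_alt_pass(line)
    ]
```

Applying only homozygous alternative calls means that positions where the reads consistently disagree with the contig are corrected, while heterozygous sites are left as assembled.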
Repeats specific to V. nigricollis were identified using RepeatModeler 2.0.1 (Flynn et al., 2020) in combination with RepeatMasker 4.1.0 (www.repeatmasker.org/RepeatMasker/), RECON 1.08 (Bao & Eddy, 2002), RepeatScout 1.0.6 (Price et al., 2005), Tandem Repeats Finder 4.10 (Benson, 1999) and RMBlast 2.11.0+ (www.repeatmasker.org/rmblast/). RepeatModeler was run with the options "-pa 16 -LTRStruct". The resulting repeat families were combined with all Hexapoda repeat sequences from RepBase release 27.06 (Bao et al., 2015) and used as input for RepeatMasker 4.1.2 with the options "-xsmall -no_is -e ncbi -pa 16 -s".
The soft-masked genome assembly was used for gene annotation as implemented in the BRAKER3 pipeline (Gabriel et al., 2023). This approach combines de novo gene calling, transcriptome-based gene annotation using the transcriptome of V. nigricollis (Vasilikopoulos et al., 2020), and homology-based gene annotation. For protein references, we used the Arthropoda-specific protein collection from OrthoDB, following the recommendations in the BRAKER user guide (www.github.com/Gaius-Augustus/BRAKER). The resulting proteome was tested for completeness using BUSCO v.5.4.75.3.1 (Manni et al., 2021) in “protein mode”, run against the insect-specific set of core genes. Functional annotation was done using InterProScan v5 (Jones et al., 2014).
Phylogenetic reconstruction was performed using the BUSCO-to-Phylogeny wrapper function (Schneider et al., 2021). The applied code is available at www.github.com/mag-wolf/BUSCO-to-Phylogeny. Publicly available transcriptome data (Table S1) of other Raphidioptera species were downloaded from NCBI SRA, and short reads were assembled using Trinity v2.8.5 (Grabherr et al., 2011) with default parameters. The resulting transcriptomes, as well as the genome assembly constructed here, were annotated using the BUSCO v5.4.3 (Manni et al., 2021) function in short mode, restricted to the insecta_odb10 dataset of OrthoDB (Kriventseva et al., 2019). We extracted single copy orthologous sequences (SCOS) with no more than 25 % missing species, and the orthologous sequences were aligned using MAFFT v7.475 (Katoh & Standley, 2013) with 1000 iterative refinements. Alignments were trimmed using ClipKIT v1.1.3 (Steenwyk et al., 2020) in the “kpic-smart-gap” mode to allow for an additional smart-gap-based trimming. Based on the trimmed alignments, gene trees were constructed using IQ-TREE v2.1.2 (Minh et al., 2020) with 1000 bootstrap replications each. We further filtered gene trees and alignments based on the maximum likelihood genetic distance calculated by IQ-TREE. To do this, we removed orthologs below the 5 % quantile and above the 95 % quantile, to avoid including misalignments and sequences with too little information for a meaningful tree construction.
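The two ortholog filters described above (at most 25 % missing species, and removal of the tails of the genetic distance distribution) can be sketched in Python as follows. This is an illustration under assumptions and not the code of the BUSCO-to-Phylogeny pipeline: the function names are hypothetical, each ortholog is assumed to be summarized by a single distance value, the quantiles are estimated with the default method of Python's statistics module, and values equal to a cut point are kept.

```python
from statistics import quantiles


def passes_occupancy(n_present, n_total, max_missing=0.25):
    """True if the fraction of missing species does not exceed max_missing."""
    return (n_total - n_present) / n_total <= max_missing


def trim_by_distance(distances, lower_pct=5, upper_pct=95):
    """Drop orthologs below the lower or above the upper percentile.

    distances: dict mapping an ortholog id to one distance value
    (which summary value is used per ortholog is an assumption here).
    """
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = quantiles(distances.values(), n=100)
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return {name: d for name, d in distances.items() if low <= d <= high}
```

With 16 taxa (15 transcriptomes plus the genome), the occupancy criterion corresponds to an ortholog being present in at least 12 of them.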
References
- Baid G, Cook DE, Shafin K, Yun T, Llinares-López F, Berthet Q, Belyaeva A, Töpfer A, Wenger AM, Rowell WJ et al. 2023. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nature Biotechnology 41: 232–238.
- Bao W, Kojima KK, Kohany O. 2015. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6: 11.
- Bao Z, Eddy SR. 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome research 12: 1269–1276.
- Benson G. 1999. Tandem repeats finder. A program to analyze DNA sequences. Nucleic acids research 27: 573–580.
- Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, Keane T, Davies RM. 2021. HTSlib. C library for reading/writing high-throughput sequencing data. GigaScience 10.
- Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18: 170–175.
- Cheng H, Jarvis ED, Fedrigo O, Koepfli K-P, Urban L, Gemmell NJ, Li H. 2022. Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology 40: 1332–1335.
- Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM et al. 2021. Twelve years of SAMtools and BCFtools. GigaScience 10.
- Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences of the United States of America 117: 9451–9457.
- Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, Stanke M. 2023. BRAKER3. Fully Automated Genome Annotation Using RNA-Seq and Protein Evidence with GeneMark-ETP, AUGUSTUS and TSEBRA. bioRxiv.
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q et al. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644–652.
- Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G et al. 2014. InterProScan 5. Genome-scale protein function classification. Bioinformatics (Oxford, England) 30: 1236–1240.
- Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7. Improvements in performance and usability. Molecular biology and evolution 30: 772–780.
- Kriventseva EV, Kuznetsov D, Tegenfeldt F, Manni M, Dias R, Simão FA, Zdobnov EM. 2019. OrthoDB v10. Sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic acids research 47: D807-D811.
- Laetsch DR, Blaxter ML. 2017. BlobTools. Interrogation of genome assemblies. F1000Research 6: 1287.
- Li H. 2018. Minimap2. Pairwise alignment for nucleotide sequences. Bioinformatics (Oxford, England) 34: 3094–3100.
- Manni M, Berkeley MR, Seppey M, Zdobnov EM. 2021. BUSCO. Assessing Genomic Data Quality and Beyond. Current protocols 1: e323.
- Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, Haeseler A von, Lanfear R. 2020. IQ-TREE 2. New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular biology and evolution 37: 1530–1534.
- Price AL, Jones NC, Pevzner PA. 2005. De novo identification of repeat families in large genomes. Bioinformatics (Oxford, England) 21 Suppl 1: i351-8.
- Schneider C, Woehle C, Greve C, D'Haese CA, Wolf M, Hiller M, Janke A, Bálint M, Huettel B. 2021. Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola). GigaScience 10.
- Steenwyk JL, Buida TJ, Li Y, Shen X-X, Rokas A. 2020. ClipKIT. A multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS Biology 18: e3001007.
- Vasilikopoulos A, Misof B, Meusemann K, Lieberz D, Flouri T, Beutel RG, Niehuis O, Wappler T, Rust J, Peters RS et al. 2020. An integrative phylogenomic approach to elucidate the evolutionary history and divergence times of Neuropterida (Insecta: Holometabola). BMC Evolutionary Biology 20: 64.
Usage notes
Zipped fasta, cat, out, tbl, gtf and gff3 files must be extracted using gzip before usage. For a simple inspection, any text editor is sufficient to open the extracted files. For further inspection of fasta files we recommend an alignment viewing tool like AliView. Beyond inspection, these files are typically used as input for downstream analyses, for which many tools are available.
Compressed tarballs must be extracted with “tar -x” to access the directories they hold. The text files within these directories (fasta and newick files) can again be opened with plain text editors. Newick files can be visualized in a tree visualization tool like iTOL or FigTree.
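As an alternative to extracting the tarballs on the command line, their text files can be read directly in a script. The following Python sketch uses the standard tarfile module to load every regular file of a .tar.gz archive into memory and to count the tips of a newick tree. The function names are illustrative, and the tip count is a simple heuristic that assumes no commas occur inside quoted taxon labels.

```python
import tarfile


def read_text_members(tar_gz_path):
    """Return {member name: text} for every regular file in a .tar.gz archive."""
    contents = {}
    with tarfile.open(tar_gz_path, "r:gz") as archive:
        for member in archive:
            if member.isfile():
                raw = archive.extractfile(member).read()
                contents[member.name] = raw.decode("utf-8")
    return contents


def count_leaves(newick):
    """Number of tips in a newick string (commas + 1).

    Assumes that taxon labels contain no commas.
    """
    return newick.count(",") + 1
```

For example, read_text_members("Vnig_TBG3334_EichkAsp_Phylo-Genetrees.tar.gz") would return the gene trees of this dataset as strings keyed by their path inside the archive.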