EVE: Software to identify novel viral insertions in wild-caught arthropod hosts from next-generation short read data
Data files
Oct 07, 2024 version files, 11.04 KB
- blastdb.zip, 8.40 KB
- README.md, 2.63 KB
Abstract
Eukaryotic genomes harbor sequences derived from non-retroviral RNA viruses, known as endogenous viral elements (EVEs) or non-retroviral integrated RNA virus sequences (NIRVS). These sequences represent a record of past infections and have been implicated in host anti-viral response. We have created a program to identify viral sequences integrated in a host genome. It begins with a specimen BAM file and, with minimal intermediary intervention, outputs candidate NIRVS along with putative host insertion sites and overlapping genomic features of the host genome, in XML and visual formats. We ran short-read data derived from the genomes of 222 wild-caught Aedes aegypti mosquitoes, from a dozen geographical regions, through this software and located putative NIRVS from seven virus families. This program is as accurate as currently available software for NIRVS detection and represents a significant improvement in adaptability and user-friendliness. Furthermore, the flexibility of this pipeline allows the user to search for sequence integrations across the genome of any organism, as long as a query sequence database and a reference genome are provided. Potential extended applications include the identification of integrated transgenic sequences used for research or vector control strategies.
Molecular Ecology Resources, to appear
Authors
Jessen Havill
Dept. of Computer Science, Bucknell University
Lewisburg, PA 17837
Olivia Strasburg
Dept. of Biomedical Engineering, University of Virginia
Charlottesville, VA 22904
Tessy Udoh
Dept. of Computer Science, Denison University
Granville, OH 43023
Jacob E. Crawford
Verily Life Sciences
Mountain View, CA 94043
Andrea Gloria-Soria
Dept. of Entomology, The Connecticut Agricultural Experiment Station
New Haven, CT 06511
blastdb.zip
The blastdb directory contains sample BLAST databases for demonstrating the EVE-X software. The data are geared toward the discovery of NIRVS in Aedes aegypti specimen genomes.
virusdb.*
This BLAST viral database (named virusdb) contains complete sequences for all known RNA virus families that contain arboviruses. Specifically, it contains sequences from the families/orders in the following table:
Family/Order | NCBI taxid |
---|---|
Bromoviridae | 39740 |
Chrysoviridae | 249310 |
Flaviviridae | 11050 |
Nodaviridae | 12283 |
Orthomyxoviridae | 11308 |
Partitiviridae | 11012 |
Picornaviridae | 12058 |
Sedoreovirinae | 689832 |
Togaviridae | 11018 |
Totiviridae | 11006 |
Tymoviridae | 249184 |
Mesoniviridae | 1312872 |
Bunyavirales | 1980410 |
Mononegavirales | 11157 |
It was created using the following command:
makeblastdb -in virusdb_no_retro_or_unverified.fasta -dbtype nucl -parse_seqids -title "virusdb" -out virusdb
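For users who want to rebuild the database from a script, the same command can be assembled in Python. This is a minimal sketch, not part of EVE-X: the helper names `makeblastdb_argv` and `build_db` are hypothetical, and it assumes the BLAST+ `makeblastdb` executable is on the `PATH`. The flags are exactly those in the command above.

```python
import shlex
import shutil
import subprocess


def makeblastdb_argv(fasta, name, dbtype="nucl"):
    """Return the makeblastdb argument list (hypothetical helper, not EVE-X code)."""
    return [
        "makeblastdb",
        "-in", fasta,
        "-dbtype", dbtype,
        "-parse_seqids",
        "-title", name,
        "-out", name,
    ]


def build_db(fasta, name):
    """Run makeblastdb if BLAST+ is installed; return True if it was run."""
    if shutil.which("makeblastdb") is None:
        return False  # BLAST+ not found on PATH; nothing is executed
    subprocess.run(makeblastdb_argv(fasta, name), check=True)
    return True


argv = makeblastdb_argv("virusdb_no_retro_or_unverified.fasta", "virusdb")
print(shlex.join(argv))  # show the command that would be run
```

Calling `build_db("virusdb_no_retro_or_unverified.fasta", "virusdb")` would then run the command only when `makeblastdb` is available.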
virusdb_no_retro_or_unverified.fasta
This is the source FASTA file from which the viral database was created.
aegyptidb.*
This BLAST database was created from the Aedes aegypti AaegL5.0 reference genome with accession numbers NC_035107.1, NC_035108.1, and NC_035109.1.
EVE-X software
The software, written in Python, is available on GitHub, and a snapshot of the software has been uploaded to Zenodo.
Specimen files
We have also deposited all 222 of the specimen BAM files used in our analysis in the SRA under BioProject accession number PRJNA1158798. These BAM files contain only the unmapped reads and their paired-end mates from the original specimen files. EVE-X recognizes this from the filename format and skips the first stage of the program, which would otherwise extract the unmapped reads from the original files.
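To illustrate what "unmapped reads and their paired-end mates" can mean in terms of the SAM format, the sketch below filters SAM text records by their FLAG field. It is not EVE-X's implementation. It assumes the selection corresponds to the standard SAM flag bits 0x4 (read unmapped) and 0x8 (mate unmapped), and it works on SAM text lines rather than on binary BAM. The function names `keep_read` and `filter_sam_lines` are hypothetical.

```python
PAIRED = 0x1         # SAM flag bit: read is part of a pair
READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: this read's mate is unmapped


def keep_read(flag):
    """True if the read is unmapped, or is paired with an unmapped mate."""
    if flag & READ_UNMAPPED:
        return True
    return bool(flag & PAIRED) and bool(flag & MATE_UNMAPPED)


def filter_sam_lines(lines):
    """Yield header lines plus alignment lines that pass keep_read."""
    for line in lines:
        if line.startswith("@"):
            yield line  # keep SAM header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if keep_read(int(fields[1])):  # column 2 of a SAM record is FLAG
            yield line


# Made-up records for illustration; only the first two columns matter here.
sample = [
    "@HD\tVN:1.6",
    "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",      # read and mate unmapped
    "r2\t73\tchr1\t100\t60\t4M\t=\t100\t0\tACGT\tFFFF",  # mapped, mate unmapped
    "r3\t99\tchr1\t200\t60\t4M\t=\t300\t104\tACGT\tFFFF",  # properly paired
]
kept = list(filter_sam_lines(sample))
```

In this example the header, `r1`, and `r2` are retained, while the properly paired read `r3` is dropped.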