Comprehensive annotations of genes, transcripts, and proteins of three pea aphid genome assemblies
Data files
Jan 21, 2026 version files (1.32 GB)
- Ap_gene_annotations.zip (1.32 GB)
- README.md (5.42 KB)
Abstract
Accurate genome assembly and annotation are crucial for analyses of duplication and gene family evolution. Short-read genome assemblies can mis-assemble newly duplicated genes, and gene prediction programs can break up, merge, or miss genes, obscuring the true gene content. Here, we leverage transcriptomic data from various life stages, morphs, and sexes of the pea aphid Acyrthosiphon pisum to produce more comprehensive gene annotations for two long-read genome assemblies, as well as a modified version of the reference assembly, corrected at a critical morph-determination locus called api. We integrated three RNA-seq-based transcript assembly methods (Trinity de novo, Trinity genome-guided, and Stringtie) and the ab initio method AUGUSTUS to produce gene models for all three assemblies using PASA. Proteins produced by these gene models were clustered with the pea aphid RefSeq proteins, as well as those from twenty other eukaryotic species, using OrthoFinder. This dataset contains files for all PASA gene models (GTF format), transcripts, proteins, and the assemblies themselves (FASTA format). Additionally, the orthogroup clustering information for all proteins from all methods for all assemblies is provided (TSV format). When these genome annotations are viewed in IGV, clicking on a transcript shows the closest orthologs from each species for each protein predicted to be encoded by that transcript. The transcript and protein files can be used to search for pea aphid orthologs of proteins of interest. These data properly assemble previously mis-assembled genes and reveal a larger-than-expected amount of gene duplication, providing a valuable resource for studying gene family evolution in pea aphids.
Dataset DOI: 10.5061/dryad.s1rn8pknd
Description of the data and file structure
This dataset contains new annotations for three pea aphid genome assemblies: two newly released long-read assemblies for lines homozygous for the WL or WD allele at the api locus on the X chromosome, as well as a modified reference assembly with this region corrected for the WL allele. Annotations were produced by integrating ab initio gene predictions with RNA-Seq-based transcript detection. RNA-Seq data were used from embryonic, nymphal, and adult male and female pea aphids, for both winged and wingless morphs when available. Additionally, protein clustering data for the gene models in both the long-read assemblies and the corrected chromosome-level reference assembly allow for the detection of missing or mis-assembled genes and their duplicates.
Files and variables
File: Ap_gene_annotations.zip
Description: Contains one folder per assembly (Ap_modified_reference, Ap_WD_male, Ap_WL_male) as well as the All_Proteins_Orthogroups.tsv and Naming_conventions.txt files. Each assembly's folder contains the .fasta genome sequence, as well as subfolders for AUGUSTUS (Braker [bk]), Trinity de novo (tdn), Trinity genome-guided (tgg), Stringtie (st), and PASA, each holding transcripts (.fasta), proteins (.faa), and genomic coordinates (.gtf). An additional annotated version of the PASA genomic coordinates (_annot.gtf) is provided in each assembly's PASA subfolder for viewing functional annotations within IGV.
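As a quick check of the folder layout described above, the archive contents can be tallied without extracting them. The following is a minimal sketch using only the Python standard library. The function name summarize_archive is hypothetical, and the sketch assumes the assembly folders sit at the top level of the zip; if the archive wraps everything in a single parent folder, the counts will be grouped under that folder instead.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Count files per (top-level folder, file extension) in a zip archive.

    zip_path may be a filesystem path or a binary file-like object.
    Files stored at the archive root are grouped under ".".
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, count files only
            path = PurePosixPath(info.filename)
            top = path.parts[0] if len(path.parts) > 1 else "."
            counts[(top, path.suffix)] += 1
    return counts
```

Calling summarize_archive("Ap_gene_annotations.zip") and printing the sorted items should show, for each assembly folder, how many .fasta, .faa, and .gtf files it holds.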
Code/software
The assemblies were annotated with the ab initio method AUGUSTUS (within Braker3) and the RNA-Seq-based methods Trinity (de novo and genome-guided) and Stringtie. RNA-Seq data were used from embryonic, nymphal, and adult stages of males and females, for both winged and wingless morphs when available.
These programs output different combinations of three file types: transcript sequences (.fasta), protein sequences (.faa), and genomic coordinates (.gtf). Not all programs produced all file types. Where a particular file type was not directly output by the aforementioned software, protein-coding sequences were predicted with Transdecoder, transcripts were mapped to genomes using gmap, and transcript sequences were extracted from genomic coordinates using gffread. These various protein-coding transcript predictions were clustered into integrated gene models using the software PASA. All transcripts, proteins, and genomic coordinates are provided in this dataset for each method and for the integrated PASA gene models, as well as the genome sequence files.
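To see how transcripts relate to gene models in the provided .gtf files, the coordinates can be grouped by gene. The sketch below assumes the files follow the standard GTF layout (nine tab-separated columns, with gene_id and transcript_id given as quoted key-value pairs in the ninth column). This has not been checked against the files in this dataset, and the function name transcripts_by_gene is hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def transcripts_by_gene(gtf_lines):
    """Map each gene_id to the set of transcript_ids seen in GTF lines.

    gtf_lines is any iterable of text lines, e.g. an open file handle.
    Comment lines, blank lines, and lines without nine columns are skipped.
    """
    genes = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene = attrs.get("gene_id")
        transcript = attrs.get("transcript_id")
        if gene and transcript:
            genes[gene].add(transcript)
    return genes
```

Passing an open handle to one of the PASA .gtf files returns a dictionary in which genes with more than one transcript ID correspond to gene models with multiple isoforms.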
All ORFs (> 100 aa) produced by all transcripts for each method, as well as those in integrated gene models produced by PASA, were clustered with RefSeq proteins from 20 other animal species (mostly insects) using OrthoFinder. The clustering of these protein sequences can be found in the provided .tsv file, with naming conventions described in the provided .txt file. Additionally, an annotated version of the genomic coordinates (_annot.gtf) for the PASA gene models was produced for each assembly for viewing in IGV. When viewed in IGV, clicking on a transcript within a PASA cluster displays the first orthogroup member from each species for each ORF in that transcript.
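One way to search the clustering table for a protein of interest is sketched below. It assumes All_Proteins_Orthogroups.tsv follows the usual OrthoFinder Orthogroups.tsv layout: a header row, the orthogroup ID in the first column, and one column per species or annotation source holding comma-separated protein IDs. That layout is an assumption and should be confirmed against Naming_conventions.txt; the function name find_orthogroups is hypothetical.

```python
def find_orthogroups(tsv_lines, query):
    """Return (orthogroup_id, members) for each row that contains query.

    tsv_lines is any iterable of text lines, e.g. an open file handle.
    members maps each non-empty column name to its list of protein IDs.
    Matching is exact on the protein ID, not a substring search.
    """
    lines = iter(tsv_lines)
    header_line = next(lines, None)
    if header_line is None:
        return []
    header = header_line.rstrip("\n").split("\t")
    hits = []
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if not row or not row[0]:
            continue
        members = {}
        for column, cell in zip(header[1:], row[1:]):
            ids = [p.strip() for p in cell.split(",") if p.strip()]
            if ids:
                members[column] = ids
        if any(query in ids for ids in members.values()):
            hits.append((row[0], members))
    return hits
```

Given a protein ID taken from one of the .faa files, the returned members dictionary lists the other proteins in the same orthogroup, grouped by column.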
Access information
Data were derived from the following sources:
Genomic sequence data
- Unmodified reference assembly: GCF_005508785.2
- WL allele assembly long read data: SRR8306868
- WD allele assembly long read data: SRR10028116
Transcriptomic sequence data
- Asexual Female Ovary Stage 0-17 Embryo SRR073574, SRR073575
- F1 Sexual Female adult SRR1239441
- F1 WL Asexual Female adult SRR1239440
- F1 WD Asexual Female adult SRR1239439
- F1 WL Male adult SRR1239447
- F1 WD Male adult SRR1239446
- Roc1 WL Asexual Female Stage 18 Embryo SRR9877495, SRR9877496, SRR9877497, SRR9877498
- Roc1 WL Asexual Female Stage 20 Embryo SRR9877642, SRR9877643, SRR9877648, SRR9877649
- F1 WL Asexual Female 1st instar nymph SRR32079974, SRR32080032, SRR32080031
- F1 WL Asexual Female 2nd instar nymph SRR32080030, SRR32080029, SRR32080028
- F1 WL Asexual Female 3rd instar nymph SRR32080027, SRR32080026, SRR32080025
- Roc1 WD Asexual Female Stage 18 Embryo SRR9877491, SRR9877492, SRR9877493, SRR9877494
- Roc1 WD Asexual Female Stage 20 Embryo SRR9877644, SRR9877645, SRR9877646, SRR9877647
- F1 WD Asexual Female 1st instar nymph SRR32080034, SRR32080033, SRR32080022
- F1 WD Asexual Female 2nd instar nymph SRR32080011, SRR32080000, SRR32079989
- F1 WD Asexual Female 3rd instar nymph SRR32079978, SRR32079976, SRR32079975
- 409 WL Male Early Embryo SRR36435213, SRR36435212, SRR36435211
- 409 WL Male Late Embryo SRR36435210, SRR36435219, SRR36435218
- 409 WL Male 1st instar nymph SRR32080024, SRR32080023, SRR32080021
- 409 WL Male 2nd instar nymph SRR32080020, SRR32080019, SRR32080018
- 409 WL Male 3rd instar nymph SRR32080017, SRR32080016, SRR32080015
- 412 WD Male Early Embryo SRR36435221, SRR36435220, SRR36435217
- 412 WD Male Late Embryo SRR36435216, SRR36435215, SRR36435214
- 412 WD Male 1st instar nymph SRR32080014, SRR32080013, SRR32080012
- 412 WD Male 2nd instar nymph SRR32080010, SRR32080009, SRR32080008
- 412 WD Male 3rd instar nymph SRR32080007, SRR32080006, SRR32080005
