Comprehensive annotations of genes, transcripts, and proteins of three pea aphid genome assemblies

Deem, Kevin; Brisson, Jennifer

Published Jan 21, 2026 on Dryad. https://doi.org/10.5061/dryad.s1rn8pknd

Data files

Jan 21, 2026 version files (1.32 GB total)

Ap_gene_annotations.zip (1.32 GB)
README.md (5.42 KB)

Abstract

Accurate genome assembly and annotation are crucial for analyses of duplication and gene family evolution. Short-read genome assemblies can mis-assemble newly duplicated genes, and gene prediction programs can break up, merge, or miss genes, obscuring the true gene content. Here, we leverage transcriptomic data from various life stages, morphs, and sexes of the pea aphid Acyrthosiphon pisum to produce more comprehensive gene annotations for two long-read genome assemblies, as well as a modified version of the reference assembly, corrected at a critical morph-determination locus called api. We integrated three RNA-seq-based transcript assembly methods (Trinity de novo, Trinity genome-guided, and Stringtie) and the ab initio method AUGUSTUS to produce gene models for all three assemblies using PASA. Proteins produced by these gene models were clustered with the pea aphid RefSeq proteins, as well as those from twenty other eukaryotic species, using OrthoFinder. This dataset contains files for all PASA gene models (GTF format), transcripts, proteins, and the assemblies themselves (FASTA format). Additionally, the orthogroup clustering information for all proteins from all methods for all assemblies is provided (TSV format). When these genome annotations are viewed in IGV, clicking on a transcript shows the closest orthologs from each species for each protein predicted to be encoded by that transcript. The transcript and protein files can be used to search for pea aphid orthologs of proteins of interest. These data properly assemble previously mis-assembled genes and reveal a larger-than-expected amount of gene duplication, providing a valuable resource for studying gene family evolution in pea aphids.

Dataset DOI: 10.5061/dryad.s1rn8pknd

Description of the data and file structure

This dataset contains new annotations for three pea aphid genome assemblies: two newly released long-read assemblies for lines homozygous for the WL or WD allele at the api locus on the X chromosome, as well as a modified reference assembly with this region corrected for the WL allele. Annotations were produced by integrating ab initio gene predictions with RNA-Seq-based transcript detection. RNA-Seq data were used from embryonic, nymphal, and adult male and female pea aphids, for both winged and wingless morphs when available. Additionally, protein clustering data for the gene models in both the long-read assemblies and the corrected chromosome-level reference assembly allow for the detection of missing or mis-assembled genes and their duplicates.

Files and variables

File: Ap_gene_annotations.zip

Description: Contains one folder per assembly (Ap_modified_reference, Ap_WD_male, Ap_WL_male) as well as the All_Proteins_Orthogroups.tsv and Naming_conventions.txt files. Each assembly's folder contains the genome sequence (.fasta) and one subfolder per method: AUGUSTUS (Braker [bk]), Trinity de novo (tdn), Trinity genome-guided (tgg), Stringtie (st), and PASA. Each subfolder holds that method's transcripts (.fasta), proteins (.faa), and genomic coordinates (.gtf). An additional annotated version of the PASA genomic coordinates (_annot.gtf) is provided in each assembly's PASA subfolder for viewing functional annotations within IGV.
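As a minimal sketch of navigating this layout, the Python below groups archive member paths by top-level folder and file extension. The example paths are hypothetical and only mirror the folder names and extensions described above; the actual subfolder and file names inside the archive may differ, and the archive may wrap everything in one extra enclosing folder.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_assembly(member_names):
    """Map top-level folder -> file extension -> list of member paths."""
    grouped = defaultdict(lambda: defaultdict(list))
    for name in member_names:
        path = PurePosixPath(name)
        # Skip directory entries and files sitting at the archive root.
        if name.endswith("/") or len(path.parts) < 2:
            continue
        grouped[path.parts[0]][path.suffix].append(name)
    return grouped


def list_archive(zip_path):
    """Read member names from a zip archive on disk and group them."""
    with zipfile.ZipFile(zip_path) as archive:
        return group_by_assembly(archive.namelist())


# Hypothetical member names, for illustration only.
example = [
    "Ap_WL_male/",
    "Ap_WL_male/genome.fasta",
    "Ap_WL_male/PASA/models.gtf",
    "Ap_WL_male/PASA/models_annot.gtf",
    "Ap_WD_male/PASA/proteins.faa",
    "Naming_conventions.txt",
]
layout = group_by_assembly(example)
for assembly in sorted(layout):
    for extension in sorted(layout[assembly]):
        print(assembly, extension, len(layout[assembly][extension]))
```

On the downloaded archive, `list_archive("Ap_gene_annotations.zip")` would return the same kind of mapping without extracting the 1.32 GB of contents.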

Code/software

The assemblies were annotated with the ab initio method AUGUSTUS (within Braker3) and the RNA-Seq-based methods Trinity (de novo and genome-guided) and Stringtie. RNA-Seq data were used from embryonic, nymphal, and adult stages of males and females, for both winged and wingless morphs when available.

These programs output different combinations of three file types: transcript sequences (.fasta), protein sequences (.faa), and genomic coordinates (.gtf). Not all programs produced all file types. Where a file type was not output directly by a program, it was derived as follows: protein-coding sequences were predicted with Transdecoder, transcripts were mapped to genomes using gmap, and transcript sequences were extracted from genomic coordinates using gffread. These protein-coding transcript predictions were then clustered into integrated gene models using PASA. Transcripts, proteins, and genomic coordinates are provided for each method and for the integrated PASA gene models, along with the genome sequence files.
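GTF is a tab-separated format with nine columns, the last of which holds `key "value";` attribute pairs. As a minimal sketch of reading the genomic coordinate files, the Python below parses GTF records and counts distinct transcripts per gene. It assumes the provided .gtf files carry the `gene_id` and `transcript_id` attributes that standard GTF requires, which has not been checked against this dataset; the example records and identifiers are hypothetical.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTRIBUTE_PATTERN = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Return a dict for one GTF record, or None for comments and malformed lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    seqname, _source, feature, start, end, _score, strand, _frame, attrs = fields
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTRIBUTE_PATTERN.findall(attrs)),
    }


def transcripts_per_gene(lines):
    """Map each gene_id to the set of transcript_ids seen for it."""
    genes = defaultdict(set)
    for line in lines:
        record = parse_gtf_line(line)
        if record is None:
            continue
        attributes = record["attributes"]
        if "gene_id" in attributes and "transcript_id" in attributes:
            genes[attributes["gene_id"]].add(attributes["transcript_id"])
    return genes


# Hypothetical records, for illustration only.
example = [
    "# comment line",
    "\t".join(["chrX", "PASA", "exon", "100", "250", ".", "+", ".",
               'gene_id "g1"; transcript_id "g1.t1";']),
    "\t".join(["chrX", "PASA", "exon", "100", "300", ".", "+", ".",
               'gene_id "g1"; transcript_id "g1.t2";']),
    "\t".join(["chrX", "PASA", "exon", "900", "990", ".", "-", ".",
               'gene_id "g2"; transcript_id "g2.t1";']),
]
genes = transcripts_per_gene(example)
```

For a real file, the same function accepts an open file handle, since it only iterates over lines.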

All ORFs longer than 100 amino acids (> 100 aa) from all transcripts for each method, as well as those from the integrated PASA gene models, were clustered with RefSeq proteins from 20 other animal species (mostly insects) using OrthoFinder. The resulting clusters are listed in the provided .tsv file, with naming conventions described in the provided .txt file. Additionally, an annotated version of the genomic coordinates (_annot.gtf) for the PASA gene models was produced for each assembly for viewing in IGV. When viewed in IGV, clicking on a transcript within a PASA cluster displays the first orthogroup member from each species for each ORF in that transcript.
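The same "first member per species" lookup can be sketched outside IGV. The Python below assumes the .tsv file follows the usual OrthoFinder Orthogroups.tsv layout (a header row, an orthogroup ID in the first column, then one column per protein set with comma-separated IDs); this layout is an assumption, so check it against Naming_conventions.txt before use. The column names and protein IDs in the example are hypothetical.

```python
import csv
import io


def load_orthogroups(handle):
    """Return {orthogroup: {column name: [protein IDs]}} from a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    table = {}
    for row in reader:
        if not row:
            continue
        members = {}
        for column, cell in zip(header[1:], row[1:]):
            ids = [item.strip() for item in cell.split(",") if item.strip()]
            if ids:
                members[column] = ids
        table[row[0]] = members
    return table


def first_members_for(protein_id, table):
    """Find the orthogroup containing protein_id and return the first member per column."""
    for orthogroup, members in table.items():
        if any(protein_id in ids for ids in members.values()):
            return orthogroup, {col: ids[0] for col, ids in members.items()}
    return None, {}


# Hypothetical table contents, for illustration only.
example_tsv = (
    "Orthogroup\tAp_WL_male_PASA\tDrosophila_melanogaster\n"
    "OG0000001\tpasa_1.p1, pasa_2.p1\tNP_000001.1\n"
    "OG0000002\tpasa_3.p1\t\n"
)
table = load_orthogroups(io.StringIO(example_tsv))
orthogroup, firsts = first_members_for("pasa_2.p1", table)
```

The linear scan in `first_members_for` keeps the sketch short; for repeated queries against the full table, building a protein-to-orthogroup dictionary once would be faster.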

Access information

Data was derived from the following sources:

Genomic sequence data

Unmodified reference assembly: GCF_005508785.2
WL allele assembly long read data: SRR8306868
WD allele assembly long read data: SRR10028116

Transcriptomic sequence data

Asexual Female Ovary Stage 0-17 Embryo SRR073574, SRR073575
F1 Sexual Female adult SRR1239441
F1 WL Asexual Female adult SRR1239440
F1 WD Asexual Female adult SRR1239439
F1 WL Male adult SRR1239447
F1 WD Male adult SRR1239446
Roc1 WL Asexual Female Stage 18 Embryo SRR9877495, SRR9877496, SRR9877497, SRR9877498
Roc1 WL Asexual Female Stage 20 Embryo SRR9877642, SRR9877643, SRR9877648, SRR9877649
F1 WL Asexual Female 1st instar nymph SRR32079974, SRR32080032, SRR32080031
F1 WL Asexual Female 2nd instar nymph SRR32080030, SRR32080029, SRR32080028
F1 WL Asexual Female 3rd instar nymph SRR32080027, SRR32080026, SRR32080025
Roc1 WD Asexual Female Stage 18 Embryo SRR9877491, SRR9877492, SRR9877493, SRR9877494
Roc1 WD Asexual Female Stage 20 Embryo SRR9877644, SRR9877645, SRR9877646, SRR9877647
F1 WD Asexual Female 1st instar nymph SRR32080034, SRR32080033, SRR32080022
F1 WD Asexual Female 2nd instar nymph SRR32080011, SRR32080000, SRR32079989
F1 WD Asexual Female 3rd instar nymph SRR32079978, SRR32079976, SRR32079975
409 WL Male Early Embryo SRR36435213, SRR36435212, SRR36435211
409 WL Male Late Embryo SRR36435210, SRR36435219, SRR36435218
409 WL Male 1st instar nymph SRR32080024, SRR32080023, SRR32080021
409 WL Male 2nd instar nymph SRR32080020, SRR32080019, SRR32080018
409 WL Male 3rd instar nymph SRR32080017, SRR32080016, SRR32080015
412 WD Male Early Embryo SRR36435221, SRR36435220, SRR36435217
412 WD Male Late Embryo SRR36435216, SRR36435215, SRR36435214
412 WD Male 1st instar nymph SRR32080014, SRR32080013, SRR32080012
412 WD Male 2nd instar nymph SRR32080010, SRR32080009, SRR32080008
412 WD Male 3rd instar nymph SRR32080007, SRR32080006, SRR32080005
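To collect the run accessions from the transcriptomic listing above programmatically, one possible sketch in Python splits each line into its sample description and its SRR accessions. It assumes every accession has the form `SRR` followed by digits, which holds for the lines shown here; the function name is illustrative only.

```python
import re

# SRA run accessions as they appear in the listing: "SRR" followed by digits.
RUN_PATTERN = re.compile(r"SRR\d+")


def parse_sample_line(line):
    """Split a listing line into (sample description, [run accessions])."""
    runs = RUN_PATTERN.findall(line)
    if not runs:
        return None
    description = line[: line.index(runs[0])].strip()
    return description, runs


# Lines copied from the listing above.
listing = [
    "F1 Sexual Female adult SRR1239441",
    "409 WL Male Early Embryo SRR36435213, SRR36435212, SRR36435211",
]
samples = dict(parse_sample_line(line) for line in listing)
```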