We hypothesized that fusion of genes acquired via horizontal gene transfer (HGT) with endogenous sequences in arthropod genomes might generate what we call “HGT-chimeras”: genes with regions of non-metazoan and metazoan descent in the same open reading frame. This dataset supports the study of these HGT-chimeras presented in our manuscript “Evolutionary innovation through fusion of sequences from across the tree of life”. It includes input data and intermediate output files used in our HGT-chimera detection pipeline, as well as in the downstream bioinformatic characterization of these genes. The repository contains FASTA files of protein sequences, clustering results, phylogenetic trees, and tabular summaries of inferred HGT-chimeras, along with downstream analyses describing sequence molecular evolution (dN/dS), phylogenetic origin, gene expression, and domain architecture. Files are organized to correspond with steps in the associated GitHub pipeline, beginning with input clustering data (mmseq_cluster_representatives_with_missing.fasta) and concluding with analyses of representative HGT-chimeras highlighted in the manuscript’s figures. These data can be reused to validate our findings, extend analyses of discovered HGT-chimeras, or adapt the included pipeline for other genomic datasets. No ethical or legal restrictions apply to the data, which are derived from available genome assemblies and annotation data on NCBI.

Dataset DOI: 10.5061/dryad.t1g1jwtdz

Description of the data and file structure

Full details of data processing and analysis are described in the accompanying manuscript and GitHub repository.

Files and variables

mmseq_cluster_representatives_with_missing.fasta

FASTA file of 610,359 proteins used as input to the HGT-chimera detection pipeline. Obtained via MMseqs2 clustering of proteins from 319 RefSeq arthropod genome annotations, supplemented with 11 proteins from the same annotations that were recovered in a previous pilot iteration of this pipeline. FASTA headers are set as "genome accession;protein accession".
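As an illustration of the header convention described above, the following is a minimal Python sketch for reading such a FASTA file and splitting each header into genome and protein accessions. The function names and the example record are hypothetical and are not part of the pipeline.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_header(header: str) -> Tuple[str, str]:
    """Split 'genome accession;protein accession' into its two parts."""
    genome, protein = header.split(";", 1)
    return genome.strip(), protein.strip()


# Made-up record used only to demonstrate the header layout.
example = [">GCF_000000000.1;XP_000000000.1", "MKT", "LLV"]
records = [(split_header(h), s) for h, s in read_fasta(example)]
```

For the real file, the same functions can be applied to an open file handle, for example open("mmseq_cluster_representatives_with_missing.fasta").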

round1_diamond_output.tar.gz

Tabular output of DIAMOND BLASTp search of mmseq_cluster_representatives_with_missing.fasta vs NR, with standard DIAMOND BLASTp outfmt -6 column fields, as described in https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options. Split to have one TSV per query.
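The per-query TSVs can be read with a few lines of Python. The sketch below assumes the default 12 tabular columns (qseqid through bitscore) and no header row; if the files carry additional columns, they are ignored here, and a line with fewer than 12 columns raises a KeyError. The example hit line is made up.

```python
import csv
import io

# Default 12 columns of DIAMOND/BLAST tabular output (outfmt 6).
DIAMOND_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def read_diamond_tsv(handle):
    """Return a list of dicts, one per hit, with numeric fields converted."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record:
            continue
        # zip() stops at 12 fields, so any extra columns are dropped.
        row = dict(zip(DIAMOND_FIELDS, record))
        for key in INT_FIELDS:
            row[key] = int(row[key])
        for key in FLOAT_FIELDS:
            row[key] = float(row[key])
        rows.append(row)
    return rows


# Made-up hit line used only to demonstrate the column layout.
example = io.StringIO(
    "GCF_000000000.1;XP_000000000.1\tWP_000000000.1\t45.2\t210\t100\t3"
    "\t5\t210\t12\t220\t1.5e-40\t160.0\n"
)
hits = read_diamond_tsv(example)
best = min(hits, key=lambda h: h["evalue"])  # hit with the lowest e-value
```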

interval_demarcation_results.txt

Results of interval demarcation algorithm on round 1 DIAMOND BLASTp. Each line contains a single protein accession, followed by a list of intervals that are labeled "Meta" for those inferred to be of ancient metazoan ancestry and "HGT" for those to be of non-metazoan/HGT ancestry.

split_intervals.fasta

FASTA file with slices of protein queries corresponding to demarcated intervals extended by +/-10 positions, used as queries in round 2 DIAMOND BLASTp. FASTA headers are set as protein_accession;genome_accession;annotation_(start,end), where annotation = HGT or Meta and start/end = start/stop position of the interval (not including the +/-10 extension).
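The header layout described above can be parsed with a regular expression. This is a minimal sketch that assumes the header follows the stated pattern exactly, with integer coordinates; the function name and the example header are hypothetical.

```python
import re

# protein_accession;genome_accession;annotation_(start,end)
HEADER_RE = re.compile(
    r"^(?P<protein>[^;]+);(?P<genome>[^;]+);"
    r"(?P<annotation>HGT|Meta)_\((?P<start>\d+),\s*(?P<end>\d+)\)$"
)


def parse_interval_header(header: str) -> dict:
    """Parse a split_intervals.fasta-style header into its components."""
    match = HEADER_RE.match(header.lstrip(">").strip())
    if match is None:
        raise ValueError(f"Unrecognized interval header: {header!r}")
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    return fields


# Made-up header used only to demonstrate the layout.
parsed = parse_interval_header(">XP_000000000.1;GCF_000000000.1;HGT_(12,230)")
```

Note that the coordinates in the header are the interval start/stop themselves, so the sequence in each record can be longer than end - start + 1.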

round2_diamond_output_split.tar.gz

Tabular output of DIAMOND BLASTp search of split_intervals.fasta vs NR, with standard DIAMOND BLASTp outfmt -6 column fields, as described in https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options. Split to have one TSV per query.

round2_blast_statistics_hgt_intervals.tsv

Table containing round 2 BLASTp statistics supporting HGT annotations for 768 HGT intervals. Column descriptions are as follows: "bitscore_max" maximum bitscore to a non-metazoan hit; "min_evalue" minimum e-value to a non-metazoan hit; "evalue_min_cov" length coverage of the query interval by the minimum e-value non-metazoan hit; "evalue_min_title" sequence title of the minimum e-value hit; "evalue_min_sciname" scientific species name of the minimum e-value hit; "evalue_min_phylum" taxonomic phylum of the minimum e-value hit; "evalue_min_kingdom" taxonomic kingdom of the minimum e-value hit; "n_meta_hits" number of non-arthropod metazoan hits; "n_hgt_taxids" number of unique non-metazoan NCBI taxids among the hits; "p_HGT300" proportion of non-metazoan NCBI taxids among the top 300 non-arthropod hits by lowest e-value; "AI" maximum alienness index of the non-metazoan hits; "N_AI>5" number of unique taxids with hits having alienness index >5. Hits were filtered to include those with >30% query coverage (cov) and to exclude rotifer hits before tabulation.
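This README does not define the alienness index. The sketch below assumes the commonly used definition from Gladyshev et al. (2008), in which the index is the difference between the natural logs of the best metazoan and best non-metazoan e-values, each offset by 1e-200 to avoid taking the log of zero. How the pipeline selects the e-values entering the "AI" column is described in the manuscript and GitHub repository, not here; function and parameter names are hypothetical.

```python
import math


def alienness_index(best_metazoan_evalue: float,
                    best_nonmetazoan_evalue: float,
                    pseudocount: float = 1e-200) -> float:
    """Alienness index under the assumed Gladyshev et al. (2008) definition.

    Positive values mean the non-metazoan hit has the lower (better) e-value.
    """
    return (math.log(best_metazoan_evalue + pseudocount)
            - math.log(best_nonmetazoan_evalue + pseudocount))


def metazoan_index(best_metazoan_evalue: float,
                   best_nonmetazoan_evalue: float) -> float:
    """Negative of the alienness index, matching the 'MI' column description."""
    return -alienness_index(best_metazoan_evalue, best_nonmetazoan_evalue)


# Made-up e-values: the non-metazoan hit is 40 orders of magnitude better.
ai = alienness_index(1e-10, 1e-50)
```

Under this definition an e-value of exactly 0 is handled by the pseudocount, and equal e-values give an index of 0.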

round2_blast_statistics_meta_intervals.tsv

Table containing round 2 BLASTp statistics supporting metazoan annotations for 838 Meta intervals. Column descriptions are as follows: "bitscore_max" maximum bitscore to a non-arthropod metazoan hit; "min_evalue" minimum e-value to a non-arthropod metazoan hit; "evalue_min_cov" length coverage of the query interval by the minimum e-value non-arthropod metazoan hit; "evalue_min_title" sequence title of the minimum e-value non-arthropod metazoan hit; "evalue_min_sciname" scientific species name of the minimum e-value non-arthropod metazoan hit; "evalue_min_phylum" taxonomic phylum of the minimum e-value non-arthropod metazoan hit; "evalue_min_kingdom" taxonomic kingdom of the minimum e-value non-arthropod hit; "n_meta_hits" number of non-arthropod metazoan hits; "n_hgt_taxids" number of unique non-metazoan NCBI taxids among the hits; "p_Meta300" proportion of non-metazoan NCBI taxids among the top 300 non-arthropod hits by lowest e-value; "MI" maximum metazoan index of the non-metazoan hits (the negative of the alienness index); "N_MI>5" number of unique taxids with hits having metazoan index >5.

round2_chimera_intervals.txt

.txt representation of a dictionary mapping chimera names to their metazoan/HGT intervals for the 525 chimeras passing round 2 BLASTp-based filtering.
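Several files in this dataset are described as ".txt representations of a dictionary". A minimal sketch for loading one, assuming the file holds a single Python dict literal (if it is JSON instead, json.loads would be the appropriate call). The key and value layout in the example string is hypothetical and only illustrates the idea.

```python
import ast


def load_interval_dict(text: str) -> dict:
    """Safely evaluate a dict literal stored as text, without executing code."""
    data = ast.literal_eval(text.strip())
    if not isinstance(data, dict):
        raise ValueError("Expected the file to contain a single dictionary")
    return data


# Made-up content; the real value structure may differ.
example_text = (
    "{'GCF_000000000.1;XP_000000000.1': "
    "{'HGT': [(12, 230)], 'Meta': [(250, 480)]}}"
)
intervals = load_interval_dict(example_text)
n_chimeras = len(intervals)
```

For a real file, pass the result of open(path).read() to load_interval_dict.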

hmmbuild.tar.gz

Directory contains profile HMMs and upstream outputs for all HGT and metazoan intervals of 525 round 2 BLASTp-validated chimeras. Structure of directory is protein_name/interval_name/data. Data files are as follows: "seq.fasta" full-length sequences with headers selected as genome;protein accession; "unique_seq.fasta" full-length sequences with redundant sequences removed; "sub_seq.fasta" FASTA with separated intervals from BLAST hit coordinates; "MSA_sub_seq.fasta" MUSCLE alignment built from "sub_seq.fasta"; "sub_seq.hmm" profile HMM from hmmbuild on "MSA_sub_seq.fasta".

hmmsearch_v_nr.tar.gz

Directory contains parsed hmmsearch "domtblout" results for search of interval profile HMMs vs the NR database. Column fields are standard domtblout fields described in detail at http://eddylab.org/software/hmmer/Userguide.pdf. Additional field “species” reports the species of the hit sequence.

hmmsearch_v_arthropod.gz

Directory contains parsed hmmsearch "domtblout" results for search of interval profile HMMs vs the arthropod database. Fields as in hmmsearch_v_nr, with an additional NCBI taxid field.

round2_chimeras_cdd_search.txt

Outputs of NCBI web-server CDD search on full-length sequences for 525 chimeras that passed round 2 BLAST filtering. Used to identify ankyrin repeats for exclusion. Standard output fields: “Query” RefSeq protein accession; “Hit type” type of CDD model with a hit; “PSSM-ID” numerical ID of hit; “From” protein coordinates of start; “To” protein coordinates of end; “E-Value” e-value of hit; “Bitscore” of hit; “Accession” CDD accession of hit; “Short name” name of hit; “Incomplete” domain C for complete/- for incomplete; “Superfamily” of hit.

censor_repbase_hits.tsv

Parsed tabular HTML output of CENSOR transposable element (TE) annotations for full-length chimeras passing the ankyrin repeat filter. Column fields correspond to those described here with the addition of “TE_description” string description of TE hit and class; “TE_species” taxonomic species of Repbase hit; “TE_lineage” full taxonomic lineage of Repbase hit.

repbase_cds_translations.fasta

Protein FASTA file parsed from the HTML output of CENSOR transposable element annotation. Contains all CDS hits returned by CENSOR for all chimeras. Used for DIAMOND BLASTp to compute bitscores.

TE_bitscore_comparison.tsv

Table summarizing comparison of DIAMOND BLASTp hits of TEs vs top non-metazoan hits in NR. Column fields are as follows: “query interval” HGT interval name; “query species” HGT interval species; “TE_ID” Repbase ID of top TE hit; “TE_description” string description of top TE hit and class; “TE_pident” percent identity with top TE hit; “TE_bitscore” bitscore of top TE hit; “non_meta_top_stitle” sequence name of top non-metazoan hit in NR; “non_meta_top_pident” percent identity of top non-metazoan hit in NR; “non_meta_top_bitscore” bitscore of top non-metazoan hit in NR; “bit_ratio” = non_meta_top_bitscore/TE_bitscore.
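The "bit_ratio" column can be recomputed from the two bitscore columns as a consistency check. The sketch below uses made-up rows and only the three columns needed; it assumes the file is tab-separated with a header row carrying the column names listed above.

```python
import csv
import io


def bit_ratio(non_meta_top_bitscore: float, te_bitscore: float) -> float:
    """bit_ratio = non_meta_top_bitscore / TE_bitscore, as defined above."""
    if te_bitscore <= 0:
        raise ValueError("TE_bitscore must be positive")
    return non_meta_top_bitscore / te_bitscore


# Made-up rows used only to demonstrate the calculation.
example = io.StringIO(
    "query interval\tTE_bitscore\tnon_meta_top_bitscore\n"
    "interval_A\t150.0\t300.0\n"
    "interval_B\t200.0\t50.0\n"
)
ratios = {
    row["query interval"]: bit_ratio(
        float(row["non_meta_top_bitscore"]), float(row["TE_bitscore"])
    )
    for row in csv.DictReader(example, delimiter="\t")
}
```

A ratio above 1 means the top non-metazoan NR hit scores higher than the top TE hit; a ratio below 1 means the TE hit scores higher.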

transposon_ankyrin_filtered_round2_chimera_intervals.txt

.txt representation of a dictionary mapping chimeras (with keys "genome_accession; protein_accessions") to lists of their HGT- and metazoan ("Meta")-derived intervals, specified as protein coordinates (start, end).

blast_result_plots_no_arthropod.tar.gz

BLASTp plots for the 258 chimeras passing round 1 and round 2 BLASTp filtering criteria with exclusion of TEs and ankyrin repeats. X-axis indicates position of hit relative to query protein, while y-axis reports the log-transformed e-values. Hit sequences are colored according to their taxonomic origin. Note that these plots exclude arthropod hits.

Manual_Inspection_of_Blast_Plots.tsv

Tabular output file containing the results of manual inspection of BLASTp plots in blast_result_plots_no_arthropod for the 258 chimeras passing round 1 and round 2 BLAST filtering. Column descriptions are as follows: “Protein”: “genome_accession;protein name”; “annot”: “Yes” if passed manual inspection; “Note”: justification for elimination, if any; “HGT_intervals”: numerical coordinates of HGT-demarcated intervals; “Meta_intervals”: numerical coordinates of metazoan-demarcated intervals.

arthropoda.accessions

Text containing arthropod protein accessions extracted from a BLAST-indexed local copy of NR. Used for taxonomic filtering in BLAST dataset construction.

suppressed_aedes_albopictus.fa

A FASTA file of XP_029735553.1, a secondary chimera of XP_021699539.1 that was recovered in the first pipeline iteration run on A. albopictus annotation release GCF_006496715.1 but was later marked as a lncRNA in the A. albopictus annotation release current at the time of writing. We confirmed its expression and sequence via RT-PCR and Sanger sequencing after the first iteration, so we manually added XP_029735553.1 back for consideration as a secondary chimera of XP_021699539.1.

previous_iteration_chimeras.fa

A FASTA file of chimeras recovered in the initial pipeline iteration. This was used to prioritize previously recovered representative chimeras in selection of cluster representative chimeras.

previous_iteration_secondary_chimeras.txt

.txt representation of a dictionary mapping previously recovered secondary chimeras to their primary chimeras. This is used to prioritize previously recovered secondary chimeras in selection of secondary chimeras for phylogenetic inference.

secondary_chimera_adjacency_list.txt

.txt representation of a dictionary mapping primary chimeras to their secondary arthropod chimeras selected from BLASTp and hmmsearch hits.

clustering_representative_seqs.txt

.txt representation of a dictionary mapping representative primary chimeras to other primary chimeras in the same orthologous cluster (primary chimera=independent sequence passing all upstream BLASTp, ankyrin, TE filters).

clustered_ankyrin_transposon_secondary_filtered_chimeras.txt

.txt representation of a dictionary mapping representative primary chimeras to their HGT and metazoan intervals for the 258 chimeras that passed all upstream filters and secondary chimera interval BLASTp checks.

clustered_ankyrin_transposon_secondary_filtered_chimeras.tsv

Tabular data with taxonomic information of the 258 chimeras passing clustering/secondary chimera and upstream filters. Fields are as follows: “chimera” representative chimera accession; “n_species” number of species chimera is found in (including secondary chimeras); “span” lowest taxonomic rank including all representative & secondary chimeras; “secondary_chimera_species” all species that representative & secondary chimeras are found in; “secondary_chimera_sequences” accessions of representative & secondary chimeras; “HGT_intervals” interval coordinates of HGT interval; “Meta_intervals” interval coordinates of metazoan interval in representative sequence; “cd-hit” NCBI CDD search results; “og” representative chimera found in original pipeline iteration.

phylogenetic_data_filtered.tar.gz

Directory containing phylogenetic datasets for 377 intervals. Each folder is named as follows: “genome accession; protein accession; HGT/Meta annot_(interval)”. Contents are as follows: all_sequences.fa: FASTA file with all extracted protein subintervals from BLAST/hmmer hits; combined_sequences_data.tsv tabular file with BLASTp or hmmer standard outputs for all sequences in the phylogeny along with taxonomic information; MSA.fasta: MUSCLE alignment of all_sequences.fa; trimmed_MSA.fasta: trimAl trimmed alignment used for iqtree inference; ml_tree.iqtree maximum likelihood iqtree output file; final_rooted.tree minimum ancestor deviation tree from ml_tree.iqtree; itol_color_strop.txt itol annotation file with color labels by taxon; itol_taxonomic_info.txt itol annotation file with text labels by taxonomy.

symbiogenomesdb.tsv

Download of arthropod symbionts from http://symbiogenomesdb.uv.es/ on 6/29/2025. Columns: node: NCBI taxid, name: symbiont scientific name.

Tree_manual_inspection_HGT.tsv

Data for manual inspection of HGT interval trees. Columns: Interval: HGTc protein accession and coordinates; Tree_annot: annotation assigned to the tree topology, “Yes” indicates accepted as an HGT; Note: justification for inclusion/exclusion as needed; donor: donor lineage determined by majority rule of sister and cousin branches; sister: taxa of closest non-arthropod sequence relatives; cousin: taxa of second-closest non-arthropod sequence relatives.

Tree_manual_inspection_Metazoan.tsv

Data for manual inspection of Metazoan interval trees. Columns same as Tree_manual_inspection_HGT.tsv, missing a donor column.

same_genome_prot_results.tar.gz

Results of within-genome DIAMOND BLASTp search for 222 "relative" intervals of 104 representative HGT-chimeras. Outputs are TSVs with column headers corresponding to standard BLASTp fields, with the addition of a “gene_name” field to represent the RefSeq accession of the gene from which the protein hit derives; “seqid” to represent the NCBI accession of the chromosome/scaffold the gene is found on; and “start”/”end” columns indicating the coordinates of the gene hit on the target chromosome.

prot_same_genome_distance.tar.gz

Data as in same_genome_prot_results, but with the addition of the following columns: “same_chromosome” indicates whether the gene of the BLASTp hit is found on the same chromosome/scaffold as the query chimera; “overlap_with_interval” shows the length of overlap of the gene hit with the gene of the query chimera (>15 nucleotide overlap is excluded); “distance” provides the distance in number of nucleotides between the gene of the BLAST hit and the gene of the chimera (not considered in manuscript results); “gene_distance” reports the number of intervening genes between the chimera and the hit.

interval_nuc.fasta

A nucleotide FASTA file containing separated sequences for HGT- and metazoan intervals, derived from CDS sequences. Used as an input for GC content and codon use analysis.

dnds.tar.gz

This directory contains raw output files of dN/dS analyses with PAML presented in SI tables 14-16 and Figure 3. Subdirectories with numerical names “1”- “23” contain outputs of whole gene dN/dS analysis for each of 23 HGT-chimera clusters with >1 representative in the main directory, and in internal subdirectories with names formatted “(start,end)” contain the results of fixed/site partition models for the codons corresponding to the amino acids in the range (start, stop) in the representative chimera for the cluster. Subdirectories named with full interval names “genome accession; protein accession; annotation type_(start, end)” contain the results of branch tests for differing dN/dS on chimeric and non-chimeric branches. Within each directory, the “concatenated_nuc.fasta”/”concatenated_prot.fasta” files contain nucleotide/protein sequences analyzed; “MSA_concatenated_prot.fasta" is a MUSCLE protein alignment; “trimmed_MSA_concatenated_prot.fasta” is the trimAl-trimmed alignment used for tree inference and codon alignment generation; “pal2nal.paml” and “pal2nal.codon” contain codon alignments in two different formats; standard outputs of IQ-TREE maximum likelihood tree inference have the prefix “rev_aa” with the final maximum likelihood tree used as an input to PAML named “tree.newick”; remaining files are standard PAML codeml outputs, with the primary output file parsed for the values reported in the manuscript found in “paml_output.out”.

cluster_3_hgt.tar.gz

Outputs of phylogenetic investigation of protein-wide alignment of proteins in HGT-chimera cluster 3 (SI Figure 7B). “Concatenated_prot.fasta” contains protein sequences for 104 cluster protein representatives, aligned to make “MSA_concatenated_prot.fasta”, trimmed with trimAl to make “trimmed_MSA_concatenated_prot.fasta”. Outputs with “rev_aa” prefix are IQ-TREE output files, with the maximum likelihood tree “rev_aa.treefile” annotated with the iTOL annotation files “itol_text.txt” and “tree_colors.txt”.

og_all_protein_tax_info.csv

This CSV contains the outputs of pipeline iteration 1 (not presented in the manuscript), which were used to map sequencing results previously obtained from pipeline iteration 1 to the accessions currently detected under iteration 2. Columns are as follows: “protein accession” of chimeras; “in_search_set”: in an original search set of RefSeq genome annotations; “Metazoan_intervals”: coordinates of intervals of inferred metazoan ancestry; “HGT_intervals”: coordinates of intervals of inferred HGT ancestry; “cluster”: numerical ID of the HGT-chimera cluster (NOTE: NOT necessarily corresponding to current cluster ID); “taxid”: NCBI taxonomic ID of the species; “tax_class”/“tax_order”/“tax_species”: taxonomic class, order, species of the HGT-chimera; “lineage”: a full dictionary specifying the taxonomic lineage of the species from which the chimera is derived.

new_old_mapping.tsv

Mapping between cluster identifiers in pipeline iteration 1 (“old”) and pipeline iteration 2 (“new”).

genbank_translations_may_2025.fa

Translations of sequenced RT-PCR products obtained via Geneious Prime.

PCR_result_alignments.tar.gz

Contains files used to generate the protein alignments between translations of RT-PCR products and expected sequences from RefSeq (“MSA_protein.fa”: MUSCLE alignment, “MSA_protein.png”: image of muscle alignment, “protein.fa”: unaligned protein inputs)

genbank_submission_07_23_2025.tar.gz

Contains files of RT-PCR+Sanger sequencing products (“genbank_submission_07_23_2025_genbank_nucleotide.fa”), predicted translations (“genbank_submission_07_23_2025_genbank_translations.fa”), and metadata specified according to standard GenBank inputs (“genbank_submission_07_23_2025_genbank_submit.TSV”) that were submitted to GenBank on 7/23/2025.

new_old_mapping.csv

Mapping between orthologous cluster identifiers from pipeline iteration 1 ("old") and pipeline iteration 2 ("new"), for clusters recovered in both iterations.

SraRunTable_Eurytemora.csv

Table of RNA-Seq SRA accessions and associated metadata used for differential expression analysis of copepod HGT-chimeras. Column headers are described here: https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=3&WebEnv=MCID_68f015d1d63b225889c6f8d8&o=acc_s%3Aa.

combined_eurytemora_counts.tsv

Combined raw RNA-Seq counts for Eurytemora transcriptomes, used as an input to DESeq2. Rows correspond to genes, columns to samples described in SraRunTable_Eurytemora.csv.

eurytemora_meta.tsv

Sample metadata for DESeq2 analyses. Each row maps an SRA accession to its sample description: "not" is untreated, "ord" is non-symbiotic Vibrio, "F10" is symbiotic Vibrio.

F10_v_not.csv

DESeq2 output for comparison of "F10" and "not" conditions (as in eurytemora_meta.tsv). Column headers are standard DESeq2 output fields as described here: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html.

ord_v_not.csv

DESeq2 output for comparison of "ord" and "not" conditions (as in eurytemora_meta.tsv). Column headers are standard DESeq2 output fields as described here: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html.

peritrophin_MSA.fa

MUSCLE multiple sequence alignment of peritrophin sequences from diverse arthropods.

carbohydrate_esterase_MSAc.fa

MUSCLE multiple sequence alignment of carbohydrate esterase sequences from across the tree of life.

amidase_MSA_protein.fa

MUSCLE multiple sequence alignment of amidase sequences from across the tree of life.

cluster_14_TSA_addition.zip

Zipped folder containing files related to the addition of transcriptome shotgun assembly sequences from copepods to the maximum likelihood trees for the HGT and metazoan intervals of cluster 14, as follows: "Copepod_tsas.csv" has accessions of TSAs and their corresponding species; "domtblout.tsv" files contain hmmsearch results in standard hmmer format for searches of HGT and metazoan profiles against a combined translation database of the TSA files; ".fasta" files contain raw multiple sequence alignments and trimmed alignments for HGT and metazoan intervals with added TSA sequences; ".iqtree" files contain logs of maximum likelihood tree inference; ".treefile" files contain Newick files for trees as presented on iTOL; ".txt" are annotation files for taxonomy entered into iTOL.

penaeus_topology_tests.zip

Zipped folder containing files related to constrained tree topology tests for Penaeus chimera cluster 12, as follows: ".fasta" file with the trimmed multiple sequence alignment for tree construction, "arthropod_constraint.iqtree" files containing IQ-TREE logs for constrained topology tests with alternative constrained topologies in "arthropod_constraint1.txt" and "arthropod_constraint2.txt"; "topo_tests.iqtree" contains Newick trees for three trees, beginning with the unconstrained maximum likelihood (HGT) tree and followed by the two alternative topologies; "topo_tests.iqtree" contains the results of the tree topology (AU) tests.

Code/software

All .tsv or .csv files can be viewed in Google Sheets or Microsoft Excel, and .txt/.fasta files can be viewed in a text editor. Newick files and associated iTOL annotation text files for phylogenies can be visualized in iTOL (https://itol.embl.de/), and have already been uploaded to https://itol.embl.de/shared/rkapoor (under the "Arthropod HGT-chimera interval trees 8/27/2025" tab) for public access.
