Data from: The evolution of protein-coding gene structure in eukaryotes
Data files
Apr 02, 2024 version files 15.28 GB
- DRYAD_README.txt (7.17 KB)
- README.md (7.74 KB)
- RESULT.tar.gz (15.28 GB)
Abstract
Introns are highly prevalent in most eukaryotic genomes. Despite the accumulating evidence for benefits conferred by the possession of introns, their specific roles and functions, as well as the processes shaping their evolution, are still only partially understood. Here we explore the evolution of the eukaryotic gene intron-exon structure by focusing on several key features such as the intron length, the number of introns, and the intron-to-exon ratio of protein-coding genes. We utilize whole genome data from 590 species covering the main eukaryotic taxonomic groups and analyze them within a statistical phylogenetic framework. We found that the basic gene structure differs markedly among the main eukaryotic phyla, with animals, and particularly chordates, displaying intron-rich genes, compared to plants and fungi. Reconstruction of gene structure evolution suggests that these differences had evolved prior to the divergence of the phyla, and have remained mostly conserved within groups. We revisit the previously reported association between the genome size and the mean intron length, and report that the correlation patterns differ considerably among phyla. Our findings suggest that the evolution of introns may be affected by different processes across the eukaryotic tree. The substantial diversity in gene structures may indicate that introns play different molecular and evolutionary roles in different organisms.
README: The evolution of protein-coding gene structure in eukaryotes
https://doi.org/10.5061/dryad.zcrjdfnm1
Data related to the manuscript "The evolution of protein-coding gene structure in eukaryotes", currently in preparation.
Author Information
Corresponding Investigator
Name: Lior Glick
Institution: Tel Aviv University, Tel Aviv, Israel
Email: liorglic@mail.tau.ac.il
Co-investigator 1
Name: Silvia Castiglione
Institution: University of Naples Federico II, Naples, Italy
Co-investigator 2
Name: Gil Loewenthal
Institution: Tel Aviv University, Tel Aviv, Israel
Co-investigator 3
Name: Prof. Pasquale Raia
Institution: University of Naples Federico II, Naples, Italy
Co-investigator 4
Name: Prof. Tal Pupko
Institution: Tel Aviv University, Tel Aviv, Israel
Co-investigator 5
Name: Prof. Itay Mayrose
Institution: Tel Aviv University, Tel Aviv, Israel
Description of the data and file structure
This data set contains an archive of a directory generated using the analysis workflow. It contains all files required for reproducing the post-analyses and obtaining the tables and figures included in the manuscript.
To extract the archive, run the following command:
tar -zxvf RESULT.tar.gz
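Because the archive is large (15.28 GB), it can be useful to check its top-level layout before extracting everything. The following is a minimal Python sketch using the standard-library tarfile module. The function name top_level_entries is an illustrative choice and is not part of the analysis pipeline. Note that the function still has to stream through the whole compressed archive, which may take a while for a file of this size.

```python
import tarfile


def top_level_entries(archive_path):
    """Return the sorted top-level names found in a tar archive."""
    names = set()
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            # Drop empty and "." components, then keep the first real one.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return names and sorted(names) or []
```

Calling top_level_entries("RESULT.tar.gz") should then return a list with the name of the single directory stored in the archive.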
Description of dataset
This DRYAD data set contains the results generated by the analysis pipeline described at: https://github.com/MayroseLab/gene_structure_evolution.
The archive contains a single directory. This directory contains two sub-directories:
per_species - contains analyses performed on each species individually (590 species in total)
Directories within per_species bear the species names, and each directory contains the following files:
- annotation.canon.gff3 - genome annotation downloaded from ENSEMBL (in GFF3 format), where only canonical mRNA features were retained
- annotation.canon.gff3.mRNA_to_canon - a table with the translation between mRNA feature IDs and the respective canonical mRNA ID.
- annotation.canon.gff3.gene_to_mRNA - a table with the translation between gene feature IDs and the respective canonical mRNA ID.
- annotation.canon.introns.gff3 - same as annotation.canon.gff3, but with intron features added
- prot.fasta - protein sequences of all transcript annotations (headers are the transcript IDs)
- prot.canon.fasta - same as prot.fasta, but only containing canonical transcripts
- genome.size - contains the species name and the size (in bp) of the genome assembly in ENSEMBL
- intron_lengths.stats - a table containing gene structure summary stats for all canonical transcripts (see details under all_species)
- BUSCO.stats - a table containing gene structure stats for canonical transcripts identified as BUSCOs (see details under all_species)
- BUSCO - a directory containing the results of the BUSCO analysis on the protein set of the species. See details at: https://busco.ezlab.org/busco_userguide.html#protein-mode
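After extraction, one may want to confirm that every species directory contains the entries listed above. The following Python sketch does this with pathlib. The entry names are copied from the list above; the function name missing_entries and the path passed to it are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Entry names taken from the per_species file list; BUSCO is a directory.
EXPECTED = [
    "annotation.canon.gff3",
    "annotation.canon.gff3.mRNA_to_canon",
    "annotation.canon.gff3.gene_to_mRNA",
    "annotation.canon.introns.gff3",
    "prot.fasta",
    "prot.canon.fasta",
    "genome.size",
    "intron_lengths.stats",
    "BUSCO.stats",
    "BUSCO",
]


def missing_entries(per_species_dir):
    """Map each species directory name to the expected entries it lacks."""
    report = {}
    for species_dir in sorted(Path(per_species_dir).iterdir()):
        if not species_dir.is_dir():
            continue
        missing = [n for n in EXPECTED if not (species_dir / n).exists()]
        if missing:
            report[species_dir.name] = missing
    return report
```

An empty dictionary returned for the per_species directory would indicate that all species directories are complete.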
all_species - contains analyses performed on all species together (e.g., comparisons among species).
The files contained in this directory are:
- intron_lengths.stats - a table containing summary gene structure stats per species.
Each species appears twice (see the Dataset column description). The relevant columns are:
- Min - length of shortest intron
- Max - length of longest intron
- Mean - mean intron length
- STD - standard deviation of intron lengths distribution
- Q<10,25,50,75,90> - quantiles of intron lengths distribution
- Mean_exon - mean exon length
- Transcripts_count - number of canonical transcripts in the annotation
- Transcripts_containing_introns - number of canonical transcripts with at least one intron
- Mean_per_transcript - mean number of introns per transcript
- Mean_total_intron_length_per_transcript - intron lengths per canonical transcript were summed, and the mean across all transcripts was calculated
- Mean_total_exon_length_per_transcript - exon lengths per canonical transcript were summed, and the mean across all transcripts was calculated
- Mean_intron_fraction - the fraction (0-1) of intronic sequences per canonical transcript was calculated (total intron length / total transcript length), and the mean across all transcripts was taken
- Mean_intron_ratio - the ratio of intronic to exonic sequences per canonical transcript was calculated (total intron length / total exon length), and the mean across all transcripts was taken
- Dataset - either "all" or "all_log", representing raw stats and stats obtained by applying a log10-transformation to the intron lengths distribution, respectively.
- group - one of fungi, metazoa, plants, protists, or vertebrates, indicating the ENSEMBL DB from which the species originates
- phylum
- class - one of the six taxonomic classes analyzed in this study, or blank otherwise
- Genome_size - in bp, based on genome assembly found in ENSEMBL
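To make the definitions of Mean_intron_fraction and Mean_intron_ratio concrete, here is a small Python sketch with made-up numbers. It assumes that the total transcript length equals the sum of the total intron and total exon lengths, which is an interpretation of the column descriptions rather than something they state explicitly. The function name is hypothetical, and whether the pipeline includes intronless transcripts in these means is not specified here; the sketch includes them.

```python
def mean_intron_fraction_and_ratio(transcripts):
    """transcripts: list of (total_intron_length, total_exon_length) pairs.

    Returns (mean fraction, mean ratio), each averaged across transcripts.
    """
    fractions = []
    ratios = []
    for intron_len, exon_len in transcripts:
        # Assumption: transcript length = total introns + total exons.
        fractions.append(intron_len / (intron_len + exon_len))
        ratios.append(intron_len / exon_len)
    n = len(transcripts)
    return sum(fractions) / n, sum(ratios) / n


# Toy data (made up): three transcripts, one of them intronless.
toy = [(3000, 1000), (500, 1500), (0, 2000)]
fraction, ratio = mean_intron_fraction_and_ratio(toy)
```

Both statistics are means of per-transcript values, so they generally differ from what one would get by dividing Mean_total_intron_length_per_transcript by the corresponding exon or transcript total.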
- <feature>_KS_dist.<suffix> - where feature is one of intron_counts, intron_fractions, intron_ratios, or intron_lengths. Files with the .tsv suffix are matrices containing all pairwise Kolmogorov-Smirnov distances between species, based on the feature distributions. Files with the .phylip suffix contain the same matrices as the .tsv files, in PHYLIP format. Files with the .phylip_fastme_stat.txt suffix contain run logs of FastME on the PHYLIP inputs. Files with the .nwk suffix are Newick format trees obtained by running FastME on the KS matrices (see Methods in manuscript).
- BUSCO_<stat>.stats - where stat is one of intron_frac, intron_lens, n_introns, protein_len, total_exon_len, total_intron_len, or transcript_id. Matrices with the stats of specific genes identified as BUSCOs in each species.
- BUSCO_sequences - a directory containing fasta files with protein sequences, grouped by BUSCO ID. The directory contains six subdirectories, one for each class. Each class directory contains fasta files following the naming convention <BUSCO ID>.faa and <BUSCO ID>_MSA.faa, representing unaligned and aligned (MAFFT) sequences, respectively. Within each fasta file, record headers follow the convention _
- BUSCO_trees - a directory containing trees inferred based on the BUSCO MSAs using IQ-TREE. The relevant files containing the Newick trees are those ending in .treefile.
- <class>_change.tsv - where class is one of the six analyzed classes. The files contain tables with evolutionary change calculations per BUSCO ID. See the manuscript for further details.
- <class>.list - where class is one of the six analyzed classes. The list of species belonging to a class (one per line).
- busco_downloads - a directory containing lineage information downloaded automatically by BUSCO.
- BUSCO_change - a directory containing results of the evolutionary change analysis performed per BUSCO (see Methods in manuscript). The directory contains six subdirectories, one for each class. Each class directory contains one file per BUSCO, named <BUSCO ID>.tsv, with three columns: Total_tree_length, Total_sequence_change, and Total_gene_structure_change.
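For readers unfamiliar with the distance used in the KS matrices, the following Python sketch computes the two-sample Kolmogorov-Smirnov statistic (the largest absolute difference between two empirical cumulative distribution functions) and a pairwise matrix over several species. It is a plain standard-library illustration; the function names and the input layout are hypothetical, and the pipeline's actual implementation may differ, for example by relying on a library routine.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    best = 0.0
    # The maximum difference is attained at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        best = max(best, abs(cdf_a - cdf_b))
    return best


def pairwise_ks(distributions):
    """distributions: dict mapping species name -> list of feature values.

    Returns (names, matrix), where matrix[i][j] is the KS distance between
    the species at positions i and j of the sorted name list.
    """
    names = sorted(distributions)
    matrix = [[0.0] * len(names) for _ in names]
    for i, name_i in enumerate(names):
        for j in range(i + 1, len(names)):
            d = ks_distance(distributions[name_i], distributions[names[j]])
            matrix[i][j] = matrix[j][i] = d
    return names, matrix
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), which makes the resulting symmetric matrix usable as input for distance-based tree building.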
Code/Software
This data set is meant to serve as input for the code in the pipeline's GitHub repository (https://github.com/MayroseLab/gene_structure_evolution). Please see the repository for further details.