Convergent genomic signatures associated with vertebrate viviparity

Eastment, Rhiannon¹; Wong, Bob¹; McGee, Matthew¹

Published Feb 21, 2024 on Dryad. https://doi.org/10.5061/dryad.rn8pk0pjx

Data files

Feb 21, 2024 version files (37.99 GB total)

01_genome_alignment.zip: 21.65 MB
03_species_tree_inference.zip: 3.12 KB
04_neutral_phylogenetic_model.zip: 3.33 KB
05_time_tree.zip: 3.29 KB
06_protein_family_analysis.zip: 37.96 GB
07_phyloacc.zip: 249.55 KB
08_positive_selection.zip: 4.30 KB
README.md: 15.08 KB

Abstract

Viviparity (live birth) is a complex and innovative mode of reproduction that has evolved repeatedly across the vertebrate Tree of Life. The genetic basis of viviparity has attracted increasing interest in recent years; however, such studies are often undertaken on short evolutionary timescales and thus cannot address changes occurring on a broader scale. Using whole-genome data, we investigated the molecular basis of this innovation across the diversity of vertebrates to answer a long-held question in evolutionary biology: is the evolution of convergent traits driven by convergent genomic changes? This dataset includes the scripts and files used to investigate the genomic basis of viviparity in vertebrates. Specifically, we use genome alignments to investigate changes to protein families, protein-coding regions, introns and untranslated regions (UTRs). We assess changes in the sizes of protein families and analyse differences in substitution rates in coding and noncoding sequences.


Viviparity (live birth) is a complex and innovative mode of reproduction that has evolved repeatedly across the vertebrate Tree of Life. Using whole-genome data, we investigated the molecular basis of this innovation across the diversity of vertebrates to determine whether the evolution of convergent traits is driven by convergent genomic changes. This dataset includes the scripts and files used to investigate the genomic basis of viviparity in vertebrates. Specifically, we use newly sequenced and publicly available whole-genome data to generate genome alignments and assess changes in substitution rates in coding regions, introns and untranslated regions (UTRs). We additionally aligned genomes to the Pfam database to analyse differences in protein family sizes.

Description of the data and file structure

Data are organised by analysis, each of which is described below.

01_genome_alignment: We generated two multiple-genome alignments: a ‘default’ alignment comprising 27 vertebrate genomes, and an ‘extended’ alignment comprising 51 vertebrate genomes. Genomes were either sequenced and assembled by the authors or sourced from publicly available datasets.

02_extract_and_align_CDS_and_nonCDS: Coding sequences (CDS), introns and UTRs (nonCDS) were extracted from both the 'default' and 'extended' datasets. Sequences were then realigned to account for any initial alignment errors.

03_species_tree_inference: Species trees were generated for both the 'default' and 'extended' datasets using UTR and intron alignments.

04_neutral_phylogenetic_model: A neutral phylogenetic model was generated for both the 'default' and 'extended' datasets using fourfold degenerate (4d) sites from the respective multi-genome alignments.

05_time_tree: Two timetrees were generated for the 'extended' dataset, one using maximum ages of divergence and the other using minimum ages of divergence.

06_protein_family_analysis: Genomes from the 'extended' dataset were aligned to the Pfam database, an online repository containing the annotations and multiple sequence alignments of over 19,000 protein families. These pairwise alignments were then used to determine the size of each protein family in each species, and to analyse the relationship between protein family size and reproductive mode. One protein family in particular, Ubi-N-Sde2, was analysed further to generate sequence alignments and phylogenetic trees.

07_phyloacc: PhyloAcc was used to assess differences in substitution rates between viviparous and oviparous species.

08_positive_selection: We tested coding alignments from the 'default' dataset for evidence of positive selection to determine whether viviparous taxa experience convergent shifts in amino acid substitutions.
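The neutral model under 04_neutral_phylogenetic_model is built from fourfold degenerate (4d) sites, i.e. third codon positions where any of the four nucleotides encodes the same amino acid. The sketch below illustrates only that definition, using the standard genetic code. It is not the authors' extraction pipeline, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered
# TTT, TTC, TTA, TTG, TCT, ... with the third base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon):
    """True if every choice of third base yields the same amino acid."""
    codon = codon.upper()
    if codon not in CODON_TABLE:
        return False  # gaps, ambiguity codes, or incomplete codons
    prefix = codon[:2]
    return len({CODON_TABLE[prefix + base] for base in BASES}) == 1


def fourfold_site_indices(cds):
    """0-based positions of 4d sites in an in-frame coding sequence."""
    return [
        start + 2
        for start in range(0, len(cds) - 2, 3)
        if is_fourfold_degenerate(cds[start:start + 3])
    ]
```

For example, `fourfold_site_indices("ATGGCTTAA")` reports only the third position of the alanine codon `GCT`. In a multi-species alignment one would normally also require the site to be 4d with respect to the reference annotation, which this sketch does not attempt.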

01_genome_alignment

51_vertebrates.rn.maf.gz: 'Extended' multiple-genome alignment comprising 51 vertebrate genomes, with the sequence header for human sequences renamed to reflect the associated chromosome
51_vertebrates.rn.maf.gz: 'Default' multiple-genome alignment comprising 27 vertebrate genomes, with the sequence header for human sequences renamed to reflect the associated chromosome
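
The alignments are distributed as gzip-compressed MAF files. As a minimal sketch of how such a file could be inspected, the code below reads alignment blocks and counts how many blocks each species appears in. It assumes the usual MAF layout ("a" lines opening a block, "s" lines with seven whitespace-separated fields) and that source names follow a "species.sequence" pattern; neither assumption has been checked against these particular files, and the function names are hypothetical.

```python
import gzip
from collections import Counter


def iter_maf_blocks(lines):
    """Yield each alignment block as a list of sequence records.

    Each record is (src, start, size, strand, src_size, text),
    following the field order of MAF "s" lines.
    """
    block = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # An "a" line opens a new block; emit the previous one first.
            if block:
                yield block
            block = []
        elif fields[0] == "s" and len(fields) == 7:
            _, src, start, size, strand, src_size, text = fields
            block.append((src, int(start), int(size), strand, int(src_size), text))
    if block:
        yield block


def blocks_per_species(lines):
    """Count the number of blocks in which each species occurs.

    Assumes source names look like "species.sequence"; names without
    a dot are used whole.
    """
    counts = Counter()
    for block in iter_maf_blocks(lines):
        counts.update({src.split(".")[0] for src, *_ in block})
    return counts


def blocks_per_species_in_file(path):
    """Convenience wrapper for a gzip-compressed MAF file."""
    with gzip.open(path, "rt") as handle:
        return blocks_per_species(handle)
```

Because the parser streams line by line, it does not need to hold a whole alignment in memory.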

03_species_tree_inference

default_nonCDS_spptree.treefile: Species tree generated using UTR and intron alignments from the 'default' dataset
extended_nonCDS_spptree.treefile: Species tree generated using UTR and intron alignments from the 'extended' dataset

04_neutral_phylogenetic_model

51_vertebrates_4d.mod: Neutral phylogenetic model for the 'extended' dataset
27_vertebrates_4d_anc.mod: Neutral phylogenetic model for the 'default' dataset, with ancestral branches labelled

05_time_tree

max_tt.timetree.nex: Dated phylogenetic tree of taxa from the ‘extended’ dataset generated using maximum ages of divergence
min_tt.timetree.nex: Dated phylogenetic tree of taxa from the ‘extended’ dataset generated using minimum ages of divergence

06_protein_family_analysis

Pfam.renamed.fa.gz: Fasta file of sequences in the Pfam database
/blasttabs.gz/: Pairwise alignments in blasttab format. Generated by aligning genomes to the Pfam database.
pfam_counts.csv: Contains the number of sequences within each protein family for each species in the 'extended' dataset
Sde2_all.fasta: All Ubi-N-Sde2 sequences from the 'extended' dataset
Sde2_clusters_aln.fasta: Clustered and aligned sequences of Ubi-N-Sde2
Sde2_clusters.treefile: Tree generated using clustered and aligned sequences of Ubi-N-Sde2
UBC.aln.fasta: UBC sequence alignment generated using MAFFT
UBC_consensus.fasta: UBC sequence alignment generated using BAli-Phy
UBB.aln.fasta: UBB sequence alignment generated using MAFFT
UBB_consensus.fasta: UBB sequence alignment generated using BAli-Phy
mammals_sde2_NT.aln.fasta: Sequence alignment of mammalian Ubi-N-Sde2 sequences
mammals_sde2.treefile: Tree generated from mammalian Ubi-N-Sde2 sequences
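
The blasttab files hold pairwise alignments from which per-species protein family sizes were derived. The sketch below shows one way hits per family could be tallied from a BLAST-style tabular file. It assumes the standard 12-column layout with the e-value in column 11. Which column holds the Pfam sequence ID, how family names are encoded in that ID, and the e-value cutoff are all illustrative assumptions rather than the authors' procedure, so the result should not be expected to reproduce pfam_counts.csv.

```python
import csv
from collections import Counter


def count_family_hits(rows, family_of, id_column=0, max_evalue=1e-5):
    """Count hits per family in BLAST-style tabular rows.

    rows       : iterable of field lists (e.g. csv.reader with tab delimiter)
    family_of  : function mapping a sequence ID to a family name
                 (hypothetical; depends on how Pfam IDs are encoded)
    id_column  : column holding the Pfam sequence ID (assumed)
    max_evalue : illustrative cutoff, not taken from the study
    """
    counts = Counter()
    for row in rows:
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        if float(row[10]) <= max_evalue:  # column 11 holds the e-value
            counts[family_of(row[id_column])] += 1
    return counts


def count_family_hits_in_file(path, family_of, **kwargs):
    """Apply count_family_hits to an uncompressed tab-separated file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return count_family_hits(reader, family_of, **kwargs)
```

Passing `family_of` as a function keeps the ID-parsing rule separate from the counting logic, so it can be swapped once the actual header format in Pfam.renamed.fa.gz is known.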

07_phyloacc

CDS_elem_lik.txt: PhyloAcc output using coding sequences. Maximum log-likelihood configurations of latent state Z under null, accelerated and full model, with Z=-1 (if the element is 'missing' in the branches of outgroup species), 0 (background), 1 (conserved), 2 (accelerated). Each row corresponds to an input element and each column a branch in the tree.
CDS_rate_postZ_M2.txt: PhyloAcc output using coding sequences. Posterior medians of the conserved rate, the accelerated rate, and the probabilities of gain and loss of conservation, together with the posterior probability of each latent state on each branch for each element. Columns ending in _3 give the posterior probability of the accelerated state.
nonCDS_elem_lik.txt: PhyloAcc output using UTRs and introns. Maximum log-likelihood configurations of latent state Z under null, accelerated and full model, with Z=-1 (if the element is 'missing' in the branches of outgroup species), 0 (background), 1 (conserved), 2 (accelerated). Each row corresponds to an input element and each column a branch in the tree.
nonCDS_rate_postZ_M2.txt: PhyloAcc output using UTRs and introns. Posterior medians of the conserved rate, the accelerated rate, and the probabilities of gain and loss of conservation, together with the posterior probability of each latent state on each branch for each element. Columns ending in _3 give the posterior probability of the accelerated state.
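
As a rough illustration of how the *_rate_postZ_M2.txt files might be queried, the sketch below lists, for each element, the branches whose posterior probability of the accelerated state exceeds a threshold. It assumes the files are tab-separated with a header row and relies on the "_3" column suffix described above. The threshold, the optional ID column name, and the function name are hypothetical.

```python
import csv


def accelerated_branches(lines, threshold=0.9, id_column=None):
    """Map each element to the branches with P(accelerated) > threshold.

    lines     : iterable of text lines (an open file works)
    threshold : illustrative cutoff, not taken from the study
    id_column : name of a column identifying the element; if None,
                the 0-based row index is used as the key
    """
    reader = csv.DictReader(lines, delimiter="\t")
    result = {}
    for index, row in enumerate(reader):
        key = row[id_column] if id_column else index
        result[key] = [
            name[:-2]  # strip the "_3" suffix to recover the branch name
            for name, value in row.items()
            if name and name.endswith("_3") and value not in (None, "")
            and float(value) > threshold
        ]
    return result
```

A call such as `accelerated_branches(open("CDS_rate_postZ_M2.txt"))` would then return a dictionary keyed by row index, which can be cross-referenced with the rows of the matching *_elem_lik.txt file.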

08_positive_selection

27_vertebrates_4d_foreground.mod: Phylogenetic tree used as input to PAML, with foreground branches labelled.
cdsconc.cf.tree: Tree with concordance factors generated using coding sequences from the 'default' dataset
nceconc.cf.tree: Tree with concordance factors generated using UTRs and introns from the 'default' dataset

Code/Software

01_genome_alignment

01_process_and_align_genomes.txt: Processes and aligns genomes to create a multi-genome alignment

02_extract_and_align_CDS_and_nonCDS

01_sort_alignment_and_annotation.txt: Renames the human sequence headers in the multi-genome alignment and the human annotation file to include the chromosome number. It also creates two new annotation files, one defining the locations of coding sequences and one defining the locations of UTRs and introns in the human genome, and splits the annotation files and the alignment by chromosome, as required by certain software such as PHAST.
02_run_cds_noncds_extraction.txt: Runs MafFilter to extract CDS, UTRs and introns from the multi-genome alignment
03_extract_noncds.txt: MafFilter parameter file to extract UTRs and introns
04_extract_cds.txt: MafFilter parameter file to extract CDS
05_get_multifa_coordinates.txt: Obtains the coordinates for the FASTA files corresponding to CDS, UTR and intron alignments
06_find_duplicate_fastas.Rmd: R notebook for finding and condensing duplicate alignment files
07_align_multifas_via_mafft.txt: Uses MAFFT to realign sequences
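
Step 06 above finds and condenses duplicate alignment files using an R notebook. As a loose Python illustration of the general idea only, the sketch below groups files whose contents are byte-identical. Whether the notebook defines "duplicate" this way (rather than, say, by shared coordinates) is an assumption, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_duplicates(items):
    """Group names whose contents are byte-identical.

    items : iterable of (name, content_bytes) pairs
    Returns a list of sorted name lists, one per duplicate group.
    """
    groups = defaultdict(list)
    for name, content in items:
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def group_duplicate_files(paths):
    """Apply group_duplicates to files on disk."""
    return group_duplicates((str(p), Path(p).read_bytes()) for p in paths)
```

For instance, `group_duplicate_files(Path("alignments").glob("*.fa"))` would report sets of identical FASTA files in a directory named `alignments` (a placeholder path), from which one representative per set could be kept.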