Convergent genomic signatures associated with vertebrate viviparity
Data files
Feb 21, 2024 version files 37.99 GB
- 01_genome_alignment.zip (21.65 MB)
- 03_species_tree_inference.zip (3.12 KB)
- 04_neutral_phylogenetic_model.zip (3.33 KB)
- 05_time_tree.zip (3.29 KB)
- 06_protein_family_analysis.zip (37.96 GB)
- 07_phyloacc.zip (249.55 KB)
- 08_positive_selection.zip (4.30 KB)
- README.md (15.08 KB)
Abstract
Viviparity (live birth) is a complex and innovative mode of reproduction that has evolved repeatedly across the vertebrate Tree of Life. The genetic basis of viviparity has garnered increasing interest in recent years; however, such studies are often undertaken on short evolutionary timescales and thus cannot address changes occurring on a broader scale. Using whole genome data, we investigated the molecular basis of this innovation across the diversity of vertebrates to answer a long-held question in evolutionary biology: is the evolution of convergent traits driven by convergent genomic changes? This dataset includes the scripts and files used to investigate the genomic basis of viviparity in vertebrates. Specifically, we use genome alignments to investigate changes to protein families, protein-coding regions, introns and untranslated regions (UTRs). We assess changes in the sizes of protein families, and analyse differences in substitution rates in coding and noncoding sequences.
https://doi.org/10.5061/dryad.rn8pk0pjx
Viviparity (live birth) is a complex and innovative mode of reproduction that has evolved repeatedly across the vertebrate Tree of Life. Using whole genome data, we investigated the molecular basis of this innovation across the diversity of vertebrates to determine whether the evolution of convergent traits is driven by convergent genomic changes. This dataset includes the scripts and files used to investigate the genomic basis of viviparity in vertebrates. Specifically, we use newly sequenced and publicly available whole genome data to generate genome alignments to assess changes in substitution rates in coding regions, introns and untranslated regions (UTRs). We additionally aligned genomes to the Pfam database to analyse differences in protein family sizes.
Description of the data and file structure
Data are organised by analysis, each of which is described below.
01_genome_alignment: We generated two multiple-genome alignments: a 'default' alignment comprising 27 vertebrate genomes, and an 'extended' alignment comprising 51 vertebrate genomes. Genomes were either sequenced and assembled by the authors or sourced from publicly available datasets.
02_extract_and_align_CDS_and_nonCDS: Coding sequences (CDS), introns and UTRs (nonCDS) were extracted from both the 'default' and 'extended' datasets. Sequences were then realigned to account for any initial alignment errors.
03_species_tree_inference: Species trees were generated for both the 'default' and 'extended' datasets using UTR and intron alignments.
04_neutral_phylogenetic_model: A neutral phylogenetic model was generated for both the 'default' and 'extended' datasets using fourfold degenerate (4d) sites from the respective multi-genome alignments.
05_time_tree: Two timetrees were generated for the 'extended' dataset, each one using either maximum or minimum ages of divergence.
06_protein_family_analysis: Genomes from the 'extended' dataset were aligned to the Pfam database, an online repository containing the annotations and multiple sequence alignments of over 19,000 protein families. These pairwise alignments were then used to determine the sizes of protein families for each species, and analyse the relationship between protein family size and reproductive mode. One protein family in particular, Ubi-N-Sde2, was further analysed to generate sequence alignments and phylogenetic trees.
07_phyloacc: PhyloAcc was used to assess differences in substitution rates between viviparous and oviparous species.
08_positive_selection: We tested coding alignments from the 'default' dataset for evidence of positive selection to determine whether viviparous taxa experience convergent shifts in amino acid substitutions.
Data/
| - 01_genome_alignment/
| | - 51_vertebrates.rn.maf.gz
| | - 27_vertebrates.rn.maf.gz
| - 03_species_tree_inference/
| | - default_nonCDS_spptree.treefile
| | - extended_nonCDS_spptree.treefile
| - 04_neutral_phylogenetic_model/
| | - 51_vertebrates_4d.mod
| | - 27_vertebrates_4d_anc.mod
| - 05_time_tree/
| | - max_tt.timetree.nex
| | - min_tt.timetree.nex
| - 06_protein_family_analysis/
| | - Pfam.renamed.fa.gz
| | - blasttabs.gz/
| | | - *.blasttab
| | - pfam_counts.csv
| | - Sde2_all.fasta
| | - Sde2_clusters_aln.fasta
| | - Sde2_clusters.treefile
| | - UBC.aln.fasta
| | - UBC_consensus.fasta
| | - UBB.aln.fasta
| | - UBB_consensus.fasta
| | - mammals_sde2_NT.aln.fasta
| | - mammals_sde2.treefile
| - 07_phyloacc/
| | - CDS_elem_lik.txt
| | - CDS_rate_postZ_M2.txt
| | - nonCDS_elem_lik.txt
| | - nonCDS_rate_postZ_M2.txt
| - 08_positive_selection/
| | - 27_vertebrates_4d_foreground.mod
| | - cdsconc.cf.tree
| | - nceconc.cf.tree
01_genome_alignment
- 51_vertebrates.rn.maf.gz: 'Extended' multiple-genome alignment comprising 51 vertebrate genomes, with the sequence header for human sequences renamed to reflect the associated chromosome
- 27_vertebrates.rn.maf.gz: 'Default' multiple-genome alignment comprising 27 vertebrate genomes, with the sequence header for human sequences renamed to reflect the associated chromosome
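As a quick sanity check on either alignment, the genomes present in a MAF file can be tallied from its sequence rows. The following is a minimal sketch, not part of the deposited scripts. It assumes that each source name has the form `<genome>.<sequence>`, which may not match the renamed human headers exactly; the function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_maf_rows_per_genome(lines: Iterable[str]) -> Counter:
    """Count MAF sequence ('s') rows per genome name."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        # MAF sequence rows: s <src> <start> <size> <strand> <srcSize> <text>
        if len(fields) == 7 and fields[0] == "s":
            # Assumption: <src> looks like "<genome>.<sequence>"; keep the genome part.
            genome = fields[1].split(".", 1)[0]
            counts[genome] += 1
    return counts


def count_maf_file(path: str) -> Counter:
    """Same tally for a gzip-compressed MAF file, read line by line."""
    with gzip.open(path, "rt") as handle:
        return count_maf_rows_per_genome(handle)


if __name__ == "__main__":
    demo = [
        "##maf version=1",
        "a score=0",
        "s human.chr1 0 4 + 100 ACGT",
        "s mouse.chr2 5 4 - 90 AC-T",
        "",
        "a score=0",
        "s human.chr1 10 3 + 100 GGA",
    ]
    print(count_maf_rows_per_genome(demo))
```

For the deposited files, `count_maf_file("27_vertebrates.rn.maf.gz")` would stream the alignment without loading it into memory.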
03_species_tree_inference
- default_nonCDS_spptree.treefile: Species tree generated using UTR and intron alignments from the 'default' dataset
- extended_nonCDS_spptree.treefile: Species tree generated using UTR and intron alignments from the 'extended' dataset
04_neutral_phylogenetic_model
- 51_vertebrates_4d.mod: Neutral phylogenetic model for the 'extended' dataset
- 27_vertebrates_4d_anc.mod: Neutral phylogenetic model for the 'default' dataset, with ancestral branches labelled
05_time_tree
- max_tt.timetree.nex: Dated phylogenetic tree of taxa from the ‘extended’ dataset generated using maximum ages of divergence
- min_tt.timetree.nex: Dated phylogenetic tree of taxa from the ‘extended’ dataset generated using minimum ages of divergence
06_protein_family_analysis
- Pfam.renamed.fa.gz: Fasta file of sequences in the Pfam database
- blasttabs.gz/: Pairwise alignments in blasttab format, generated by aligning genomes to the Pfam database
- pfam_counts.csv: Contains the number of sequences within each protein family for each species in the 'extended' dataset
- Sde2_all.fasta: All Ubi-N-Sde2 sequences from the 'extended' dataset
- Sde2_clusters_aln.fasta: Clustered and aligned sequences of Ubi-N-Sde2
- Sde2_clusters.treefile: Tree generated using clustered and aligned sequences of Ubi-N-Sde2
- UBC.aln.fasta: UBC sequence alignment generated using MAFFT
- UBC_consensus.fasta: UBC sequence alignment generated using BAli-Phy
- UBB.aln.fasta: UBB sequence alignment generated using MAFFT
- UBB_consensus.fasta: UBB sequence alignment generated using BAli-Phy
- mammals_sde2_NT.aln.fasta: Sequence alignment of mammalian Ubi-N-Sde2 sequences
- mammals_sde2.treefile: Tree generated from mammalian Ubi-N-Sde2 sequences
07_phyloacc
- CDS_elem_lik.txt: PhyloAcc output using coding sequences. Maximum log-likelihood configurations of latent state Z under null, accelerated and full model, with Z=-1 (if the element is 'missing' in the branches of outgroup species), 0 (background), 1 (conserved), 2 (accelerated). Each row corresponds to an input element and each column a branch in the tree.
- CDS_rate_postZ_M2.txt: PhyloAcc output using coding sequences. Posterior median of conserved rate, accelerated rate, probability of gain and loss of conservation, and posterior probability of being in each latent state on each branch for each element. *_3 indicates the posterior probability in the accelerated state.
- nonCDS_elem_lik.txt: PhyloAcc output using UTRs and introns. Maximum log-likelihood configurations of latent state Z under null, accelerated and full model, with Z=-1 (if the element is 'missing' in the branches of outgroup species), 0 (background), 1 (conserved), 2 (accelerated). Each row corresponds to an input element and each column a branch in the tree.
- nonCDS_rate_postZ_M2.txt: PhyloAcc output using UTRs and introns. Posterior median of conserved rate, accelerated rate, probability of gain and loss of conservation, and posterior probability of being in each latent state on each branch for each element. *_3 indicates the posterior probability in the accelerated state.
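The `*_3` columns described above can be used to list, for each element, the branches with a high posterior probability of acceleration. The sketch below is illustrative only. It assumes the file is tab-delimited with a header row and that the element identifier column is named `No.`; the 0.9 threshold and the function name are hypothetical choices, not values taken from the analysis.

```python
import csv
import io
from typing import Dict, List, TextIO


def accelerated_branches(
    handle: TextIO, threshold: float = 0.9, id_column: str = "No."
) -> Dict[str, List[str]]:
    """Map each element ID to the branches whose '<branch>_3' value >= threshold."""
    result: Dict[str, List[str]] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        hits = [
            name[:-2]  # strip the trailing "_3" to recover the branch name
            for name, value in row.items()
            if name is not None
            and name.endswith("_3")
            and value not in (None, "")
            and float(value) >= threshold
        ]
        result[row[id_column]] = hits
    return result


if __name__ == "__main__":
    # Tiny synthetic table standing in for a *_rate_postZ_M2.txt file.
    demo = (
        "No.\tn_rate\tspA_3\tspB_3\tspA_2\n"
        "0\t1.0\t0.95\t0.10\t0.05\n"
        "1\t1.0\t0.20\t0.99\t0.80\n"
    )
    print(accelerated_branches(io.StringIO(demo)))
```

Passing an open file handle for `CDS_rate_postZ_M2.txt` in place of the synthetic table would apply the same filter to the deposited output, provided the column-name assumptions hold.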
08_positive_selection
- 27_vertebrates_4d_foreground.mod: Phylogenetic tree used as input to PAML, with foreground branches labelled.
- cdsconc.cf.tree: Tree with concordance factors generated using coding sequences from the 'default' dataset
- nceconc.cf.tree: Tree with concordance factors generated using UTRs and introns from the 'default' dataset
Code/Software
Scripts/
| - 01_genome_alignment/
| | - 01_process_and_align_genomes.txt
| - 02_extract_and_align_CDS_and_nonCDS/
| | - 01_sort_alignment_and_annotation.txt
| | - 02_run_cds_noncds_extraction.txt
| | - 03_extract_noncds.txt
| | - 04_extract_cds.txt
| | - 05_get_multifa_coordinates.txt
| | - 06_find_duplicate_fastas.Rmd
| | - 07_align_multifas_via_mafft.txt
| - 03_species_tree_inference/
| | - 01_generate_species_trees.txt
| | - 02_root_species_trees.Rmd
| - 04_neutral_phylogenetic_model/
| | - 01_phastcons_default.txt
| | - 02_phastcons_extended.txt
| - 05_time_tree/
| | - 01_generate_time_trees.txt
| | - 02_max_dates.txt
| | - 03_min_dates.txt
| - 06_protein_family_analysis/
| | - 01_align_pfam.txt
| | - 02_filter_pfam_alignments.txt
| | - 03_pfam_analysis.Rmd
| | - 04_get_sde2.txt
| | - 05_sde2_alignment_and_tree.txt
| | - 06_find_sde2_orthologs.txt
| | - 07_align_orthologous_genes.txt
| | - 08_align_mammalian_sde2.txt
| | - 09_run_paml.txt
| | - 10_codeml_M0_mammals.ctl
| | - 11_codeml_M1_mammals.ctl
| | - 12_codeml_M2_mammals.ctl
| | - 13_analyse_paml.Rmd
| - 07_phyloacc/
| | - 01_prep_phyloacc_input.txt
| | - 02_run_phyloacc.txt
| | - 03_phyloacc_parameters.txt
| | - 04_phyloacc_analysis.Rmd
| | - 05_get_phyloacc_genes.txt
| - 08_positive_selection/
| | - 01_prep_paml.txt
| | - 02_run_paml.txt
| | - 03_codeml_M0.ctl
| | - 04_codeml_M1.ctl
| | - 05_codeml_M2.ctl
| | - 06_codeml_bs_null.ctl
| | - 07_codeml_bs_full.ctl
| | - 08_get_paml_results.txt
| | - 09_paml_analysis.Rmd
| | - 10_baliphy.txt
| | - 11_get_concordance_factors.txt
01_genome_alignment
- 01_process_and_align_genomes.txt: Processes and aligns genomes to create a multi-genome alignment
02_extract_and_align_CDS_and_nonCDS
- 01_sort_alignment_and_annotation.txt: Renames the human sequence headers in the multi-genome alignment and human annotation file to include the chromosome number, while also creating two new annotation files that define the location of coding sequences and UTRs and introns in the human genome. It additionally splits the annotation files and alignment by chromosome, which is required by certain software such as PHAST.
- 02_run_cds_noncds_extraction.txt: Runs maffilter to extract CDS, UTRs and introns from multi-genome alignment
- 03_extract_noncds.txt: Maffilter parameter file to extract UTRs and introns
- 04_extract_cds.txt: Maffilter parameter file to extract CDS
- 05_get_multifa_coordinates.txt: Obtains the coordinates for fastas corresponding to CDS, UTR and intron alignments
- 06_find_duplicate_fastas.Rmd: R notebook for finding and condensing duplicate alignment files
- 07_align_multifas_via_mafft.txt: Uses MAFFT to realign sequences
03_species_tree_inference
- 01_generate_species_trees.txt: Generates concatenated alignments for UTRs and introns from both the 'default' and 'extended' datasets. Subsequently generates a species tree for both datasets using IQ-Tree. Note that both trees are rooted using an external script in R (02_root_species_trees.Rmd).
- 02_root_species_trees.Rmd: R notebook used to root species trees at the node connecting Chondrichthyes and Osteichthyes.
04_neutral_phylogenetic_model
- 01_phastcons_default.txt: Estimates a neutral phylogenetic model using fourfold degenerate sites from the 'default' multi-genome alignment, in addition to labelling ancestral branches.
- 02_phastcons_extended.txt: Estimates a neutral phylogenetic model using fourfold degenerate sites from the 'extended' multi-genome alignment.
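For readers unfamiliar with fourfold degenerate (4d) sites: these are third codon positions at which any of the four nucleotides encodes the same amino acid, so substitutions there are synonymous and are commonly treated as approximately neutral. The sketch below identifies such positions in a single in-frame coding sequence under the standard genetic code. It only illustrates the concept and does not reproduce how PHAST extracts 4d sites from a multi-genome alignment; the function name is hypothetical.

```python
from typing import List

# Under the standard genetic code, these eight two-base codon prefixes
# (Leu CT, Val GT, Ser TC, Pro CC, Thr AC, Ala GC, Arg CG, Gly GG)
# encode the same amino acid whatever the third base is.
FOURFOLD_PREFIXES = frozenset({"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"})


def fourfold_sites(cds: str) -> List[int]:
    """Return 0-based positions of fourfold degenerate third codon positions.

    Assumes `cds` is in frame (starts at codon position 1); a trailing
    partial codon is ignored.
    """
    cds = cds.upper()
    sites: List[int] = []
    for start in range(0, len(cds) - len(cds) % 3, 3):
        if cds[start:start + 2] in FOURFOLD_PREFIXES:
            sites.append(start + 2)
    return sites


if __name__ == "__main__":
    # Codons: ATG GCT TTT CTG TAA -> GCT (Ala) and CTG (Leu) are fourfold.
    print(fourfold_sites("ATGGCTTTTCTGTAA"))  # [5, 11]
```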
05_time_tree
- 01_generate_time_trees.txt: Creates time trees based on the maximum and minimum ages of divergence
- 02_max_dates.txt: Date file of maximum ages of divergence
- 03_min_dates.txt: Date file of minimum ages of divergence
06_protein_family_analysis
- 01_align_pfam.txt: Processes genomes and aligns them to the Pfam database, creating pairwise alignments in blasttab format
- 02_filter_pfam_alignments.txt: Filters blasttabs by removing sequence alignments with an E-value >= 1e-10, and reformats columns
- 03_pfam_analysis.Rmd: R notebook to filter the Pfam alignment output and generate count tables outlining the number of sequences within a specific protein family for all 51 vertebrates in the dataset. Additionally analyses the relationship between viviparity and protein family size using PGLMMs and Bayesian methods. It also generates genomic coordinates for sequences within a family of interest.
- 04_get_sde2.txt: Obtains sequences within the Ubi-N-Sde2 family for all species
- 05_sde2_alignment_and_tree.txt: Generates a tree using Ubi-N-Sde2 sequences from all vertebrates in the 'extended' dataset, which were first clustered and aligned using MAFFT
- 06_find_sde2_orthologs.txt: Obtains genes containing Ubi-N-Sde2 sequences in mammals and identifies their orthologs in other mammalian genomes
- 07_align_orthologous_genes.txt: Generates whole-gene alignments for UBC and UBB using BAli-Phy and MAFFT
- 08_align_mammalian_sde2.txt: Aligns Ubi-N-Sde2 sequences in mammals using MACSE
- 09_run_paml.txt: Calls each codeml file to run PAML
- 10_codeml_M0_mammals.ctl: Control file for running model M0 in codeml
- 11_codeml_M1_mammals.ctl: Control file for running model M1 in codeml
- 12_codeml_M2_mammals.ctl: Control file for running model M2 in codeml
- 13_analyse_paml.Rmd: Constructs likelihood ratio tests for PAML output
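The filtering and counting steps described for scripts 02 and 03 can be sketched as follows. This is an illustration under stated assumptions, not the deposited code: it assumes the default 12-column BLAST tabular layout (E-value in the eleventh column), that the Pfam sequence ID sits in the second column, and that the family label can be recovered from a hypothetical `<family>|<accession>` header layout. Only the E-value cutoff comes from the description above.

```python
from collections import Counter
from typing import Callable, Iterable

EVALUE_COLUMN = 10  # 0-based index of the E-value in 12-column BLAST tabular output


def count_family_hits(
    lines: Iterable[str],
    max_evalue: float = 1e-10,
    id_column: int = 1,  # assumption: Pfam sequence ID is the subject column
    # Hypothetical header layout "<family>|<accession>"; adapt to the real headers.
    family_of: Callable[[str], str] = lambda seq_id: seq_id.split("|")[0],
) -> Counter:
    """Count alignments per protein family, dropping rows with E-value >= max_evalue."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed rows
        if float(fields[EVALUE_COLUMN]) >= max_evalue:
            continue  # removed, as in 02_filter_pfam_alignments.txt
        counts[family_of(fields[id_column])] += 1
    return counts


if __name__ == "__main__":
    demo = [
        "q1\tFamA|x1\t90\t50\t1\t0\t1\t50\t1\t50\t1e-30\t200",
        "q2\tFamA|x2\t80\t50\t5\t0\t1\t50\t1\t50\t1e-10\t90",
        "q3\tFamB|y1\t85\t50\t3\t0\t1\t50\t1\t50\t5e-12\t120",
    ]
    print(count_family_hits(demo))
```

Running this once per species and collecting the counters into a table would give a structure comparable to `pfam_counts.csv`, although the deposited notebook may apply further filtering.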
07_phyloacc
- 01_prep_phyloacc_input.txt: Creates concatenated fasta and partition file required by PhyloAcc
- 02_run_phyloacc.txt: Executes PhyloAcc
- 03_phyloacc_parameters.txt: Parameter file for PhyloAcc
- 04_phyloacc_analysis.Rmd: R notebook for the analysis of PhyloAcc output. Identifies and plots accelerated elements
- 05_get_phyloacc_genes.txt: Obtains the names of genes containing viviparous accelerated elements
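The input preparation step (concatenating per-element alignments and recording where each element falls in the concatenated sequence) can be sketched as below. This is a simplified illustration rather than the deposited script. It assumes partitions are recorded as 0-based, half-open coordinates and that species missing from an element are padded with gaps; the function name and data layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Alignment = Dict[str, str]  # species name -> aligned sequence


def concatenate_elements(
    elements: Sequence[Tuple[str, Alignment]], species: Sequence[str]
) -> Tuple[Dict[str, str], List[Tuple[int, int, str]]]:
    """Concatenate element alignments and return (sequences, partitions).

    Each partition is (start, end, element_name) in 0-based, half-open
    coordinates of the concatenated alignment.
    """
    pieces: Dict[str, List[str]] = {sp: [] for sp in species}
    partitions: List[Tuple[int, int, str]] = []
    offset = 0
    for name, alignment in elements:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"element {name!r} is empty or not aligned")
        length = lengths.pop()
        for sp in species:
            # Assumption: a species absent from an element is filled with gaps.
            pieces[sp].append(alignment.get(sp, "-" * length))
        partitions.append((offset, offset + length, name))
        offset += length
    return {sp: "".join(parts) for sp, parts in pieces.items()}, partitions


if __name__ == "__main__":
    demo = [("e1", {"A": "ACGT", "B": "AC-T"}), ("e2", {"A": "GG"})]
    seqs, parts = concatenate_elements(demo, ["A", "B"])
    print(seqs)
    print(parts)
```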
08_positive_selection
- 01_prep_paml.txt: Prepares input for PAML by first splitting coding alignments into 'reliable' and 'less reliable' sequences using MACSE. Then refines and exports the existing alignment for compatibility with PAML.
- 02_run_paml.txt: Calls the codeml control files to run PAML on coding sequence alignments
- 03_codeml_M0.ctl: Control file for running model M0 in codeml
- 04_codeml_M1.ctl: Control file for running model M1 in codeml
- 05_codeml_M2.ctl: Control file for running model M2 in codeml
- 06_codeml_bs_null.ctl: Control file for running the null branch-site model in codeml
- 07_codeml_bs_full.ctl: Control file for running the full branch-site model in codeml
- 08_get_paml_results.txt: Obtains and concatenates lnL values
- 09_paml_analysis.Rmd: Constructs likelihood ratio tests for PAML output
- 10_baliphy.txt: Tests for evidence of positive selection on protein-coding alignments using the branch-site model in BAli-Phy
- 11_get_concordance_factors.txt: Obtains site and gene concordance factors for the 'default' dataset using coding sequences, UTRs and introns.
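The likelihood ratio tests built from the codeml lnL values follow a standard recipe: twice the difference in log-likelihood between the alternative and null models is compared against a chi-squared distribution. The sketch below shows that calculation using only the Python standard library. It is not the authors' R notebook, and the degrees of freedom must be supplied by the user (2 is commonly used for a site-model comparison such as M1a vs M2a and 1 for the branch-site test, but this README does not state which values were used).

```python
import math
from typing import Tuple


def lrt_p_value(lnl_null: float, lnl_alt: float, df: int) -> Tuple[float, float]:
    """Return (test statistic, p-value) for a likelihood ratio test.

    The statistic is 2 * (lnL_alt - lnL_null), floored at zero. Closed-form
    chi-squared survival functions are used, so only df = 1 or 2 is supported.
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        p_value = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        p_value = math.exp(-statistic / 2.0)
    else:
        raise ValueError("this sketch only supports df = 1 or df = 2")
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical lnL values for illustration only.
    stat, p = lrt_p_value(lnl_null=-1000.0, lnl_alt=-997.0, df=2)
    print(f"2*deltaLnL = {stat:.3f}, p = {p:.4f}")
```

For other degrees of freedom, a statistics library's chi-squared survival function would replace the closed forms used here.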
We used newly sequenced and publicly available whole genome data to generate multi-genome alignments that allowed us to make phylogenetic, genomic, and proteomic comparisons between viviparous and oviparous species. Specifically, we extracted coding sequences, UTRs and introns from the multi-genome alignments, before re-aligning them to generate both species and time trees. We additionally extracted fourfold degenerate (4d) sites from the genome alignments to generate a neutral phylogenetic model, which was used to analyse differences in substitution rates. Finally, we aligned genomes to sequences of protein families to obtain data corresponding to protein family sizes.