Data for: The pace of mitochondrial molecular evolution varies with seasonal migration distance
Data files
Nov 17, 2023 version files 276.23 KB
- Additional_files_for_figure_generation.zip
- Coevol_inputs.zip
- migration-molecrates-main.zip
- README.md
Abstract
This repository contains data associated with "The pace of mitochondrial molecular evolution varies with seasonal migration distance." This study demonstrates relationships between traits (migration distance, mass, population genetic parameters representing genetic diversity) and molecular evolutionary rates (dS and dN/dS). The main conclusions are based on models from the program Coevol. The files included here are input files necessary to run models with Coevol, as well as selected Coevol output files used for figure generation associated with the manuscript. We also include a copy of a GitHub repository of manuscript code. Other necessary data for replicating analyses in the manuscript are included in the manuscript supplement.
README: Data for "The pace of mitochondrial molecular evolution varies with seasonal migration distance"
https://doi.org/10.5061/dryad.cvdncjt99
README written by Teresa M. Pegan
Description of the data and file structure
This dataset contains three folders of data and code associated with the article "The pace of mitochondrial molecular evolution varies with seasonal migration distance."
Methodological details
This paper conducted three sets of analyses on a community of 39 boreal-breeding small-bodied bird species.
- We used the program Coevol to assess correlations between life history traits and molecular evolutionary rates in a Bayesian phylogenetic framework. Coevol requires sequence data from a single representative per species; a dataset containing trait information for each species; and a phylogenetic tree. The associated manuscript describes in detail how we generated each Coevol input, but in brief:
-- Sequence data from complete mitochondrial coding sequences were assembled from low-coverage whole genome sequences using the program NOVOPlasty v4.3.1.
-- Trait data come from previously published resources, with the exception of population genetic summary statistics, described in the next section.
-- The phylogenetic tree was downloaded from birdtree.org.
Complete input files and code needed to run Coevol models are provided in this repository, as described below.
- We estimated population genomic summary statistics using complete mitochondrial coding sequences from a subset (30) of the study species. The summary statistics θ and πS were used as traits in Coevol models. The summary statistic πN/πS was used in linear modeling analyses described in the next section. Population-level datasets of mitochondrial coding sequences are hosted on GenBank and are not included here. The accession number for each mitochondrial coding sequence used in analysis is provided in Table S2 of the manuscript supplement. Code needed to generate these estimates is provided in this repository, as described below.
- We used linear modeling to test for correlations between θ, πN/πS, and migration distance (one of the traits also used in Coevol models). Migration distance and θ values are provided in the Coevol input files contained here, as well as in Table S1 of the manuscript supplement. πN/πS estimates are provided in Table S1 of the manuscript supplement. Code needed to run these linear models is provided in this repository, as described below.
Data and code contained in this repository
Folder 1: "migration-molecrates-main"
This folder contains code used to prepare data and conduct analyses described in the article. The folder is a copy of a GitHub repository. Detailed contents:
--File: README.md #### The readme file associated with the GitHub repository.
--Subfolder: Assemble_mitochondrial_sequences_from_short_reads
----File: 1.trim_short_reads.md #### code for trimming short read fastq files (raw DNA data from the sequencer)
----File: 2.Run_NOVOPlasty.md #### code for assembling mitochondrial contigs from short read fastq files
----File: 3.clean_novoplasty_fastas.R #### R script for renaming mitochondrial contigs
----File: 4.clean_Geneious_output.R #### R script for cleaning up fasta files produced by Geneious, as described in the manuscript. Geneious steps are performed interactively with a GUI. In brief, we loaded mitochondrial contigs into Geneious and used known mitochondrial gene sequences to annotate each coding gene, which we then exported as a fasta file. The exported fasta files can then be processed by 4.clean_Geneious_output.R. All resulting mitochondrial gene sequences are available on GenBank and their accession numbers are provided in Table S2 contained in the folder "Pegan_et_al_Bird_migration_and_molecular_evolution_supplement," described below.
--Subfolder: Coevol_analysis
----File: Coevol_commands.md #### code for running Coevol models. All data necessary to run Coevol models are contained in the folder "Coevol inputs," described below.
--Subfolder: Figure_preparation
----File: figure2_figure3_figureS2_Coevol.R #### R script for generating the manuscript's Figure 2, Figure 3, and Figure S2 from Coevol output. Data necessary to create these figures are contained in the folders "Additional files for figure generation" and "Pegan_et_al_Bird_migration_and_molecular_evolution_supplement," described below.
----File: figure4_linear_models.R #### R script for generating the manuscript's Figure 4 from the linear model results. This script must be run after the file Linear_modeling/Linear_modeling.R, described below.
----File: README.md #### some additional notes on the relationship between these scripts and the figures in the dataset.
--Subfolder: Linear_modeling
----File: Linear_modeling.R #### R script for performing linear model analyses in the manuscript. All data necessary to run this script are contained in the folders "Coevol inputs" and "Pegan_et_al_Bird_migration_and_molecular_evolution_supplement," described below.
--Subfolder: Summary_stats
----File: 1.estimate_piS_and_piNpiS.py #### Python script that calculates piS and piN from fasta files of mitochondrial gene sequences. All mitochondrial gene sequences are available on GenBank and their accession numbers are provided in Table S2 contained in the folder "Pegan_et_al_Bird_migration_and_molecular_evolution_supplement," described below.
----File: 2.TiTv_ratio.R #### R script that calculates the transition/transversion ratio for each species using fasta files of mitochondrial gene sequences. All mitochondrial gene sequences are available on GenBank and their accession numbers are provided in Table S2 contained in the folder "Pegan_et_al_Bird_migration_and_molecular_evolution_supplement," described below.
----File: 3.LAMARC_Thetas.md #### Code and notes on interaction with the LAMARC program, which is a GUI. These steps involve estimating the genetic diversity parameter theta from mitochondrial gene sequence data. All mitochondrial gene sequences are available on GenBank and their accession numbers are provided in Table S2 contained in the folder "Pegan_et_al_Bird_migration_and_molecular_evolution_supplement," described below.
--Subfolder: Supplement
----File: theta_piS_correlation.R #### R script providing code for a supplemental analysis of the relationship between theta and piS, two different estimators of genetic diversity. Data for this analysis are provided in Table S1 contained in the folder "Pegan_et_al_Bird_migration_and_molecular_evolution_supplement," described below.
----File: Tree_topology_file_prep.R #### R script showing code for a supplemental analysis that involved sampling marginal trees from a set of trees downloaded from birdtree.org
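For readers unfamiliar with the transition/transversion ratio computed by 2.TiTv_ratio.R (listed under Summary_stats above), the following is a minimal Python sketch of the underlying idea. It is not the authors' R script: the function name `ti_tv_counts` and the toy sequences are hypothetical, and the sketch compares only a single pair of aligned sequences, skipping gaps and ambiguous bases.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ti_tv_counts(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences.

    A transition is a purine<->purine or pyrimidine<->pyrimidine change;
    any other change between unambiguous bases is a transversion.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Skip identical sites, gaps, and ambiguity codes.
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# Toy sequences (made up for illustration, not taken from the dataset).
ti, tv = ti_tv_counts("ATGGCCTTA", "GTGACATTC")
ratio = ti / tv if tv else float("inf")
```

A per-species estimate would aggregate such counts over many sequence pairs or over inferred substitutions; how the authors do this is documented in their script.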
Folder 2: "Coevol inputs"
This folder contains input files for Coevol models using different subsets of data and combinations of traits. Coevol input files are labeled based on which subsets of data and parameters were used. Each Coevol model requires a set of aligned sequence data in the phylip (.phy) format; a set of trait data in text (.txt) format; and a phylogenetic tree (.tre). The following descriptions of files are organized by the name of the data/parameter subset.
--Files with "fullmt" in their filename
These files are input for a Coevol model that uses the full dataset of 39 species and does not include proxies of Ne as predictors
----File: fullmt.phy #### Sequence data in phylip format with one full mitochondrial coding sequence for each of 39 species
----File: fullmt.tre #### Phylogenetic tree with branches for each of 39 species
----File: fullmt.txt #### Trait data for the 39 species used in this model, provided in the format required by Coevol. The traits included here are species' mean mass (in grams) and estimated migration distance (in km). Mass was derived from publicly available resources (Dunning Bird Mass book, Birds of the World online) and migration distance was estimated using geographic range centroids from BirdLife International. Detailed methods for deriving these estimates from those data sources are provided in the manuscript associated with this repository.
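Before running a model, it can be useful to confirm that an alignment such as fullmt.phy contains the expected number of species. The following is a minimal Python sketch under the assumption that the file is in sequential (non-interleaved) relaxed PHYLIP format, with a header line giving the number of taxa and sites followed by one "name sequence" line per taxon. The function name `parse_phylip` and the toy alignment are hypothetical; check the actual file layout before relying on this.

```python
def parse_phylip(text):
    """Parse a sequential, relaxed PHYLIP alignment into {name: sequence}.

    Assumes one taxon per line: a name, whitespace, then the sequence.
    Raises ValueError if the content disagrees with the header counts.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(x) for x in lines[0].split())
    records = {}
    for ln in lines[1:]:
        name, seq = ln.split(None, 1)
        records[name] = "".join(seq.split())  # drop any internal spaces
    if len(records) != n_taxa:
        raise ValueError(f"header says {n_taxa} taxa, found {len(records)}")
    for name, seq in records.items():
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(seq)}")
    return records


# Toy alignment (made up for illustration, not taken from the dataset).
toy = """3 12
Sp_a ATGGCCTTAGGA
Sp_b ATGGCTTTAGGA
Sp_c ATGGCCTTGGGA
"""
alignment = parse_phylip(toy)
```

For the real file, one would read it with `open("fullmt.phy").read()` and compare `len(alignment)` against the 39 species described above, and compare the names against those in the trait and tree files.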
--Files with "NeRev" in their filename
These files are input for a Coevol model that uses the reduced subset of 30 species with available estimates of theta from LAMARC. Theta is used as a proxy for effective population size (Ne). The estimates of theta were revised during manuscript revision, hence the name "NeRev."
----File: NeRev.phy #### Sequence data in phylip format with one full mitochondrial coding sequence for each of 30 species
----File: NeRev.tre #### Phylogenetic tree with branches for each of 30 species
----File: NeRev.txt #### Trait data for the 30 species used in this model, provided in the format required by Coevol. The traits included here are species' mean mass (in grams), estimated migration distance (in km), and theta, which is an estimator for the average number of differences per base among DNA sequences within a population. Code for estimating theta from mitochondrial sequences using LAMARC is provided in this repository in the file migration-molecrates-main/Summary_stats/3.LAMARC_Thetas.md, above. See notes on other traits under 'Files with "fullmt" in their filename,' above.
--Files with "NePiS" in their filename
These files are input for a Coevol model that uses the reduced subset of 30 species with population genetic data, but uses piS as a proxy for Ne instead of theta.
----File: NePiS.phy #### Sequence data in phylip format with one full mitochondrial coding sequence for each of 30 species
----File: NePiS.tre #### Phylogenetic tree with branches for each of 30 species
----File: NePiS.txt #### Trait data for the 30 species used in this model, provided in the format required by Coevol. The traits included here are species' mean mass (in grams), estimated migration distance (in km), and piS, which is the average number of bases that differ between pairs of individuals in a population, considering only synonymous (evolutionarily neutral) polymorphisms. Code for estimating piS from mitochondrial sequences using Python is provided in this repository in the file migration-molecrates-main/Summary_stats/1.estimate_piS_and_piNpiS.py. See notes on other traits under 'Files with "fullmt" in their filename,' above.
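To make the pairwise-difference definition above concrete, the following is a minimal Python sketch of nucleotide diversity computed over all sites of a toy alignment. It is not the authors' 1.estimate_piS_and_piNpiS.py: it does not separate synonymous from nonsynonymous sites (which piS requires), it treats every character mismatch as a difference, and the function name `nucleotide_diversity` and the toy sequences are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean number of differences per site, averaged over all sequence pairs."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    total_diffs = 0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        total_diffs += sum(x != y for x, y in zip(a, b))
        n_pairs += 1
    # Divide by the number of pairs, then by alignment length (per-site value).
    return total_diffs / (n_pairs * length)


# Toy population sample (made up for illustration, not taken from the dataset).
toy = ["ATGGCCTTA", "ATGGCTTTA", "ATGGCCTTG"]
pi = nucleotide_diversity(toy)
```

Here the three pairs differ at 1, 1, and 2 sites, so the value is 4 differences over 3 pairs and 9 sites. Restricting the same calculation to synonymous changes and synonymous sites is the step that the authors' script handles.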
--Additional tree files (tree1.tre to tree10.tre)
We performed supplemental Coevol analyses testing for effects of tree topology variation by re-running the "fullmt" Coevol model with a set of 10 marginal phylogenetic trees subsampled from a tree set downloaded from birdtree.org. Code for subsampling trees tree1.tre through tree10.tre is provided in the file migration-molecrates-main/Supplement/Tree_topology_file_prep.R. The supplemental analysis with each of these marginal tree topologies used the fullmt.txt and fullmt.phy input files.
Folder 3: "Additional files for figure generation"
To facilitate easy reproduction of the figures in the associated manuscript, we provide two Coevol output files here. All other output files can be re-generated using the data and code in this repository.
----File: fullmt1.postmeansynrate.tre #### contains information about posterior mean estimated synonymous substitution rate across the 39 species in the study. File is provided in .tre format by Coevol. Relevant values are parsed from tip labels. Code in folder migration-molecrates-main can be used to generate the manuscript's Figure 2 and Figure S2 using these data.
----File: NeRevdsom1.postmeanomega.tre #### contains information about posterior mean estimated dN/dS across the subset of 30 species for which we have population genetic summary statistics. File is provided in .tre format by Coevol. Relevant values are parsed from tip labels. Code in migration-molecrates-main can be used to generate the manuscript's Figure 3 using these data.
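Since the relevant values in these two files are parsed from tip labels, a minimal Python sketch of that kind of parsing may help readers who do not use the authors' R code. The sketch assumes, hypothetically, that each tip label has the form `name_value` (a taxon name, an underscore, then the numeric estimate) inside a Newick string; inspect the actual .tre files to confirm the label layout before using it. The function name `tip_values` and the toy tree are also hypothetical.

```python
import re

# A tip label directly follows "(" or "," and precedes ":" in Newick;
# internal node annotations follow ")" and are therefore not matched.
TIP_PATTERN = re.compile(r"[(,]([^(),:;]+):")


def tip_values(newick):
    """Return {taxon_name: value} from tip labels assumed to look like name_value."""
    values = {}
    for label in TIP_PATTERN.findall(newick):
        # Split on the LAST underscore so names containing underscores survive.
        name, raw_value = label.rsplit("_", 1)
        values[name] = float(raw_value)
    return values


# Toy tree (made up for illustration, not taken from the dataset).
toy_tree = "((Sp_a_0.12:1.0,Sp_b_0.34:1.0):0.5,Sp_c_0.56:1.5);"
rates = tip_values(toy_tree)
```

The resulting dictionary can then be joined to the species-level trait table by taxon name for plotting.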
Folder 4: "Pegan_et_al_Bird_migration_and_molecular_evolution_supplement"
This folder contains the complete supplementary material for the associated manuscript.
----File: Supplementary material.pdf #### a pdf of Figures S1 and S2, Table S3, and all supplementary table captions
----File: Table_S1.csv #### metadata associated with each species analyzed in the study, including: species; taxonomic family; the specimen catalog number of the sample used in Coevol analyses; the GenBank accession number of the NOVOPlasty seed sequence used to assemble mitochondrial contigs (see migration-molecrates-main/Assemble_mitochondrial_sequences_from_short_reads/2.Run_NOVOPlasty.md); the GenBank accession number of mitochondrial coding sequence datasets used to annotate genes with Geneious (see migration-molecrates-main/Assemble_mitochondrial_sequences_from_short_reads/4.clean_Geneious_output.R); body mass in grams, representing species-level average estimates from publicly available datasets (Dunning Bird Mass book, Birds of the World online resource); migration distance in km, estimated as the distance between breeding and nonbreeding ranges from BirdLife International range maps; transition/transversion ratio as calculated in migration-molecrates-main/Summary_stats/2.TiTv_ratio.R; theta as estimated in migration-molecrates-main/Summary_stats/3.LAMARC_Thetas.md; the number of individuals in the population dataset; estimates of dS (and its upper and lower credible intervals) and dN/dS (and its upper and lower credible intervals) generated by Coevol models (see contents of migration-molecrates-main/Coevol_analysis); and estimates of piN/piS and piS calculated in migration-molecrates-main/Summary_stats/1.estimate_piS_and_piNpiS.py.
----File: Table_S2.csv #### metadata associated with each individual sample analyzed in the study, including: species; the institution hosting the voucher specimen of the sample (see table caption in Supplementary material.pdf for acronym definitions); the institutional catalog number of the specimen; the sex of the specimen; the date of collection; the location where the specimen was collected (state or province in the US or Canada, plus latitude/longitude coordinates); a binary variable indicating whether the sample was retained in analysis (vs filtered out); and the GenBank accession number for each of the 13 mitochondrial genes belonging to the specimen. Only retained samples (Retained=TRUE) have accession numbers.
----File: Table_S4.csv #### Full results of Coevol models using the "dsdn" option, which creates estimates of dS and dN. Each row is a flattened matrix of results. Each model was run twice and the "Replicated" column indicates the replicate from which the matrix was taken. Full details about the format of Coevol output matrices can be found in the program's documentation at https://github.com/bayesiancook/coevol/blob/master/coevol1.6.pdf. The first 12 rows (excluding header) show results from models using the "fullmt" input files (see notes under folder "Coevol inputs," above). The latter 12 rows show results from models using the "NeRev" input files (see notes under folder "Coevol inputs," above).
----File: Table_S5.csv #### Full results of Coevol models using the "dsom" option, which creates estimates of dS and dN/dS (also called "omega"). Each row is a flattened matrix of results. Each model was run twice and the "Replicated" column indicates the replicate from which the matrix was taken. Full details about the format of Coevol output matrices can be found in the program's documentation at https://github.com/bayesiancook/coevol/blob/master/coevol1.6.pdf. The first 12 rows (excluding header) show results from models using the "fullmt" input files (see notes under folder "Coevol inputs," above). The latter 12 rows show results from models using the "NeRev" input files (see notes under folder "Coevol inputs," above).
----File: Table_S6.csv #### Full results of Coevol models for supplemental analysis that uses piS as a proxy for Ne rather than theta. Each model was run twice and the "Replicated" column indicates the replicate from which the matrix was taken. Full details about the format of Coevol output matrices can be found in the program's documentation at https://github.com/bayesiancook/coevol/blob/master/coevol1.6.pdf. All results come from models using the "NePiS" input files (see notes under folder "Coevol inputs," above). The first 12 rows (excluding header) show results from models using the "dsdn" option and the latter 12 rows show results from models using the "dsom" option.
Sharing/Access information
All mitochondrial gene sequence data analyzed in this manuscript can be found on GenBank. Accession numbers are summarized in the manuscript's Table S2.
Methods
Coevol input files include sequence data, phylogenetic tree files, and text files of traits. Detailed methods about the collection and processing of the dataset are included in the manuscript and in the folder of code ("migration-molecrates-main"). Briefly, mitochondrial gene sequences were generated from samples that we sequenced with low-coverage genome sequencing; phylogenetic trees are from birdtree.org; trait data include per-species body mass estimates from publicly available sources and migration distance estimates from species' range maps; and trait data also include population genetic parameters estimated using population-level datasets of mitochondrial gene sequences (not included here). All mitochondrial gene sequences, including those for population-level analyses, are available on GenBank as described in Table S2 of this manuscript's supplement section.