Data from: Evolutionary and phylogenetic insights from a nuclear genome sequence of the extinct, giant subfossil koala lemur Megaladapis edwardsi
Data files
Oct 21, 2020 version files 8.82 GB
-
Mega_alns.zip
-
Mega_dNdS.zip
-
Mega_GOconv.zip
-
Mega_phylo.zip
-
README_Oct2020.txt
Jun 10, 2021 version files 8.82 GB
-
Mega_alns.zip
-
Mega_dNdS.zip
-
Mega_GOconv.zip
-
Mega_phylo.zip
-
README_May2021.txt
-
TableS5_0505.txt
Abstract
No endemic Madagascar animal with body mass >10 kg survived a relatively recent wave of extinction on the island. From morphological and isotopic analyses of skeletal ‘subfossil’ remains we can reconstruct some of the biology and behavioral ecology of giant lemurs (primates; up to ~160 kg), elephant birds (up to ~860 kg), and other extraordinary Malagasy megafauna that survived well into the past millennium. Yet much about the evolutionary biology of these now extinct species remains unknown, along with persistent phylogenetic uncertainty in some cases. Thankfully, despite the challenges of DNA preservation in tropical and sub-tropical environments, technical advances have enabled the recovery of ancient DNA from some Malagasy subfossil specimens. Here we present a nuclear genome sequence (~2X coverage) for one of the largest extinct lemurs, the koala lemur Megaladapis edwardsi (~85kg). To support the testing of key phylogenetic and evolutionary hypotheses we also generated new high-coverage complete nuclear genomes for two extant lemur species, Eulemur rufifrons and Lepilemur mustelinus, and we aligned these sequences with previously published genomes for three other extant lemur species and 47 non-lemur vertebrates. Our phylogenetic results confirm that Megaladapis is most closely related to the extant Lemuridae (typified in our analysis by E. rufifrons) to the exclusion of L. mustelinus, which contradicts morphology-based phylogenies. Our evolutionary analyses identified significant convergent evolution between M. edwardsi and extant folivorous primates (colobine monkeys) and ungulate herbivores (horses) in genes encoding protein products that function in the biodegradation of plant toxins and nutrient absorption. These results suggest that koala lemurs were highly adapted to a leaf-based diet, which may also explain their convergent craniodental morphology with the small-bodied folivore Lepilemur.
Usage notes
1. Exon alignments
Megaladapis edwardsi 2X coverage exon mapped data (to the hard-masked hg19 FASTA) was generated in two ways: 1) damage-masked where all M. edwardsi sequence reads potentially affected by cytosine deamination (5’ T residues and 3’ A residues) within 9nt of fragment ends were hard-masked (i.e., replaced with ‘N’) sites, and 2) no damage-masking, where all such sites are left in. Both types of data sets are included in this folder.
For M. edwardsi, positional pileup files (mpileup; “*.masked9nt.pileup” and “*.pileup”) are included (generated from the compiled lastZ alignment output, “UA5180.remap.whole.lastz.txt”) using SAMtools pileup (default settings). For extant lemurs (D. madagascariensis, E. rufifrons, L. mustelinus, P. diadema) and colobine (R. roxellana), BAM files for each as mapped to the hard-masked hg19 FASTA file are included (*.bam), as well as the subsequent mpileup files (*remap.noMask.q1.mpileup) generated the same way as M. edwardsi.
The damage-masked and damage unmasked 2X coverage data sets are the summarized exon sequences matching the hg19 “known canonical” gene set excluding masked sites, spliced exons to match the full transcripts, and added our M. edwardsi, extant lemurs, and colobine monkey gene sequences with the remaining 46 sequences (UCSC vertebrate alignment), resulting in a 53-way multi species-alignment that we used in our analyses (Mega2XMasked.tar.gz, Mega2XUnmasked.tar.gz).
- Mega_alns.zip
bams: Contains extant lemurs and colobine BAM files (*.bam) mapped to the hard-masked human exon reference, M. edwardsi compiled lastZ alignment (*lastz.txt), and hg19 exon reference (*.fasta)
mpileups: Contains positional pileup files for M. edwardsi (“*.masked9nt.pileup” and “*.pileup”) and extant lemurs and colobine (*remap.noMask.q1.mpileup)
exon: Contains 2X coverage summarized exon sequence data sets Mega2XMasked.tar.gz and Mega2XUnmasked.tar.gz
2. Phylogenetic analysis
This folder contains the input and output files for the maximum likelihood phylogenetic analyses for estimating the placement of M. edwardsi. A single unrooted phylogeny was generated from a concatenated gene alignment of 15 primate species (2X damage-masked, n=896 loci) using RAxML without partitioning (GTR GAMMA model). Independent phylogenies (“gene trees”) were estimated across sequences with at least 20% of sites covered (n=12,809) using the same GTR GAMMA model with 100 bootstrap replicates for each gene tree. Internode certainty (IC) was also calculated across the concatenated gene tree among bootstrap consensus gene trees with at least 90% mean bootstrap consensus support.
- Mega_phylo.zip
The folder is organized into the following:
alns: Contains .phy files for all gene alignments used in gene tree phylogeny estimation (Megaladapis represented at >= 20% of sites).
geneTrees: Contains complete RAxML output for all gene tree estimation. Run logs and starting commands are found in RAxML_info.*
ConcatRaxml: Contains the concatenated alignment of 896 genes used for concatenated phylogeny estimation where Megaladapis is represented at >= 50% of sites, and RAxML output from phylogeny estimation.
IC: Contain RAxML output from IC/TC calculation using gene trees.
3. dN/dS ratio analysis
We used the codeml function of PAML (Yang, 2007) to estimate lineage-specific dN/dS ratios across the phylogeny of the six lemurs in our study plus three non-lemur primates (Homo sapiens, Gorilla gorilla, and Pan troglodytes). In our damage-masked 2X sequence coverage data, we restricted analysis to the set of n=3,342 genes with i) ≥100 intact codons, ii) ≥100 N sites present and aligned across all nine species in this analysis, and iii) Megaladapis lineage dS values not more than 2 s.d. greater than the genome-wide average dS value. dN/dS ratios were calculated in two ways: first based on the synonymous substitution rate for an individual gene (dN/dS), and second, a dN/dSgenome ratio based on the genome-wide estimate of dS. The genome-wide estimate is calculated from the total number of synonymous substitutions across all genes divided by the total number of synonymous sites genome-wide.
The data consist of aligned, masked set of sequences in .nuc format for 7,503 loci from the nine primate species for which there were more than 100 comparable codons. For parallelization, the 7,503 loci that met these criteria were processed using the TORQUE package managers scripts, the template specifications for which is found in the file named "qsubheader.qsub".
Supplementary tables from the functional enrichment analysis using g:Profiler (https://biit.cs.ut.ee/gprofiler/gost) to identify any functional category enrichment among the set of genes with Megaladapis edwardsi lineage dN/dSgenome values >1.5, using all 3,342 genes are also included (“gProfiler_tables.xlsx”).
- Mega_dNdS.zip
PAML_seqs_and_outputs.tar.gz: Sequence files are distributed into distributed into 123 directories within this folder, each containing 61 aligned sequences in a folder entitled "nuc.###/" and the outputs of the PAML analyses in the folder entitled "out/" (NOTE: the numbers after the "." in "nuc.###" do not correspond to the number of their parent folder; this is a product of the loop that generated them, and should be ignored). In the "1/" subfolder, "sult1c2/" and "GHR_free/" contain the PAML analyses of these loci of interest; the output for the 61 genes used in the broader analysis is in this folder contained in a directory entitled "saveout/".
In addition to the sequence files, the nested structure used for the analyses have been preserved with the requisite files necessary for running PAML. Though this means that there are a great number of duplicates (all directories intitled "dat/", for instance), this should enable you to recreate the analyses. For simpler access to the input sequences and output data, we have also provided a structure-less version of both datasets. The daughter files of the PAML analyses have been gathered in a top-level directory within this folder, entitled "Output_dnDS/"; the original masked .nuc formatted sequences are in the directory entitled "seqs/".
This folder contains the final output file (PAML_Final_dNdS_100.txt) used in the analyses. The columns are as follows: “Genes” are the UCSC gene names; "branch" indicates the species compared in numeric format based on the tree.tree; "t" is the number of nucleotide substitutions per codon; "N" is the number of non-synonymous sites; "S" is the number of synonymous sites; "dN/dS" is the ratio of non-synonymous to synonymous substitutions among sites; "dN" is the number of non-synonymous substitutions per non-synonymous site; "dS" is the number of synonymous substitutions per synonymous site; "N*dN" is the genome-wide rate of non-synonymous substitutions across all genes; "N*dS" is the genome-wide rate of synonymous substitutions across all genes.
Also in this folder is "gProfiler_tables.xlsx": contains the output tables from the functional enrichment analysis using g:Profiler. The columns are: “source” is the GO database queried; “term_name” is the functional GO term category; “term_id” is the GO term identifier; “adjusted_p_value” is false discovery value for comparisons; “negative_log10_of_adjusted_p_value” is the enrichment score calculated from the false discovery rate; “term_size” is the number of genes in the GO term category; “query_size” is the number of genes queried in the test data; “intersection_size” lists the number of genes that intersect the query and functional category; “effective_domain_size” is the total number of genes used; “intersections” lists the genes that intersect the query and functional category. Separate tabs of the results are included for M. edwardsi, E. rufifrons, L. mustelinus, M. murinus, P. diadema, and D. madagascariensis.
4. Genomic convergence analysis
The multi-species (n=53) gene sequence alignments (for n=21,520 genes) was translated into amino acid sequences. We then queried each possible comparison with M. edwardsi (using ETE3 to navigate the tree), recording the numbers of analyzable amino acid sites and convergent amino acids for each gene.
This folder contains the output files from the genomic convergence analysis: “MegaALL_no_threshold.csv” contains all unfiltered comparisons of species with M. edwardsi (no FDR or minimum convergence thresholds) and “BH_corr_threshold.xlsx” contains all species comparisons filtered by FDR <0.05 and a minimum of 5 convergence events.
- Mega_GOconv.zip
Mega_GOconv: contains two files, 1) “BH_corr_comparisons.xlsx” (filtered for FDR and 5 convergence events) contains the convergence data split into three separate tabs: “Mega_micMur” representing comparisons with M. murinus as the outgroup, “Mega_Pdiadema” representing comparisons with P. diadema as the outgroup, and “All_other_comps” representing comparisons with all other species besides M. edwardsi. The columns in each tab are as follows: “Species1(or group1),OUT1,OUT1” represents M. edwardsi, E. rufifrons sister species to M. edwardsi (OUT1) and a member of the outgroup clade (e.g., M. murinus or P. diadema); “Species2(or group2),OUT2,OUT2” represents the distance comparison species/clade with a sister species and outgroup species (or randomly selected members of those clades); “GO term” is the functional GO category; “Converged in GO term” is the total number of convergent and analyzable amino acids represented in the GO term; “#AA convergence possible in GO term” is the number of amino acids for which convergence is expected in the GO term; “#AA converged in genome” is the number of convergent sites identified genome-wide; “#AA for which convergence is possible” is the number of convergent amino acids expected genome-wide; “Odds ratio” is the strength of the association between the GO term and number of convergent sites; “P-val” is the probability of association between the GO term and number of convergent sites; “BH-corrected” is the false discovery rate, correcting for multiple tests; “Genes(AA number of convergence)” are the genes for which convergent amino acids are identified and the position of those amino acids in the gene.
The second file, 2) “MegaALL_no_threshold.csv” (all unfiltered comparisons with M. edwardsi) contains the columns: “Medwardsi” is the primary species comparison; “Outgroup1” is the sister species E. rufifrons; “Outgroup1$Conv.comp” is the outgroup clade (M. murinus or P. diadema) as well as the comparative distant species/clade; “Outgroup2” is the sister species or randomly selected species; “Outgroup2$Conv.path” is the outgroup species as well as the identified GO pathway; “Count” is the total number of convergent and analyzable amino acids in the GO term; “AA.conv.possible.GO” is the number of amino acids for which convergence is expected in the GO term; “AA.conv.genome” is the number of convergent sites identified genome-wide; “AA.conv.possible” is the number of convergent amino acids expected genome-wide; “Odds.ratio” is the strength of the association between the GO term and number of convergent sites; “Pval” is the probability of association between the GO term and number of convergent sites; “Genes(AA number of convergence)” are the genes for which convergent amino acids are identified and the position of those amino acids in the gene; “BH_corr” is the false discovery rate, correcting for multiple tests.
5. Post-hoc genomic convergence analysis
A post-hoc analysis was conducted that focused specifically on the hydrolase activity and brush border GO terms. The numbers of analyzable amino acid sites and convergent amino acids were counted for each possible species comparison across the entire tree (n=53) for a total of 1,855 comparisons that excluded Megaladapis.
5. TableS5_0505.txt
"TableS5_0505.txt" contains the post-hoc results with the following columns: "#Amino acids convergent in GO term" is s the total number of convergent and analyzable amino acids represented in the GO term; "#Amino acids convergent genome wide" is s the number of convergent sites identified genome-wide; "#Amino acid convergent sites possible in GO term" is the number of amino acids for which convergence is expected in the GO term; "#Amino acid convergent sites possible whole genome" is the number of convergent amino acids expected genome-wide; "#Amino acid convergent sites genome-wide except GO term" is the number of amino acids convergent genome-wide excluding those in the GO term (column 1 and column 2); "#Amino acid convergent sites whole genome possible except GO term" is the number of amino acids convergent in the whole genome excluding the GO term (column 3 and column 4); "Proportion of genome-wide convergent sites to genome-wide possible sites" is based on the counts from column 2 and column 4; "Proportion of convergent sites convergent in the GO term to sites possible in the GO term" is based on the counts from column 1 and column 3; "Odds.ratio" is the strength of the association between the GO term and number of convergent sites; "Pval" is the probability of association between the GO term and number of convergent sites; "Species1(or group1),OUT1,OUT1" and "Species2(or group2),OUT2,OUT2" represent the comparison between Species 1 and Species 2 with respective outgroups; "Fisher.pval" is the p-value from the Fisher's exact test; "GO pathway" is the data from the hydrolase activity or brush border species comparisons.