Emergence and radiation of distemper viruses in terrestrial and marine mammals - Input files, bash and R codes for analysing PDV and CDV sequence data
Stokholm, Iben et al. (2021), Emergence and radiation of distemper viruses in terrestrial and marine mammals - Input files, bash and R codes for analysing PDV and CDV sequence data, Dryad, Dataset, https://doi.org/10.5061/dryad.fxpnvx0sq
Canine distemper virus (CDV) and phocine distemper virus (PDV) are major pathogens of terrestrial and marine mammals. Yet little is known about the timing and geographical origin of distemper viruses, or about the extent to which their emergence was influenced by environmental change and human activities. To address this, we i) performed the first comprehensive time-calibrated phylogenetic analysis of the two distemper viruses; ii) mapped distemper antibody and virus detection data from marine mammals collected between 1972 and 2018; and iii) compiled historical reports on distemper dating back to the 18th century. We find that CDV and PDV diverged in the early 17th century. Modern CDV strains last shared a common ancestor in the 19th century, with a marked radiation during the 1930s-50s. Modern PDV strains are of more recent origin, diverging in the 1970s-80s. Based on the compiled information on distemper distribution, the diverse host range of CDV and the basal phylogenetic placement of terrestrial morbilliviruses, we hypothesize that a terrestrial CDV-like ancestor gave rise to PDV in the North Atlantic. Moreover, given the estimated timing of distemper origin and radiation, we hypothesize a prominent role in distemper virus evolution for environmental change, such as the Little Ice Age, and for human activities such as globalisation and war.
The data sets were created by compiling recently published H gene PDV sequences and CDV sequence data obtained from NCBI.
The first data set (Alignment 1) consists of 446 near-complete H gene sequences (25 PDV + 421 CDV sequences; 1,668 bp; position 7,199-8,866 in NC_001921), comprising the majority of distemper H gene sequences available in GenBank at the time of analysis (July 2020).
The second data set (Alignment 2) represents the full sequences used in the Bayesian phylogenetic analyses. The alignment consists of 125 full-length H gene sequences (25 PDV and 100 CDV; 1,812 bp; position 7,079-8,890 in NC_001921), representing major PDV and CDV clades in terrestrial and marine mammals detected between 1982 and 2018.
Both data sets were imported and edited in Geneious version 9.1.8. The alignments were generated using MUSCLE, and sequences were edited to the same reading frame and length, excluding stop codons. Sequences obtained in Stokholm et al. (2019) and Puryear et al. (in review) are not yet publicly available, but they have been submitted to GenBank and will be released under the accession numbers OK104948-91 and MW581015-26.
The third data set (Alignment 3) was used for the final Bayesian phylogenetic analyses. It consists of 125 hemagglutinin (H) gene sequences (25 PDV sequences and 100 CDV sequences) without the 3rd codon positions (1,208 bp). The 3rd codon positions were removed because substitution saturation was detected.
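The removal of 3rd codon positions can be illustrated with a minimal bash sketch. It is not the script included in this deposit; the function name strip_third_codon_positions is hypothetical, and the sketch assumes an aligned fasta file in which every sequence starts at codon position 1. Under that assumption a 1,812 bp sequence is reduced to 1,208 bp (two thirds of its length).

```shell
#!/usr/bin/env bash
# Sketch only: drop every third alignment column from a fasta file.
# Assumption: all sequences are aligned and start in reading frame 1.
# Sequences wrapped over several lines are joined before filtering.
strip_third_codon_positions() {
  awk '
    # Print the stored record with positions 3, 6, 9, ... removed.
    function flush_seq(   i, out) {
      if (header == "") return
      out = ""
      for (i = 1; i <= length(seq); i++)
        if (i % 3 != 0) out = out substr(seq, i, 1)
      print header
      print out
    }
    /^>/ { flush_seq(); header = $0; seq = ""; next }  # new record starts
    { seq = seq $0 }                                   # accumulate sequence
    END { flush_seq() }                                # emit last record
  ' "$1"
}
```

The function writes to standard output, so a call such as strip_third_codon_positions input.fasta > output.fasta (file names chosen here only for illustration) produces the reduced alignment.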
The bash codes included here were used for the following tasks:
- Edit the input sequence names in the fasta file.
- Edit the fasta files to exclude the third codon positions.
- Submit many Slurm jobs in parallel on a remote server, so that BEAST v.2.6.3 runs on multiple xml files at once.
- Edit the xml files created in BEAUti so that stepping-stone (or path sampling) analysis can be performed to obtain a marginal likelihood value for each xml file.
- Collect the marginal likelihood values generated from each individual analysis.
- Start multiple Slurm jobs running TreeAnnotator in parallel to obtain consensus trees.
- Collect all consensus trees.
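The parallel submission step in the list above can be sketched as one Slurm job per xml file. This is an illustration under stated assumptions, not the deposited script: the function name submit_beast_jobs and the DRY_RUN variable are hypothetical, a launcher called beast is assumed to be on the PATH of the compute nodes, and no resource requests (time, memory, partition) are set because these depend on the server.

```shell
#!/usr/bin/env bash
# Sketch only: submit one Slurm job per BEAST xml file in a directory.
# Assumptions: "beast" is available on the compute nodes, and paths
# contain no spaces. With DRY_RUN=1 (the default here) the commands are
# only printed, so the loop can be checked before anything is submitted.
submit_beast_jobs() {
  local xml_dir=$1 xml name
  for xml in "$xml_dir"/*.xml; do
    [ -e "$xml" ] || continue           # directory has no xml files
    name=$(basename "$xml" .xml)        # job name taken from the file name
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "sbatch --job-name=$name --wrap=\"beast $xml\""
    else
      sbatch --job-name="$name" --wrap="beast $xml"
    fi
  done
}
```

Running submit_beast_jobs on a directory of xml files prints the sbatch commands; setting DRY_RUN=0 would submit them. The same loop structure would apply to the TreeAnnotator step, with the wrapped command replaced accordingly.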
The R code included here ("Rcode_for_making_Supplementary Figure_3_2021mar.R") was used to generate Supplementary Figure 3 from "marginal_lkhoods_tmp08.txt". The code can be run in R v.4.0.2. It plots a summary of the tree age estimates and 95 % HPD intervals for CDV, PDV and CDV/PDV sequences from BEAST analyses using different setups. The tree ages are obtained from the consensus trees generated by the BEAST analysis of each xml file.
Please note that all these pieces of code have been set up to run on a specific remote server. They will need to be adjusted to run elsewhere.
Velux Fonden, Award: 123012
Innovationsfonden, Award: 6180-00001B and 6180-00002B
Forschungszentrum Jülich GmbH, German Federal Ministry of Education and Research, Award: FKZ 03F0767A
Academy of Finland, Award: 311966
Stiftelsen för Miljöstrategisk Forskning