The species coalescent indicates possible bat and pangolin origins of the COVID-19 pandemic
Data files
Apr 26, 2023 version files 1.44 GB
Abstract
A consensus species tree is reconstructed from 11 gene trees for human, bat, and pangolin beta coronaviruses from samples taken early in the pandemic (prior to April 1, 2020). Using coalescent theory, the shallow consensus species tree (short branches relative to the hosts) provides evidence of recent gene flow events between bat and pangolin beta coronaviruses predating the zoonotic transfer to humans. The consensus species tree was also used to reconstruct the ancestral sequence of human SARS-CoV-2, which differed from the Wuhan sequence by 2 nucleotides. The time to the most recent common ancestor was estimated to be Dec 8, 2019, with a bat origin. Some human, bat, and pangolin coronavirus lineages found in China are phylogenetically distinct, a rare example of a class II phylogeography pattern (Avise et al. in Ann Rev Eco Syst 18:489–522, 1987). The consensus species tree is a product of evolutionary factors, providing evidence of repeated zoonotic transfers between bats and pangolins as a reservoir for future zoonotic transfers to humans.
Methods
Archived, closely related Beta-CoV genome sequences were accessed from the NCBI GenBank “nucleotide” database. Nucleotide sequences for each gene were extracted based on the available gene annotation information. Five additional MERS-CoV sequences were collected and assembled using IRMA (iterative refinement meta-assembler) (32) with a custom-built coronavirus module. The virus annotation data are detailed in Supplementary data table S1; some sequences contain only one or a subset of the genes in SARS-CoV-2. These sequences were collected and analyzed together with the Beta-CoV genomes accessed from GISAID. After removing duplicate records, our final analysis is based on 5249 sequences: 5209 isolates from humans, 19 isolates from pangolins, 16 isolates from bats, and 5 MERS-CoV isolates. Sequence datasets were constructed for each of the 11 genes in SARS-CoV-2 and for the full genome.
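The extraction, de-duplication, and tallying steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout (accession, host, sequence), the use of the accession as the duplicate key, the 1-based inclusive gene coordinates, and all function names and toy sequences are assumptions made for illustration only.

```python
from collections import Counter


def extract_gene(genome, start, end):
    """Return the gene at 1-based, inclusive coordinates [start, end].

    Assumption: annotations give coordinates in this convention.
    """
    return genome[start - 1:end]


def deduplicate(records):
    """Keep the first record seen for each accession.

    Assumption: duplicates are identified by accession; the source does
    not state which criterion was actually used.
    """
    seen = set()
    unique = []
    for accession, host, sequence in records:
        if accession in seen:
            continue
        seen.add(accession)
        unique.append((accession, host, sequence))
    return unique


def count_by_host(records):
    """Tally the number of isolates per host label."""
    return Counter(host for _, host, _ in records)


# Toy records (hypothetical accessions and sequences, not real data).
records = [
    ("ACC1", "human", "ATGGCTAAATTT"),
    ("ACC2", "bat", "ATGGCTAAGTTT"),
    ("ACC1", "human", "ATGGCTAAATTT"),  # duplicate of the first record
    ("ACC3", "pangolin", "ATGGCAAAATTT"),
]

unique = deduplicate(records)
counts = count_by_host(unique)
gene = extract_gene(unique[0][2], 4, 9)

# Consistency check on the totals reported in the Methods text.
reported_total = 5209 + 19 + 16 + 5
```

With real data, the same tally over host labels would be expected to reproduce the per-host counts reported above, which sum to 5249.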