The power of coalescent methods for inferring recent and ancient gene flow in endangered Bactrian camels
Data files
Feb 27, 2025 version files (37.50 MB)
- 2024Zhu-camel-data.tgz (37.49 MB)
- README.md (3.32 KB)
Abstract
Genomic sequence data harbour valuable information concerning the history of species divergence and interspecific gene flow, and may offer important insights into conservation of endangered species. However, extracting such information from genomic data requires powerful statistical inference methods. A recent analysis of genomic sequence data found little evidence for gene flow from domestic Bactrian camels into the endangered wild Bactrian species. Nevertheless, the methods used to infer gene flow are based on data summaries and lack the power and precision to represent the complex phylogenetic history of the species with gene flow. Here we apply newly developed Bayesian methods to genomic sequence data to test for both recent and ancient gene flow among the three species in the genus Camelus, and to estimate the strength and timing of gene flow. We detect a strong signal of gene flow from domestic into wild Bactrian camels, confirming earlier evidence based on mitochondrial DNA and the Y chromosome. Overall gene flow appears to affect the autosomal genome uniformly, with similar effective rates of gene flow for exonic and noncoding regions. Estimation of species divergence times is seriously affected if gene flow is not accommodated in the analysis. Our results highlight the power of the coalescent model in analysis of genomic data and the utility of the coding as well as noncoding parts of the genome in elucidating the evolutionary history of modern species.
https://doi.org/10.5061/dryad.3xsj3txrk
Description of the data and file structure
File: 2024Zhu-camel-data.tgz
Archive containing files that are described in detail below.
exon-data
We used the genome annotation of C. ferus (GCF_000311805.1) to identify known exons and included all exons except those shorter than 100 bp, giving a total of 120,720 exons (loci). Each exon was treated as an independent locus in the multispecies coalescent (MSC) model. The raw data for the 120,720 exon loci are divided into 13 partitions, named CDS_fil.seq.aa to CDS_fil.seq.am. To ensure the convergence of numerical calculations, we further separated them into 49 subsets, each of 2500 loci, named seq-01 to seq-49.
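The subsetting described above can be illustrated with a minimal Python sketch. It assumes, without confirmation from the archive, that each locus in a sequence file starts with a PHYLIP-style header line containing two integers (number of sequences and number of sites); the function names are hypothetical. Note that 120,720 is not a multiple of 2500, so under a simple sequential split the final subset would hold the remaining 720 loci. How the authors handled the remainder is not stated here.

```python
import math
import re

# Assumption: a locus begins with a header line of exactly two integers,
# e.g. "6 1000" (number of sequences, number of sites).
HEADER = re.compile(r"^\s*\d+\s+\d+\s*$")

N_EXONS = 120720      # number of exon loci reported above
SUBSET_SIZE = 2500    # loci per subset reported above


def split_loci(text):
    """Split multi-locus alignment text into a list of per-locus blocks."""
    loci, current = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines between loci
        if HEADER.match(line) and current:
            loci.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        loci.append("\n".join(current))
    return loci


def chunk(loci, size=SUBSET_SIZE):
    """Group loci sequentially into subsets of at most `size` loci."""
    return [loci[i:i + size] for i in range(0, len(loci), size)]


# Number of subsets needed to cover all loci, and the size of the last one.
n_subsets = math.ceil(N_EXONS / SUBSET_SIZE)   # 49
remainder = N_EXONS % SUBSET_SIZE              # 720
```

Each subset returned by `chunk` could then be written to its own file (for example by joining the blocks with blank lines) and analysed separately.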
mitochondrion data
We assembled an alignment of whole mitochondrial genomes for 15 domestic Bactrian camels (C. bactrianus), 4 wild Bactrian camels (C. ferus), and 11 dromedaries (C. dromedarius). We also included two domestic South American species: one Vicugna pacos and one Lama glama. In the BPP analysis under the MSC, we treated the whole mitochondrial genome as one locus because all sites in the genome share the same genealogical tree. The alignment is in camel-mt.txt.
We inferred the maximum-likelihood tree using RAxML-NG; the tree file is camel-mt-raxml.tre.
The map file and control file needed for BPP analysis of the mitochondrial data are camel-mt-imap.txt and bpp-msc-r1.ctl.
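As a quick sanity check on a map file, the following Python sketch counts how many individuals are assigned to each species. It assumes the usual BPP map-file layout of two whitespace-separated columns per line (individual label, then species label); the labels in the example are hypothetical and are not taken from camel-mt-imap.txt.

```python
from collections import Counter


def read_imap(text):
    """Parse map-file text into a dict {individual: species}."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        individual, species = fields[0], fields[1]
        mapping[individual] = species
    return mapping


# Hypothetical example content; real labels will differ.
example = """\
ind1 bactrianus
ind2 bactrianus
ind3 ferus
ind4 dromedarius
"""

counts = Counter(read_imap(example).values())
```

For the real file, the text would be read with `open("camel-mt-imap.txt").read()` and the per-species counts compared with the sample sizes listed above.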
non-coding
The dataset data1.txt consists of 10,000 non-coding genomic segments (referred to as loci), each 1 kb long, with gaps of at least 30 kb between adjacent loci.
We further separated the 10,000 non-coding loci into four random subsets of 2500 loci each; the corresponding files are data1.quarter1.txt, data1.quarter2.txt, data1.quarter3.txt and data1.quarter4.txt.
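A random split of this kind can be sketched in Python as follows. This is only an illustration: the shuffling method, the seed, and the function name are assumptions, and the procedure actually used to create the four quarter files is not described here.

```python
import random


def random_partition(items, n_parts, seed=0):
    """Shuffle items and deal them into n_parts disjoint subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Taking every n_parts-th element yields subsets of (near-)equal size.
    return [shuffled[i::n_parts] for i in range(n_parts)]


# 10,000 locus indices dealt into four subsets of 2500 each.
quarters = random_partition(range(10000), 4)
```

Here the items are locus indices; in practice they would be the per-locus alignment blocks read from data1.txt.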
other-files
This folder includes four files that are needed for BPP analysis of the non-coding and exon data. (The corresponding files for the mitochondrial data are located in the folder mitochondrion data.)
bpp-msc.ctl: control file for running the analysis under the MSC model, which assumes no gene flow.
bpp-msci-6rates.ctl: control file for running the analysis under the MSC-I model with the six introgression events shown in Fig. 1a. To run analyses with other numbers of introgression rates (such as models I5, I4 and I4-final), the Newick tree specified for the ‘species&tree’ option must be edited accordingly.
bpp-mscm-6rates.ctl: control file for running the analysis under the MSC-M model, which incorporates the six migration events depicted in Fig. 1a. To run an analysis with a different number of migration events, edit the ‘migration’ option, including the number and the directions of the migration events.
camel-imap-3s.txt: map file required for BPP analysis.
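Because each data subset is analysed separately, a control file has to point at the right sequence file for each run. The Python sketch below shows one way to derive per-subset control files from a template by rewriting single-line options. It assumes that the control files use `key = value` lines and that `seqfile` and `nloci` are the relevant option names; the template text in the example is hypothetical and is not copied from the files in this folder. Multi-line options such as ‘species&tree’ are not handled.

```python
import re


def set_option(ctl_text, key, value):
    """Return control-file text with the single-line option `key` set to `value`.

    Anything after the '=' on that line (including trailing comments)
    is replaced. Raises KeyError if the option is absent.
    """
    pattern = re.compile(rf"^(\s*{re.escape(key)}\s*=\s*).*$", re.MULTILINE)
    new_text, n = pattern.subn(lambda m: m.group(1) + str(value),
                               ctl_text, count=1)
    if n == 0:
        raise KeyError(f"option {key!r} not found")
    return new_text


# Hypothetical template; the real control files contain more options.
template = """\
seqfile = data1.txt
Imapfile = camel-imap-3s.txt
nloci = 10000
"""

ctl = set_option(template, "seqfile", "data1.quarter1.txt")
ctl = set_option(ctl, "nloci", 2500)
```

Output-related options would also need distinct values for each run so that results are not overwritten. The edited file would then typically be run with a command along the lines of `bpp --cfile <control file>`; consult the BPP documentation for the version in use.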
The data were generated from the genomic sequence data of Ming et al. (2020).
## Sharing/Access information
The sequence alignment files and BPP control files are available in the following archive:
http://abacus.gene.ucl.ac.uk/ziheng/data/2024Zhu-camel-data.tgz
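For convenience, the archive can be fetched and unpacked with the Python standard library. The sketch below is a minimal example under stated assumptions: the helper names and the default output paths are hypothetical, and the layout of the unpacked folders is not guaranteed to match the section names used in this README.

```python
import tarfile
import urllib.request
from pathlib import Path

ARCHIVE_URL = "http://abacus.gene.ucl.ac.uk/ziheng/data/2024Zhu-camel-data.tgz"


def download(url=ARCHIVE_URL, dest="2024Zhu-camel-data.tgz"):
    """Download the archive to `dest` and return its path."""
    path, _ = urllib.request.urlretrieve(url, dest)
    return Path(path)


def unpack(archive, out_dir="."):
    """Extract the archive into out_dir and return the member names."""
    with tarfile.open(archive, "r:*") as tar:  # auto-detect compression
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(out_dir, filter="data")
        else:
            tar.extractall(out_dir)
    return names
```

A typical use would be `unpack(download(), "camel-data")`, after which the returned list of names can be inspected to locate the sequence, map, and control files.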
## Code/Software
The analyses in the paper were performed with the program BPP, which is available at:
http://abacus.gene.ucl.ac.uk/software/#bpp-bayesian-analysis-of-genomic-sequence-data-under-the-multispecies-coalescent-model
Ming, L. et al., 2020. Whole-genome sequencing of 128 camels across Asia reveals origin and migration of domestic Bactrian camels. Commun Biol. 3, 1.
