Unveiling the impacts of land use on the phylogeography of zoonotic New World Hantaviruses
Data files
Feb 22, 2024 version files 14.42 MB
- AIC_tree.nwk
- BLAST_Nprot.csv
- hosts_america_PH.csv
- Nprot_MaxAlign.fas
- README.md
- taxa_in_network_NWH.zip
Abstract
Billions of genomic sequences are stored in public repositories (NCBI), as are records of species occurrence (GBIF). By implementing analytical tools from different scientific disciplines, data mining on these databases can be a source of information to aid in the global surveillance of zoonotic pathogens that circulate among wildlife. We illustrate this by investigating the hantavirus-rodent system in the Americas, i.e. New World Hantaviruses (NWH). First, we traced the circulation of pathogenic NWH among rodents by inferring the phylogenetic links among 278 genomic samples of the S segment (N protein) of NWH found in 55 species of Cricetidae rodents. Second, we used machine learning to assess the impact of land use on the probability of presence of the rodent species linked with reservoirs of pathogenic hantaviruses. Our results show that hosts are widely present across the Americas. Some hosts are present in primary forest and agricultural land but not in secondary forest, whereas other hosts are present in secondary forest and agricultural land. The diversity of host species allows Hantavirus to circulate across a wide spectrum of habitats, particularly rural rather than urban ones. We highlight that public repositories of genomic data and species occurrence are very useful resources for monitoring potential enzootic transmission and spillover of zoonotic viruses in relation to the changes that humans produce in the biosphere.
README
Unveiling the Impacts of Land Use on the Phylogeography of Zoonotic New World Hantaviruses
Gabriel E García-Peña and André V. Rubio. Ecography 2024. DOI: 10.1111/ecog.06996
Supplementary Material
Description of the data and file structure
The analysis presented in the main article was performed in R (R Core Team 2022), MAFFT (Katoh 2005), and JModelTest2 (Darriba et al. 2012), following four main steps:
1. Data Collection and Curation.
BLAST_Nprot.csv: Accession numbers from the BLAST search for Hantavirus. With this list of accession numbers, the genetic sequences can be downloaded in R with the function read.GenBank() from the library ape. Metadata for these sequences can be accessed with the R code in the file fetch.metadata.R (see the Code/Software section).
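The download step could be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the accession numbers sit in the first column of BLAST_Nprot.csv (the column layout is not documented here), and the helper names `load_accessions()` and `download_sequences()` are hypothetical.

```r
# Sketch: download the sequences listed in BLAST_Nprot.csv from GenBank.
# Assumption: accession numbers are in the first column of the CSV file.

load_accessions <- function(csv_path) {
  blast <- read.csv(csv_path, stringsAsFactors = FALSE)
  # Drop stray whitespace and repeated accession numbers.
  unique(trimws(as.character(blast[[1]])))
}

download_sequences <- function(accessions) {
  # read.GenBank() queries NCBI, so it needs the ape package and internet access.
  ape::read.GenBank(accessions)
}

# Only runs when the file is in the working directory.
if (file.exists("BLAST_Nprot.csv")) {
  accessions <- load_accessions("BLAST_Nprot.csv")
  seqs <- download_sequences(accessions)
  # Species names reported by GenBank are stored as an attribute.
  print(head(attr(seqs, "species")))
}
```

Addressing the column by position avoids guessing a header name; if the file stores the accession numbers elsewhere, change the index accordingly.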
2. Genetic Sequence Alignment and Phylogenetic Inference.
Nprot_MaxAlign.fas: FASTA file with the multiple sequence alignment of the genetic sequences. The file can be read in R with the function read.dna() from the library ape. These data can be analyzed with the software JModelTest2.
AIC_tree.nwk: Phylogenetic relationships, with topology and branch lengths inferred with JModelTest2, in Newick format. The file can be read with the function read.tree() from the R library ape. A figure of the tree is included in this repository (phylogeny_NWH.jpeg).
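As a hedged illustration of the two reading functions, the snippet below runs on toy data so that it works without the repository files; the sequence names and the toy tree are placeholders. The assumption that tip labels in the tree match the sequence names in the alignment is ours and should be checked against the real files.

```r
library(ape)

# Toy alignment written to a temporary file (placeholder sequences).
fasta <- tempfile(fileext = ".fas")
writeLines(c(">seqA", "ATGCATGC",
             ">seqB", "ATGAATGC",
             ">seqC", "ATGCATGA"), fasta)

# Same call as for the real file: read.dna("Nprot_MaxAlign.fas", format = "fasta")
aln <- read.dna(fasta, format = "fasta")
dim(aln)  # number of sequences x number of aligned sites

# Toy tree; for the real file use read.tree("AIC_tree.nwk").
# read.tree() returns a "multiPhylo" list if a file holds more than one tree.
tree <- read.tree(text = "((seqA:0.1,seqB:0.2):0.05,seqC:0.3);")
Ntip(tree)

# Tips of the tree that also appear in the alignment.
shared <- intersect(tree$tip.label, rownames(aln))
shared
```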
3. Phylogenetic Network analysis on the genetic links of Hantaviruses among hosts.
The phylogenetic network was inferred from the phylogenetic trees contained in AIC_tree.nwk, the R code (phylonet.R), and the dataset hosts_america_PH.csv.
4. Geographic analysis on the habitat suitability of hosts linked in the phylogenetic network.
taxa_in_network_NWH.zip: The zip file contains the results of the habitat suitability analysis performed with the R code predict_suitable_habitat.R. The maps are GeoJSON files named after each species. They capture the probability of species presence (X1) and absence (X0) within 0.25° pixels inside the distribution area of the species. Land use variables for 2015 are included for each location. These variables are the proportions of the pixel covered by each vegetation type: primary forest (primf), primary non-forest (primnf), secondary forest (secdf), secondary non-forest (secdn), rangeland (range), pasture (pastr), annual C4 crops (C4ann), perennial C4 crops (C4per), perennial C3 crops (C3per), annual C3 crops (C3ann), and nitrogen-fixing plants (C3fix). These land use variables were used to predict X1 and X0.
The files can be viewed with any geographic information software, including R.
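One way to open the maps in R is sketched below. It rests on assumptions not stated in this README: that the sf package is installed and that the files inside the archive carry a .geojson extension. The helper `find_maps()` is hypothetical.

```r
# Sketch: extract the archive, list the maps, and summarise one of them.
# Assumptions: sf is installed; the maps use a .geojson file extension.

find_maps <- function(dir) {
  list.files(dir, pattern = "\\.geojson$", recursive = TRUE,
             full.names = TRUE, ignore.case = TRUE)
}

zip_path <- "taxa_in_network_NWH.zip"
# Only runs when the archive is in the working directory.
if (file.exists(zip_path)) {
  out_dir <- file.path(tempdir(), "taxa_in_network_NWH")
  unzip(zip_path, exdir = out_dir)
  maps <- find_maps(out_dir)
  if (length(maps) > 0) {
    one <- sf::st_read(maps[1], quiet = TRUE)
    print(names(one))        # expected to include X1, X0 and the land use variables
    print(summary(one$X1))   # predicted probability of presence per pixel
  }
}
```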
Occurrence data used to analyse habitat suitability can be accessed from the original source: GBIF data: https://doi.org/10.15468/dl.pqwhfw
Sharing/Access information
Primary data used to perform the analysis can be accessed from the official repositories:
- GBIF data: https://doi.org/10.15468/dl.pqwhfw
- Historical land-use dataset states.nc (LUH2 v2h) covers the period 850-2015 and projections for 2025: https://luh.umd.edu/data.shtml
- Distributions of rodent species from the IUCN: https://www.iucnredlist.org/resources/spatial-data-download
Code/Software
Description of files within this repository:
- fetch.metadata.R: Code of a web scraper to retrieve information about the sequences in the NCBI repository.
- BLAST_Nprot.csv: List of the accession numbers obtained from the BLAST search.
- Nprot_MaxAlign.fas: Fasta file with the nucleotide sequences analysed; 278 genomic samples of the S segment (N protein) of NWH found in 55 species of Cricetidae rodents.
- AIC_tree.nwk: Inferred phylogenetic tree.
- hosts_america_PH.csv: List of species known to host New World Hantavirus. The first column contains the genus name, column 2 the species name, and column 3 indicates whether the species is known to harbor a pathogenic strain of Hantavirus (1) or not (0).
- phylonet.R: Code describing the phylogenetic network analysis.
- predict_suitable_habitat.R: Code for the classification tree analysis of the habitat suitability of the rodent hosts.
- taxa_in_network_NWH.zip: Maps for each species analysed with a prediction of habitat suitability (X1) in the distribution range of the species, drawn from the land use change variables (García-Peña et al. 2021).
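As a small illustration of the layout described for hosts_america_PH.csv, the following sketch selects the hosts flagged as carrying a pathogenic strain. The rows and header names below are placeholders, not values from the real file, and whether the real file has a header row is an assumption.

```r
# Toy stand-in for hosts_america_PH.csv; names and values are placeholders.
hosts <- read.csv(text = "genus,species,pathogenic
GenusA,alpha,1
GenusB,beta,0
GenusC,gamma,1", stringsAsFactors = FALSE)
# With the real file: hosts <- read.csv("hosts_america_PH.csv")

# Columns are addressed by position because the header names are an assumption:
# column 1 = genus, column 2 = species, column 3 = pathogenic flag (1 or 0).
binomial <- paste(hosts[[1]], hosts[[2]])
pathogenic_hosts <- binomial[hosts[[3]] == 1]
pathogenic_hosts
```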
Methods
Data analysis follows 4 main steps:
- Data Collection and Curation. GenBank accession numbers of Hantavirus sequences were obtained from a BLAST query, metadata were collected, taxonomic names were homogenized, and sequences found in wild animals were selected.
- Genetic Sequence Alignment and Phylogenetic Inference. Genetic data was aligned and used to infer the phylogenetic relationships among the samples.
- Phylogenetic Network analysis on the genetic links of Hantaviruses among hosts. The phylogenetic network was built from the phylogenetic tree of New World Hantavirus.
- Geographic analysis on the habitat suitability of hosts linked in the phylogenetic network. Habitat suitability within the distribution area of each species was modeled with classification trees. Historical records of species presence were used to assess the land use at the time of sampling and to train a model that predicts the presence of the species from 12 land use variables. These models were used to predict the area of suitable habitat in a projection of land use for 2025.
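The last step can be illustrated with a toy classification tree. This is a sketch on synthetic data, not the content of predict_suitable_habitat.R: the use of the rpart package, the three predictors chosen, and the presence rule are all assumptions made for illustration.

```r
# Sketch: classification tree predicting presence from land use proportions.
# Assumptions: rpart as the tree implementation; synthetic data throughout.
library(rpart)

set.seed(1)
n <- 200
land <- data.frame(primf = runif(n), secdf = runif(n), pastr = runif(n))

# Synthetic rule, for illustration only: presence where primary forest is high.
land$presence <- factor(ifelse(land$primf > 0.5, 1, 0))

fit <- rpart(presence ~ primf + secdf + pastr, data = land, method = "class")

# Class probabilities per pixel; matrix columns are named "0" and "1".
prob <- predict(fit, newdata = land, type = "prob")

# data.frame() turns those names into X0 and X1.
prob_df <- data.frame(prob)
head(prob_df)
```

The resulting column names coincide with the X0 and X1 fields in the maps, although whether the authors' script produces them this way is an assumption. Predicting for another year would amount to passing that year's land use values as `newdata`.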