Skip to main content

Global dissemination of Influenza A virus is driven by wild bird migration through arctic and subarctic zones

Cite this dataset

Gass, Jonathon (2022). Global dissemination of Influenza A virus is driven by wild bird migration through arctic and subarctic zones [Dataset]. Dryad.


Influenza A viruses (IAV) circulate endemically among many wild aquatic bird populations that seasonally migrate between wintering grounds in southern latitudes to breeding ranges along the perimeter of the circumpolar arctic. Arctic and subarctic zones are hypothesized to serve as ecologic drivers of the intercontinental movement and reassortment of IAVs due to high densities of disparate populations of long distance migratory and native bird species present during breeding seasons. Iceland is a staging ground that connects the East Atlantic and North Atlantic American flyways, providing a unique study system for characterizing viral flow between eastern and western hemispheres. Using Bayesian phylodynamic analyses, we sought to evaluate the viral connectivity of Iceland to proximal regions and how inter-species transmission and reassortment dynamics in this region influence the geographic spread of low and highly pathogenic IAVs. Findings demonstrate that IAV movement in the arctic and subarctic follows seabird migration around the perimeter of the circumpolar north, favoring short-distance flights between proximal regions rather than long distance flights over the polar interior. Iceland connects virus movement between mainland Europe and North America, particularly due to the westward migration of wild birds from mainland Europe to Northeastern Canada and Greenland. Though virus diffusion rates were similar among avian taxonomic groups in Iceland, gulls act as recipients and not sources of IAVs to other avian hosts prior to onward migration. These data identify patterns of virus movement in northern latitudes and inform future surveillance strategies related to seasonal and emergent IAVs with pandemic potential.


Field sample collection

From May 2010 through February 2018, we obtained IAV isolates from various species of seabirds, shorebirds, and waterfowl as well as environmental sampling of avian fecal material from locations throughout Iceland (capture and swab data can be found here: (Dusek et al. 202X)). Live sampled birds were captured using a 18m x 12m cannon-propelled capture net, noose pole, or hand capture. Birds found dead or moribund were also sampled. Hunter-harvested waterfowl and fisheries-bycatch seabirds were sampled as available. All birds were identified to species and, for live birds, individually marked with metal bands. Age characteristics were determined and age was documented for each bird according to the following schemes adapted from U.S. Geological Survey year classification codes: hatched in same calendar year as sampling (1CY), hatched previous calendar year (2CY), hatched previous calendar year or older, exact age unknown (2CY+), hatched three calendar years prior to sampling (3CY), hatched four calendar years prior to sampling (4CY), hatched more than four calendar years prior to sampling (4CY+), or unknown if age could not be determined (U) (Olsen KM, 2004; Prater, Marchant, & Vuorinen, 1977; USGS, 2020). Due to species specific differences, not all aging categories could be applied to all species sampled. All live birds were immediately released following completion of sampling.

To sample for IAV, a single polyester-tipped swab was used to swab the cloaca only (2010-2013) or to first swab the oral cavity then the cloaca (2014-2017). Opportunistic environmental sampling of fecal material was also conducted using a direct swabbing method (2018). Each swab sample was immediately placed in individual cryovials containing 1.25 ml viral transport media (Docherty & Slota, 1988). Vials were held on ice for up to 5 hours prior to being stored in liquid nitrogen or liquid nitrogen vapor. Samples were shipped on dry ice from Iceland to Madison, Wisconsin, USA by private courier with dry ice replenishment during shipping. Once received in the laboratory, samples were stored at -80o C until analysis.

Virus extraction, RT-PCR, virus isolation

Viral RNA was extracted from swab samples using the MagMAXTM-96 AI/ND Viral RNA Isolation Kit (Ambion, Austin, TX) following the manufacturer’s procedures. Real-time RT-PCR was performed using previously published procedures, primers, and probes (Spackman et al., 2002) designed to detect the IAV matrix gene. RT-PCR assays utilized reagents provided in the Qiagen OneStep® RT-PCR kit. Virus isolation was performed in embryonating chicken egg culture on all swab samples exhibiting positive Ct values from RT-PCR analysis (Woolcock, 2008), with a primary cut off value of 45 on primary screen and 22 on secondary screen. All virus isolates were screened for the presence of H5 and H7 IAV subtypes using primers and probes specific for those subtypes (Spackman et al., 2002). Egg-grown virus isolates were sequenced using multiple standard methods including Sanger, Roche 454, and Illumina (HiSeq 2000 and MiSeq) sequencing (Dusek et al., 2014; Guan et al., 2019; Hall et al., 2014).

Datasets for phylodynamic analyses

Global dataset: The PB2 segment was selected as the basis for phylodynamic analysis. Advantages of focusing on PB2 - the largest internal segment of the IAV genome - include maximizing the number of nucleotides in the analysis (>2000 nts) and investigating transmission dynamics without targeting a specific subtype. All available avian and marine mammal IAV PB2 genes sequenced between 2009 and 2019 globally were downloaded from the National Center for Biotechnology Information Influenza Virus Resource database (NCBI IVR) ( on February 12, 2020, resulting in 13,469 sequences. Duplicate sequences (based on collection date, location, and nucleotide content) and sequences with less than 75% unambiguous bases were removed, and all vaccine derivative and laboratory-synthesized recombinant sequences were excluded. Sequences in the dataset were only included if isolation dates, location, and host species were available, resulting in 7,245 remaining sequences.

Downsampling of taxa: The downsampling strategy aimed to reduce the number of sequence taxa for computation and mitigate sampling bias while maintaining the genetic diversity in the dataset. Four variables were considered important for explaining genetic diversity in the IAV sequence dataset: geographic region, host taxa, sampling year, and hemagglutinin (HA) subtype. Geographic regions included North America, Europe, Iceland, Asia, Africa, and South America (Australia and Antarctica were removed due to insufficient sequence counts). HA subtypes included H1, H2, H3, H4, H6, H8, H10, H11, H12, H13, H14, H15, H16, and pooled H5/7/9. H5, H7, and H9 were combined, as these were over-represented in the global dataset. Host categories included Anseriformes, Charadriiformes, Galliformes, and Other, which comprised all other avian taxa and marine mammals.

To inform the downsampling strategy and evaluate if any of the four variables were correlated, a multiple correspondence analysis (MCA) was performed (JMP Pro v.14.0.0 (JMP Version 14.0.0, 1989-2019)). The MCA uses categorical data as input, which for this study included the sampling metadata associated with each sequence (region, host taxa, year, and HA subtype). Through representation of the variables in two-dimensional Euclidean space, significant clustering of HA subtypes with host taxa was detected (Supplementary Fig. 1), indicative of host-specific subtypes that are a well-known feature of influenza. These findings confirmed by previously published data on species-specificity of HA subtypes (Byrd-Leotis, Cummings, & Steinhauer, 2017; Long, Mistry, Haslam, & Barclay, 2019; Verhagen et al., 2015) led us to downsample the dataset stratifying taxa by two non-overlapping variables: geographic region and HA subtype. Data were downsampled to maintain 21-75 taxa per geographic region category and 6-30 per HA subtype category, resulting in a total of 301 sequences (outgroup). This step was performed to mitigate sampling bias resulting in over-representation of species or viral strains, while accounting for genetic diversity in the dataset. Next, to ensure relative evenness of geographic state groupings for discrete trait analyses, virus sequences from Iceland (n=93) were downsampled by stratifying taxa by HA subtype and maintaining 1-15 sequences per category, resulting in 63 sequences (ingroup). These 63 sequences were used for global and local discrete trait analyses and reflected the composition of diverse subtypes by host for the full Iceland sequence dataset. The resulting dataset reflected the underlying composition of host-specific subtypes present in this localized system. To assist with rooting and time-calibration of the tree, historical avian sequences from NCBI IVR were downloaded for the years 1979-2008. These were downsampled by year to ensure one sequence per year, resulting in 30 historic sequences. The total downsampled dataset, including the outgroup (n=301), ingroup (n=63), and historic sequences (n=30) resulted in a total of 394 sequences.

Europe-Iceland-North America Datasets: To elucidate viral dynamics between significant source regions and Iceland and within-Iceland phylodynamics, a second analysis was performed at a restricted scale to Europe, Iceland, and North America. The cleaned global dataset described above (n=7245) was downsampled to include significant source regions of North America (n=3222) and Europe (n=407), totaling 3629 sequences. To identify at lower spatial resolution the source/sink locations relevant to Iceland, a K-means cluster analysis was performed (JMP Pro v.14.0.0 (JMP Version 14.0.0, 1989-2019)) using latitude/longitude coordinates for each of the 3629 sequences (obtained by extracting sampling location from the strain name of each sequence and searching in A total of 20 intraregional clusters resulted in highest support. Identified clusters with <50 sequences were combined with geographically proximal clusters to increase evenness of within cluster sequence counts for discrete traits analyses, resulting in 13 intraregional cluster locations within North America, Iceland, and the rest of Europe (Supplementary Fig. 2).

Viral sequences were then downsampled from 3629 to 743 taxa stratifying by intraregional cluster groupings and HA subtype. Two datasets were formed, both of which included the 743 downsampled sequences (557 from North America, 229 from Europe excluding Iceland), 30 historic sequences (same as the global analysis), and: i) for discrete trait phylodynamic analyses: 63 downsampled Iceland-derived sequences, totaling 836 sequences, and ii) for continuous trait phylogeographic analysis, all 93 Iceland-derived sequences were included, totaling 866 sequences. The two datasets were formed because discrete trait analyses require evenness of trait groupings, whereas continuous trait analyses are robust against unevenness, thus the full Iceland dataset was included for continuous analyses. Multiple sequence alignments of PB2 sequences were performed using MUSCLE in Geneious Prime v.2020.01.02. and trimmed to the open reading frame.

Time-scaled Bayesian phylogenetic analyses

To construct time-scaled trees, Bayesian molecular clock analyses were conducted using the Markov chain Monte Carlo (MCMC) method in BEAST v.1.10.4 (Suchard et al., 2018). Phylogenies were reconstructed using a) a Generalized Time-Reversible model (GTR) of nucleotide substitution with a gamma distribution of site heterogeneity (also known as the Yang96 model), b) a coalescent-based Gaussian Markov random fields (GMRF) Skyride model for estimating effective population size dynamics (Minin, Bloomquist, & Suchard, 2008), and c) an uncorrelated relaxed lognormal clock. The BEAGLE library was used to optimize computational efficiency (Baele, Ayres, Rambaut, Suchard, & Lemey, 2019). Eight independent MCMC analyses were run for 200 million generations, sampled every 20,000 runs, and parameter convergence and effective sample size (ESS) (required to be >200) were evaluated in Tracer v.1.7.1. Using LogCombiner v.1.10.4, at least 10% burn-in was removed from each run and independent runs were combined to establish the maximum clade credibility (MCC) tree, from which the last 500 trees from the posterior distribution were extracted and used as the empirical tree for all subsequent phylodynamic analyses (Pagel, Meade, & Barker, 2004). Trees were visualized using Figtree v1.4.4.

Phylogeographic and phylodynamic analyses

To infer transitions between geographic states, evaluate virus diffusion among host taxa, and reassortment patterns and rates between IAV subtypes, joint discrete and continuous trait Bayesian phylodynamic analyses were performed using BEAST v.1.10.4. An asymmetric substitution model with Bayesian stochastic search variable selection (BSSVS) and a strict clock model were used to estimate diffusion between discrete states. Statistical significance for diffusion between discrete states was determined by the following Bayes Factors (BF): 3 ≤ BF < 20, 20 ≤ BF <150, and BF ≥ 150, denoting positive, strong, and very strong support, respectively (Kass & Raftery, 1995). To quantify transitions between geographic or host states (Markov jumps, i.e. the frequency of transitions from one geographic or host state to another along phylogenetic branches) and the duration of time viruses spend in each state (Markov rewards), posterior inference of the complete Markov jump history through time was included. Host-specific wavefront distance estimates were also calculated, which represent the distribution of great-circle distances (in kilometers) traveled from tree root location through time (Minin & Suchard, 2008; Trovão, Suchard, Baele, Gilbert, & Lemey, 2015).

Continuous Bayesian phylogeographic analyses were performed by annotating the empirical tree with latitude and longitude locations of all sampling sites via to infer and visualize the evolutionary and geographic relationships between global sequences. Sequences between 1978 and 2008 were masked so that they contributed to the overall tree structure but were not included in the calculation of diffusion rates globally or within Iceland (Hicks, Dimitrov, Afonso, Ramey, & Bahl, 2019). A Cauchy relaxed random walk (RRW) model and an uncorrelated relaxed lognormal clock were used and a random jitter of 0.001 was applied to isolates with matching geographic latitude and longitude coordinates (Lemey, Rambaut, Welch, & Suchard, 2010). Wavefronts, the distribution of great-circle distances (in kilometers) traveled from tree root location through time, and host-specific diffusion rates, the rate of virus movement (in kilometers per year) were calculated. GIS mapping was performed using QGIS v.3.16.3 (Team, 2020).

Ethical approvals

This research was conducted under approval of the U.S. Geological Survey National Wildlife Health Center's Animal Care and Use Committee, protocol numbers EP090325, EP120302, and EP153020 in strict accordance with guidelines set forth in the U.S. Governments Animal Welfare Act and the National Institutes of Health Office of Laboratory Animal Welfare. Permits for the capture and sampling of wild birds were issued by the Icelandic Institute of Natural History (Permit Numbers 368 and 403). Permits to ship collected cloacal swab samples from Iceland to the United States (US) were obtained from the Icelandic Institute of Natural History, the US Department of Agriculture, and the US Fish and Wildlife Service. All sampling occurred in conjunction with the Southwest Iceland Nature Research Institute and/or the University of Iceland, Snæfellsnes Research Centre on public and private land within regulations of these institutes and the Icelandic Institute of Natural History and, in the case of private lands, with the specific permission of landowners.

Usage notes

Contains 394 PB2 sequences, GenBank accession numbers, and all metadata for the global analysis 

Contains 866 PB2 sequences, GenBank accession numbers, and all metadata for the North Atlantic analysis 


National Institute of Allergy and Infectious Diseases, Award: HHSN272201400008C