The legacy of recurrent introgression during the radiation of hares
Ferreira, Mafalda S. et al. (2020), The legacy of recurrent introgression during the radiation of hares, Dryad, Dataset, https://doi.org/10.5061/dryad.bzkh18967
We generated new genome-wide resequencing data targeting 207,691 exonic and non-coding regions [totaling 61.7 Megabases (Mb) total] from 14 hare species (30 individuals)and the outgroup pygmy rabbit (Brachylagus idahoensis; 2 individuals). We combined these data with published whole exomes from four snowshoes hares (Lepus americanus; NCBI Sequence Read Archive BioProject PRJNA420081) and extracted data from the reference genome of European rabbit (Oryctolagus cuniculus; OryCun2.0) to use as a second outgroup. Our total sample of 15 hare species (34 individuals, 1 to 4 individuals per species) and 2 outgroup species (3 individuals) included species from all major regions of the Lepus native distribution. A smaller subset of 32 hare individuals (15 hare species), 1 pygmy rabbit individual and the European rabbit reference were used for subsequent analyses (check Supplementary Table S1 and Materials and Methods).
We reconstructed the evolutionary history of Lepus using maximum likelihood with a concatenated alignment and multispecies coalescent approaches, with ASTRAL and SVDquartets. We also studied phylogenetic incongruence among exome regions underlying the species tree. We infered hybridization among extant and ancestral branches of the species tree using network based inferences (PhyloNet and TreeMix) and summary statistics (minimum D-statistic, 'f-branch' statistic, admixture proportion and fraction of admixture).
The pipeline and custom scripts used to recapitulate the analyses in this work can be found at https://github.com/evochange/hare-phylogenomics.
Supplementary Materials and Methods.
Supplementary Figures S1 to S13.
Supplementary Table S1 to S15.
Pseudo-reference genomes for each species generated by iteratively mapping (four iterations and allowing ambiguity codes) cleaned single and paired-end reads from one individual per species (Supplementary Table S1 available on Dryad) to the European rabbit reference genome (OryCun2.0). Names correspond to sample IDS in Supplementary Table S1.
Consensus fasta sequences for hares and pygmy rabbit used as input for all analyses in this study. Inside the zip file there is a file per individual (identified by Sample ID in Supplementary Table S1) containing one fasta sequence per chromosome.
ASTRAL species tree obtained from 181 chromosome X gene trees, displayed in Supplementary Figure S8.
ASTRAL species tree obtained with 8889 gene trees without (“lineage”) or with (“species”) assigning individuals to species. These trees are displayed in Supplementary Figure S2.
Maximum likelihood gene trees from 181 X-linked regions used as input to generate the ASTRAL species tree astral5.6.3_chrX_TCs5_gene_trees.tre displayed in Supplementary Figure S8.
Maximum likelihood gene trees from 8889 regions used as input to generate the ASTRAL species trees astral5.6.3_lineage_TCs5_gene_trees.tre and astral5.6.3_species_TCs5_gene_trees.tre displayed in Supplementary Figure S2.
Fasta alignments generated from partitioning the genome in 50 kilobase (kb) windows with no overlap. Only windows inside target regions in the capture and 200 bp flanking regions were retained. Positions with more than 30% missing data were excluded from the alignments. These fasta alignments were used as input for RAxML to generate maximum likelihood gene trees (gene_trees_raxml_output.zip) that were later used for the multispecies coalescent analysis implemented in ASTRAL
SNP alignments in nexus format containing only 1473 X-linked independent SNPs (snp_alignment_33indvs_chr_X.nex) or containing 45,779 exome-wide independent SNPs (snp_alignment_33indvs_WithX.nex) used as input for SVDquartets.
SVDquartets output log files (no extension) and tree files (.tre). The analyses were performed with (“species”) or without (“lineage”) assigning individuals to species, and with X-linked (“chr_X”) or whole-exome (“WithX”) SNPs. Exome-wide trees are displayed in Supplementary Figure S3 and Chromosome X trees are displayed in Supplementary Figure S8.
Concatenated whole-exome maximum likelihood tree:
Fasta alignment containing 11,949,529 exome-wide concatenated positions with no missing data used to generate the Maximum Likelihood tree in Supplementary Figure S1.
RAxML output files used to estimate a single bifurcating phylogeny using the fasta alignment exome_alignment_33indvs_f100.fas.gz and using a maximum likelihood (ML) search and rapid bootstrapping run under the GTR+GAMMA model of sequence evolution.
Bayesian estimation of divergence times:
PHYLIP alignment with data for 15 hare species (1 individual per species) and 1 outgroup species containing 3 partitions corresponding to the 1st, 2nd and 3rd codon positions of 9015 protein coding genes used as input for Bayesian estimation of divergence times of the Lepus species tree with MCMCtree.
Output tree files obtained after Bayesian estimation of divergence times with MCMCtree. We run MCMCtree twice (run 1 and run2) per calibration strategy (a) molecular-based dates extrapolated from deep fossil calibrations in the lagomorphs (tree 3) or (b) fossil calibrations for the divergence of the genus Lepus and the Lepus species (tree5). See Materials and Methods for details. Run1 trees are displayed in Supplementary Figure S5.
Folders containing all input files (alignment and species with age priors), control files (mcmc.ctl) and output files (trees and log files for hessian and approximate maximum likelihood estimations) obtained from two MCMCtree runs with different calibration strategies (a) molecular-based dates extrapolated from deep fossil calibrations in the lagomorphs (tree 3) and (b) fossil calibrations for the divergence of the genus Lepus and the Lepus species (tree5).
Majority Rule Consensus Tree
Majority Rule Consensus Tree obtained with RAxML (-L MRE option) with the set of gene trees (gene_trees_TCs5.tree.gz) used to generate the ASTRAL species tree. This tree is displayed in Supplementary Figure S7.
Nexus file containing 8889 gene trees and PhyloNet command used to produce pseudo-maximum likelihood networks with 0 to 4 reticulation events displayed in Supplementary Figure S10.
PhyloNet raw outputs obtained from pseudo-maximum likelihood network analyses with 0 to 4 reticulation events. Outputs contain the results of 10 interactions and the five best networks. The best networks from each run are displayed in Supplementary Figure S10.
Allele frequencies for 30,709 biallelic SNPs from 15 hare species used as input for ancestral population graph reconstruction with TreeMix.
Output files for 10 TreeMix runs with 0 to 9 reticulations events using allele frequency data of 30,709 biallelic SNPs for 15 hare species snp_alignment_LepusOnly_hap_treemix_inputtreemix_file_out.txt.gz. The networks in these files and corresponding residuals are displayed in Supplementary Figures S11 and S12.
Genetic divergence and nucleotide diversity:
Geno file containing the whole-exome alignment used to calculate genetic divergence between and nucleotide diversity and heterozygosity within hare species with Genomic Generals (https://github.com/simonhmartin/genomics_general) scripts. This was also the input file used to calculate fhom among black-tailed jackrabbits and snowshoe hare populations.
Nucleotide diversity (“pi”) and genetic divergence (“dxy”) in 1 Megabase overlapping windows calculated for 15 hare species with the script popgenWindows.py from Genomics General (https://github.com/simonhmartin/genomics_general). Results average across windows can be found in Supplementary Table S8 and S9.
Minimum D-statistics (Dmin):
Derived allele frequencies for 15 hare species calculated with freq.py from Genomics General (https://github.com/simonhmartin/genomics_general) and the input file wg_filtered_wHead.geno.gz.
D-statistic values (D_stat_by_species_results_FINAL.txt) for all possible trios of species in the dataset of 15 hare species analyzed in this study and the minimum D-statistic per trio (Dmin_by_species_sig_FINAL.txt). These results are represented in Supplementary Figure S9.
‘f-branch’ statistics (fb(C)):
Derived allele frequencies for several taxa configurations (per species, per individual, per population or per haploid chromosome of the white-sided jackrabbit) used to calculate admixture proportion for all f(A,B,C,O) conformations necessary for the ‘f-branch’ statistic. In the table fb_C_comparisons_f_G.txt we list the ABBA-BABA configuration used to calculate admixture proportion for each branch b and C pair in the species tree (Figure 1b). The order of the columns is 1) branch b; 2) branch a (sister to b); 3) all B descendants of b; 4) all A descendants of a; 5) donor branch C; 6) taxa serving as Ca (or P3a); 7) taxa serving as Cb (or P3b).
Admixture proportion results for all f(A,B,C,O) configurations in fb_C_comparisons_f_G.txt used as input to calculate ‘f-branch’ statistics (fb(C)) among all pairs of b and C branches in the species tree (Figure 1b).
Scans of fraction of admixture (fd):
Fraction of admixture (fd) between snowshoe hares and Alaskan hares, mountain hares or white-tailed jackrabbits for 50 kb overlapping (5 kb) sliding windows (.out files) and European rabbit annotations for the significant outlier windows in each scan (.txt files). Fraction admixture files were obtained with ABBABABAwindows.py from Genomics General (https://github.com/simonhmartin/genomics_general).
European rabbit annotations of the windows that are significant in all three fraction of admixture (fd) scans performed between snowshoe hares and Alaskan hares, mountain hares or white-tailed jackrabbits. This annotation was used for Gene Ontology enrichment analysis with g:Profiler.
Genetic divergence (dxy) among snowshoe hares, Alaskan hares, mountain hares and white-tailed jackrabbits in 50 kb overlapping (5 kb) windows obtained with popgenWindows.py from Genomics General (https://github.com/simonhmartin/genomics_general), and used to determine dxy of the fraction of admixture (fd) outlier windows. Results of this scan are displayed in Supplementary Figure S13.
Fundação para a Ciência e a Tecnologia, Award: Project Grant "CHANGE" - PTDC/BIA-EVF/1624/2014, Portuguese National Funds
National Science Foundation, Award: EPSCoR OIA-1736249 and DEB-1907022 Grant
Fundação para a Ciência e a Tecnologia, Award: PD/BD/108131/2015 PhD Grant, POPH-QREN from ESB and Portuguese MCTES/FCT
Fundação para a Ciência e a Tecnologia, Award: PTDC/BIA-EVF/1624/2014
National Science Foundation, Award: OIA-1736249
National Science Foundation, Award: NSF Graduate Research Fellowship (DGE-1313190)
Fundação para a Ciência e a Tecnologia, Award: CEEC Contract (CEECIND/00372/2018)
European Union's Seventh Framework Program, Award: CIBIO NEW-GEN Sequencing Platform (no. 286431)
M.J. Murdock Charitable Trust, Award: University of Montana Genomics Core
Eunice Kennedy Shriver National Institute of Child Health and Human Development, Award: R01HD073439
COMPETE2020, PORTUGAL2020, and ERDF, Award: POCI-01-0145-FEDER-022184
Fundação Luso-Americana para o Desenvolvimento, Award: Portugal-United States of America Research Networks Program Funds
COMPETE2020, PORTUGAL2020, and ERDF, Award: POCI-01-0145-FEDER-022184