Skip to main content

Wallacean and Melanesian islands promote higher rates of diversification within the global passerine radiation Corvides: Supplementary information

Cite this dataset

McCullough, Jenna et al. (2022). Wallacean and Melanesian islands promote higher rates of diversification within the global passerine radiation Corvides: Supplementary information [Dataset]. Dryad.


The complex island archipelagoes of Wallacea and Melanesia have provided empirical data behind integral theories in evolutionary biology, including allopatric speciation and island biogeography. Yet, questions regarding the relative impact of the layered biogeographic barriers, such as deep-water trenches and isolated island systems, on faunal diversification remain underexplored. One such barrier is Wallace’s Line, a significant biogeographic boundary that largely separates Australian and Asian biodiversity. To assess the relative roles of biogeographic barriers—specifically isolated island systems and Wallace’s Line—we investigated the tempo and mode of diversification in a diverse avian radiation, Corvides (Crows and Jays, Birds-of-paradise, Vangas, and allies). We combined a genus-level dataset of thousands of ultraconserved elements (UCEs) and a species-level, 12-gene Sanger sequence matrix to produce a well-resolved supermatrix tree that we leveraged to explore the group’s historical biogeography and effects of biogeographic barriers on their macroevolutionary dynamics. The tree is well-resolved and differs substantially from what has been used extensively for past comparative analyses within this group. We confirmed that Corvides, and its major constituent clades, arose in Australia and a burst of dispersal west across Wallace’s Line occurred after the uplift of Wallacea during the mid-Miocene. We found that dispersal across this biogeographic barrier were generally rare, though westward dispersals were two times more frequent than eastward dispersals. Wallacea’s central position between Sundaland and Sahul no doubt acted as a bridge for island-hopping dispersal out of Australia, across Wallace’s Line, to colonize the rest of Earth. In addition, we found that the complex island archipelagos east of Wallace’s Line harbor the highest rates of net diversification and are a substantial source of colonists to continental systems on both sides of this biogeographic barrier. Our results support emerging evidence that island systems, particularly the geologically complex archipelagoes of the Indo-pacific, are drivers of species diversification.



We generated two datasets: a genus-level matrix of ultraconserved elements (UCEs; Faircloth et al. 2012), and a species-level supermatrix of 12 Sanger-sequenced genes. We sampled 86 passerine taxa for our UCE matrix, which comprised 77 species (76 genera) representing 29 of the 30 families within Corvides (IOC v.8.2; Gill and Donsker 2018); we lacked only the Bornean endemic family, Pityriasidae (Table S1). We diverged from IOC v.8.2 taxonomy by treating Platylophus, which has been traditionally considered within Corvidae, as its own monotypic family, Platylophidae, (Aggerbeck et al. 2014; Winkler et al. 2015; Oliveros et al. 2019). We also sampled nine outgroups from the following families: Meliphagidae, Orthonychidae, Pomatostomidae, Cnemophilidae, Petroicidae, Cisticolidae, Stenostiridae, Passeridae, and Chloropseidae. For our species-level supermatrix, we included up to 12 Sanger-sequenced gene regions from 728 species, representing 87% of all Corvides species (Table S2–3). The majority of these data were downloaded from GenBank. For the 86 species in the UCE matrix, we mined off-target, mitochondrial sequencing reads and combined these with multilocus nuclear data from GenBank. Finally, we sequenced de novo RAG-1 and RAG-2 for some individuals (Table S2–3). The UCE and multilocus datasets are mutually exclusive and are linked by the 86 individuals that were shared between the two datasets. 


Laboratory techniques

We extracted and purified DNA from fresh muscle or liver tissue or toepad samples from museum specimens using the Qiagen DNeasy Blood and Tissue Kit following the manufacturer’s protocol. We quantified DNA extracts with a Qubit 2.0 Fluorometer and sheared 500 ng of DNA with a Covaris S220 sonicator (175 W peak incident power, 2% duty factor, and 200 cycles per burst for 45 seconds). Following established protocols (Faircloth et al. 2012, McCormack et al. 2016), we performed end repair, A-tailing, adapter ligation, and amplification on sheared DNA using the Kapa Biosystems Library Prep kit. We deviated from this protocol in three ways: we used ¼ volume of reagents as called for in the library prep kit, we ligated universal iTru stubs instead of sample-specific adapters to permit dual-indexing, and we added a second AMPure XP bead cleanup at 1.0X volume after stub ligation. We added dual-indexed iTru adapters ( to DNA fragments in a 17-cycle PCR using NEB Phusion High-Fidelity PCR Master Mix. We combined libraries in pools of eight equimolar samples and enriched pooled libraries for 24 hours using the Arbor Biosciences MYbaits kit for Tetrapods UCE-5Kv1, which targets 5,060 UCE loci. After post-enrichment amplification, we quantified libraries using an Illumina Eco qPCR System and sequenced them on a high output paired-end run of 100 cycles on an Illumina HiSeq 2500 System at the University of Kansas Genome Sequencing Core. We multiplexed 96 individuals on a single Illumina lane, including 10 samples for other projects.


UCE Data Assembly

We processed UCE data using the Python bioinformatics pipeline PHYLUCE (Faircloth 2016). Raw reads were de-multiplexed using CASAVA v.1.8.2. To trim adapter-sequences and low-quality bases from reads, we used the parallel wrapper Illumiprocessor v.1 (Faircloth 2013) around Trimmomatic v.0.32 (Bolger et al. 2014). We assembled cleaned reads into contigs using Trinity v.r2013.08.14 (Grabherr et al. 2011) and extracted contigs matching UCE loci. We aligned loci with Mafft v.7.130b (Katoh and Standley 2013), allowing missing nucleotides at the flanks of the alignment only if at least 65% of taxa contained data, the default option in PHYLUCE. We trimmed alignments using Gblocks v.0.91b (Castresana 2000) and default parameters with the exception of the minimum number of sequences for a flank position in Gblocks, which we set at 65% of taxa. We generated a complete dataset (composed of loci common to all taxa) and a 75%-complete dataset (for which at least 64/86 taxa were represented per UCE locus). 


Phylogenetic Analyses of UCE Data

We performed maximum likelihood (ML) inference on the concatenated loci of the 75%-complete dataset using RAxML v.8.1.3 (Stamatakis 2014) assuming a general time-reversible model of rate substitution and gamma-distributed rates among sites (GTR+G). We evaluated node support using 1000 rapid bootstraps. Next, we used gene-tree-based coalescent methods (GCM) to estimate Corvides relationships applying strategies of subsampling taxa, using only the most informative loci, and separately analyzing samples with short sequences with trimmed alignments to increase resolution and consistency between estimates. We performed coalescent analysis on a total of five subsets of the 86 species. The first subset contained 21 species that focused on the early branches and major superfamilies in Corvides. The second subset consisted of the same species but also included Mohoua albicilla—the only sample that contained substantially lower mean contig length compared to the rest of the samples—because it was derived from a toepad sample. Using Gblocks and following best practices of Hosner et al. (2016), we trimmed this second subset alignment so that no missing data were present at their flanks (i.e., to match the reduced length of M. albicilla). The remaining three subsets focused on the three superfamily clades recovered by Aggerbeck et al. (2014; Fig. 1C). We refer to these groups as “Orioloidea,” “Malaconotoidea,” and “Corvoidea”, corresponding to clades X, Y, and Z of Aggerbeck et al. (2014), respectively. In summary, these last three subsets comprised 25, 25, and 43 species from Orioloidea, Malaconotoidea, and Corvoidea, respectively. We removed uninformative loci for GCM analyses based on whether a locus’ number of parsimony-informative sites and bootstrap support was lower than the mean for that dataset (Table S8).

For coalescent analyses, gene-tree inference and bootstrapping were performed in RAxML v.8.1.3 (Stamatakis 2014), as implemented in PHYLUCE (Faircloth 2016). We modified the PHYLUCE scripts to implement multi-locus bootstrapping (i.e., sampling with replacement of loci and sites; Seo 2008) and generated 500 multi-locus, bootstrap replicate sets of gene trees for each dataset. On each replicate set of gene trees, we performed coalescent analyses using four GCMs: STAR and STEAC, as implemented in the R package phybase v.1.3 (Liu et al. 2009b), MP-EST v.1.4 (Liu et al. 2010), and ASTRAL v.4.7.7 (Mirarab et al. 2014). All programs were run with default options. Species trees inferred from the bootstrap replicates were summarized with 70% consensus trees using SumTrees v3.3.1, part of the DendroPy 3.12.0 package (Sukumaran and Holder 2010, 2015). 

Supermatrix Assembly

We assembled a species-level data matrix using Sanger sequence data available on GenBank. Following the taxonomy of the International Ornithological Congress World Bird List (IOC v.8.2; Gill and Donsker 2018), we downloaded sequence data for 12 loci, which comprised four mitochondrial (COI, Cyt-b, ND2, and ND3) and eight nuclear (c-mos, Fib-5, GAPDH, Myo2, ODC, RAG-1, RAG-2, and TGFb2) for 728 species (Table S2). These 12 gene regions are widely used in higher-level avian systematics (Groth and Barrowclough 1999; Hackett et al. 2008; Moyle et al. 2009a; Tello et al. 2009) and specifically to infer relationships in Corvides (Cicero and Johnson 2001; Filardi and Moyle 2005; Irestedt et al. 2008; Nyári et al. 2009; Jønsson et al. 2011, 2016; Slager et al. 2014; Andersen et al. 2015a). We preferentially chose sequences from a single vouchered specimen and attempted to include as many taxa with sequence data that were available on GenBank as of September 2018. Additionally, we incorporated new RAG-1 and RAG-2 sequences (see Barker et al. 2004 for established amplification and sequencing protocol) and off-target mitogenome reads of UCE data, following Andersen et al. (2019) (Table S2). Due to the piecemeal nature of supermatrices, not all taxa had genetic data for all gene regions. Our dataset averaged 5.2 +/- 3.1 gene regions per taxon. At the low end, ca. 10% of tips (79/728) were represented by a single gene region, whereas at the high end, ca. 10% (77/728) were represented by ≥10 gene regions. We aligned gene-specific datasets using Muscle (Edgar 2004) in Geneious v.R9 (Kearse et al. 2012). We explored this dataset extensively for errors of sample misidentification by inspecting alignments by eye and as evidenced by conflicting gene trees. We used this dataset to estimate a UCE-constrained topology of all Corvides.


Phylogenetic Analysis of Supermatrix and Time Calibration

We used BEAST v2.5.1 (Bouckaert et al. 2014) on the CIPRES Science Gateway (Miller et al. 2010) to estimate a UCE-constrained, time calibrated phylogeny of the full 728-taxon Sanger matrix. To constrain the topology to the UCE backbone, we set up 70 monophyly statements in BEAUti for well-supported, higher-level nodes in the UCE constraint tree but allowed species-level relationships to be inferred based on the 12-gene matrix. In practice, we ensured that tip names of the 86 UCE samples were identical to their corresponding tips in the supermatrix so that BEAST recognized them as equals in the constraint analysis. For uncertain nodes in our backbone dataset, we followed the topology of Oliveros et al. (2019). Rather than independently calibrate a heterogeneous dataset, we incorporated secondary calibration points drawn from Oliveros et al. (2019) on 30 of the monophyly-constrained nodes. Oliveros et al. (2019) inferred their higher-level Passeriformes tree using 13 fossil calibrations and a family-level UCE dataset of over 200 songbirds. The deepest node we calibrated was the split between Foulehaio and all other taxa (31.1286 Ma) and crown Corvides was calibrated at 25.57 Ma (see Figure S1 for placement of all monophyly statements and Table S4 for nodes with secondary calibrations). We conducted nine independent Markov chain Monte Carlo (MCMC) runs of 10generations with unlinked site models (GTR) for each of 12 partitions. We applied an uncorrelated lognormal (UCLN) clock model and a birth-death tree model, which were linked across partitions. We assessed convergence of parameter estimates and determined that ESS values were all > 200 by visualizing trace files in Tracer v.1.6 (Rambaut & Drummond, 2007). We discarded the first 16–26% of MCMC generations as burn-in, resulting in 38,834 trees. We sub-sampled every 4th tree before summarizing the remaining 9,708 trees in a 50% maximum-clade-credibility tree using LogCombiner v.2.5.1 and TreeAnnotator v.2.5.1 (Drummond and Rambaut 2007). 


Historical Biogeography

We investigated the biogeographical history of Corvides using BioGeoBEARS (Matzke 2013, 2014) in R (R Development Core Team 2013). This program allows researchers to implement user-defined geographic areas that are constrained to infer ancestral ranges on a time-calibrated phylogeny in a likelihood framework (Matzke 2013). We tested the program’s six different biogeographic models (DEC, DEC+j, DIVA, DIVA+j, BayArea-like, and BayArea-like+j) and used AIC to assess model fitness, preferring the model with the lowest AIC score. The dispersal-extinction-cladogenesis (DEC) model allows for range expansion (dispersal) and contraction (extinction) along with constant cladogenesis, in which both daughter lineages have equal probability to inherit the ancestral range. The inclusion of the founder event parameter (+j) allows for long-distance dispersal, in which the daughter lineage inhabits a range that the ancestor did not. Though this parameter has been the subject of criticism as it has a tendency to over-explain biogeographic variation (Ree and Sanmartín 2018), we evaluated our results with and without +j. However, we included the +j parameter because it is important for highly dispersive island lineages, for which Corvides has many examples (such as Pachycephala whistlers,Edolisoma cicadabirds, Rhipidura fantails, and even Chasiempis monarchs and one Corvus that have reached Hawai’i). 


We estimated eight biogeographic scenarios, whereby we coded from two to seven areas globally (Tables S5–6). The 2-area scheme explored dispersal out of the Sahul Region across Wallace’s Line (from east to west), whereas our 3-area scheme isolated the Philippines and therefore tested the effects of Huxley’s Line (Huxley 1868). The 4-area scheme divided the world west of Wallace’s Line into three areas (Africa, the Palearctic and Southeast Asia [“Eurasia”], and the Americas), whereas the 5-area scheme again isolated the Philippines in another exploration of Huxley’s Line (Esselstyn et al. 2010). The 6-area scheme further divided New Guinea into its own area, separate from Australia. The 7-area scheme split the oceanic islands of Wallacea and the Philippines as separate from the Pacific archipelagos of Melanesia and Polynesia, inclusive of Hawai’i and New Zealand. We aimed to account for geological evidence that indicates the emergence of New Guinea as a biogeographically relevant landmass occurred between 5–12 Ma (Hill and Hall 2003; van Ufford and Cloos 2005; Toussaint et al. 2014). Therefore, we ran the 6- and 7-area schemes with and without a time constraint on the orogeny of New Guinea. We set this time constraint on New Guinea as an ancestral area at 15 Ma, following Moyle et al. (2016). 


State-dependent Diversification

To assess how geographic barriers influence diversification within Corvides, we fit eight different state-dependent speciation and extinction models (SSE; Maddison et al. 2007) in RevBayes (Höhna et al. 2016). SSE models estimate per-lineage speciation and extinction rate parameters specific to particular traits, such as geography. We assessed how diversification within Corvides is best explained by in situ diversification within island/continental systems, crossing Wallace’s Line, a combination of both, or as an unmeasured trait. 


Generally, we tested two traits: island occurrence and dispersal across Wallace’s Line (Wallace 1863). We categorized islands as those that were either of oceanic origin (e.g., Solomon Islands, Mascarene Islands) or continental drift islands that had separated for the duration of the crown age of Corvides to the present (e.g., Greater Antilles, Madagascar, New Zealand, and New Caledonia). In terms of the timing of evolutionary dynamics of Corvides, these continental fragments are effectively isolated from other landmasses as oceanic islands. However, we considered continental plate (i.e., land-bridge) islands as “continents.” This type of island was connected with the mainland by a land-bridge during cyclical periods of intermittent low sea level caused by Pleistocene glacial cycles, thus allowing free movement between landmasses (e.g., New Guinea to Australia, Greater Sundas to mainland Southeast Asia). Our second trait was whether a taxon was present east or west of Wallace’s Line (Wallace 1863; Figure S2); taxa that occur on both sides of Wallace’s Line were considered “widespread.” 


First, we implemented GeoHiSSE (Caetano et al. 2018), which is a model extension of geographic state-dependent speciation and extinction (GeoSSE; (Goldberg et al. 2011)), which incorporates a hidden state to allow for the possibility of diversification changes that are due to unmeasured factors other than geography (HiSSE; (Beaulieu and O’Meara 2016)). Incorporation of a hidden state ameliorates problems with type I error that have been widely documented for SSE models, including GeoSSE (Alves et al. 2017), by redefining the null hypothesis to be one that has more complexity in diversification. We defined three main states in which extant Corvides taxa are found as either continent (state 0 as defined by figure S2), island (state 1), or both (i.e., widespread, state 01). For each of the states, we extended them to include a hidden state (denoted by A or B) to allow the data to be uncertain to test if diversification within Corvides is not associated with our measured geographic trait. We added anagenetic dispersal (d0A, d01A, d1A, d0B, d01B, and d1B), which describes dispersal from one state to the other, such as a continental taxon dispersing to an island, without a speciation event. We incorporated cladogenetic speciation, which includes within region (s0A, s1A, s0B, and s1B) speciation events. We allowed for extirpation (X0A, X1A, X0B, X1B) within endemic regions. Therefore, three classes of parameters (anagenetic dispersal, cladogenetic speciation, and extirpation) govern the transitions out of states within the GeoHiSSE model framework. Following Caetano et al. (2018), we did not allow for in situ speciation (s) or extirpation (X) for “widespread” taxa or dispersal (d) directly between endemic regions (i.e., transition from state 0 directly to state 1 is not allowed).  In addition, we extended the GeoHiSSE model (referred to as “extended GeoHiSSE”) of Caetano et al. (2018) to allow separate parameter estimations of the three types of cladogenetic range evolution that it models: allopatric cladogenesis, widespread sympatry, and subset sympatry (Fig. 1). It is important to distinguish here that “sympatry” within the GeoHiSSE model framework is different from “sympatry” in the ecological sense. Both widespread sympatry and subset sympatry range evolution relate to coarse areas. For example, two daughter lineages could occupy the same state in the GeoHiSSE model (i.e., widespread sympatry; Fig. 1) but are allopatric in nature.


To explore how the combined effect of both biogeographic barriers (insularity within island systems and Wallace’s Line) affected diversification within Corvides, we also implemented the Multistate Speciation and Extinction (MuSSE) model with a hidden state in RevBayes. We developed MuSSE and MuHiSSE (Nakov et al. 2019) models to infer diversification within four states: islands east of Wallace’s Line (0), continents east of Wallace’s Line (1), islands west of Wallace’s Line (2), and continents west of Wallace’s Line (3). We allowed for in situ speciation (s), anagenetic transitions between regions (i.e., dispersal; d), and “extinction” (X). Unlike GeoHiSSE, the MuSSE/MuHiSSE model framework does not allow for lineages to occupy the widespread (01) state. This distinction is important because GeoHiSSE’s extirpation parameter, which refers generally to an event of range reduction if in the widespread state, is inherently different than MuSSE/MuHiSSE’s extirpation parameter, which is confounded with extinction of the lineage. To limit confusion between these parameters across models, we use the term extinction for MuSSE/MuHiSSE. 


For each analysis, we ran between five and 16 independent chains to speed up computational time; total generations varied from 5.4x106 to 1.0x109 (Table S7). After assessing convergence (with ESS values >200), we combined posterior distributions in LogCombiner v.1.10.4 (Drummond and Rambaut 2007) with 25% burnin. Code for all RevBayes analyses is available at and Dryad [#TBD]. 

Usage notes

This dataset contains:

          1 nexus formatted 12-loci supermatrix alignment of 728 Corvides species and outgroup representatives. 

          1 nexus formatted alignment of the concatenated 75% complete UCE matrix for 86 genera 

          1 tree file of the final MCC tree that we present in this study 

          1 PDF file that has supplementary methods and results and 23 supplemental figures

          1 excel file that has 10 supplemental tables


National Science Foundation, Award: DEB 1557053

National Science Foundation, Award: DEB 1241181

National Science Foundation, Award: DEB 2112467