Data from: Sources of gene tree discordance and their implications for systematics and evolution of a megadiverse Australian plant radiation (subtribe Hakeinae, Proteaceae)
Data files
Oct 22, 2024 version files (65.76 MB)
- BSSB_data_and_code.zip (65.75 MB)
- README.md (8.95 KB)
Dec 20, 2024 version files (65.46 MB)
- BSSB_data_and_code.zip (65.45 MB)
- README.md (9.06 KB)
Abstract
Resolving phylogenetic relationships in the presence of conflicting signal across genes is one of the major challenges of the phylogenomic era. Conflicting signals can emerge from biological processes, such as incomplete lineage sorting or introgression, or have technical origins, such as misaligned sequences. As such, decisions made in the process of estimating species trees may result in alternative tree topologies and large variation in branch support values, with important systematic consequences. Here we compare alternative strategies for alignment cleaning, loci filtering, and phylogenetic estimation in 551 taxa in the Proteaceae subtribe Hakeinae, to explore how these methodological choices affect the estimation of relationships. We found that node support values across gene trees were low and gene discordance was high in the Hakeinae, particularly in lineages from the Temperate Forests biome of southeastern Australia. Higher stringency of alignment cleaning tended to decrease node support and filtering desirable loci tended to increase gene concordance. Cleaning, filtering, and phylogenetic estimation method (short-cut coalescent or concatenation) had significant effects on tree topologies, with distinct clusters of similar tree topologies detected in tree space. Of note, when using concatenated approaches, the two largest Hakeinae genera, Hakea and Grevillea, were reciprocally monophyletic. However, using coalescent approaches, we regularly found that Hakea was nested within Grevillea. Our results suggest that widespread gene discordance may be the result of rapid radiation and incomplete lineage sorting, demonstrating the importance of assessing the drivers of discordance to understand phylogenetic relationships.
This README.txt file was generated 22.10.2024 by Alexander Skeels
GENERAL INFORMATION
1. Title of Dataset: Data from: 'Sources of gene tree discordance and their implications for systematics and evolution of a megadiverse Australian plant radiation (subtribe Hakeinae, Proteaceae)'
2. Author Information
Corresponding Investigator
Name: Dr Alexander Skeels
Institution: Research School of Biology, Australian National University, 46 Sullivans Creek Rd, Acton, ACT 0200, Australia
Email: alexander.skeels@gmail.com
Co-investigator 1
Name: Hervé Sauquet
Institution: National Herbarium of NSW, Botanic Gardens of Sydney, Mount Annan, NSW, Australia / Evolution and Ecology Research Centre, School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, NSW, 2052 Australia
Co-investigator 2
Name: Austin Mast
Institution: Department of Biological Science, Florida State University, 319 Stadium Drive, Tallahassee 32306, Florida, USA
Co-investigator 3
Name: Peter H. Weston
Institution: National Herbarium of NSW, Botanic Gardens of Sydney, Mount Annan, NSW, Australia
Co-investigator 4
Name: Peter M. Olde
Institution: National Herbarium of NSW, Botanic Gardens of Sydney, Mount Annan, NSW, Australia
Co-investigator 5
Name: Zoe K.M. Reynolds
Institution: Research School of Biology, Australian National University, 46 Sullivans Creek Rd, Acton, ACT 0200, Australia
Co-investigator 6
Name: Jéssica Fenker
Institution: Research School of Biology, Australian National University, 46 Sullivans Creek Rd, Acton, ACT 0200, Australia / Sciences Department, Museums Victoria, 11 Nicholson Street, Carlton, VIC 3053, Australia
Co-investigator 7
Name: Alan Lemmon
Institution: Department of Scientific Computing, Florida State University, Tallahassee, Florida, USA
Co-investigator 8
Name: Marcel Cardillo
Institution: Research School of Biology, Australian National University, 46 Sullivans Creek Rd, Acton, ACT 0200, Australia
3. Date of data collation: 2015-2024
4. Geographic location of data collection: Australia
5. Recommended citation for this dataset: Skeels et al. (2024), Data from: Sources of gene tree discordance and their implications for systematics and evolution of a megadiverse Australian plant radiation (subtribe Hakeinae, Proteaceae)
DATA & FILE OVERVIEW
Description of dataset
These data and R scripts were produced to infer the phylogenetic relationships among Proteaceae taxa in the subtribe Hakeinae and to investigate variation in node support and gene concordance.
Files:
1. Table_S1_R1.csv: Sequence labels for Hakeinae taxa
2. Data folder: includes all data used in this study
2.1: data/AHE_March2022_TrimmedAlignments
Description: 216 Anchored Hybrid Enrichment loci for 551 Hakeinae taxa and outgroups in fasta format
2.2: data/Hakeinae_ALA_records_cleaned.csv
Description: Point occurrence records for Hakeinae taxa from the Atlas of Living Australia
2.3: data/Hakeinae_biome_pam.csv
Description: Presence/absence matrix across biomes in biogeographic realms for Hakeinae taxa.
3. Scripts folder: Contains R scripts used to replicate analyses
3.1: scripts/script_0_discordance_ms_functions.R
Description: Custom functions used throughout R scripts.
3.2: scripts/script_1a_AHE_cleaning_H.R
Description: Cleaning AHE alignments with the "heavy" treatment
3.3: scripts/script_1a_AHE_cleaning_M.R
Description: Cleaning AHE alignments with the "moderate" treatment
3.4: scripts/script_1c_AHE_cleaning_summary.R
Description: Summarising cleaned AHE alignments with different statistics
3.5: scripts/script_2a_node_support_gene_trees.R
Description: Getting node support values from gene trees
3.6: scripts/script_2b_node_support_concatenated.R
Description: Getting node support values from concatenated trees
3.7: scripts/script_3a_filtering_gene_trees.R
Description: Filtering gene trees
3.8: scripts/script_4a_concordance_factors.R
Description: Exploring gene and site concordance factors
3.9: scripts/script_5a_tree_space.R
Description: Estimating and visualising topological differences in tree space
3.10: scripts/Script_6a_tip_concordance.R
Description: Tip concordance factors across phylogeny and geography
3.11: scripts/script_7a_introgression_tests.R
Description: ABBA-BABA tests for introgression
4. Outputs folder:
Contains 8 subfolders. Each subfolder is generated and populated using the R scripts and Data
4.1: output/alignment_summaries
Description: Two tables with summary statistics from multiple sequence alignment treatments.
4.1.1: node_support_LMH_raw.csv contains summary metrics of node support values estimated in IQ-Tree.
Columns are as follows:
- tree = the tree ID corresponding to the position of that tree in the gene_trees set for L, M, and H alignment cleaning strategies (below)
- locus = the corresponding locus for each gene tree
- mean_UFBoot = the mean ultrafast bootstrap value across all nodes in the tree, estimated in IQ-Tree
- mean_SHaLRT = the mean SH-like approximate likelihood ratio test value across all nodes in the tree, estimated in IQ-Tree
- prop_UFBoot = the proportion of ultrafast bootstrap values across all nodes in the tree that are greater than 90, estimated in IQ-Tree
- prop_SHaLRT = the proportion of SH-like approximate likelihood ratio test values across all nodes in the tree that are greater than 80, estimated in IQ-Tree
- all_set = the version of the cleaned alignment set (Low [L], Medium [M], and High [H])
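The per-tree support summaries described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name `summarise_support` and the support values are hypothetical, and it assumes "greater than" means strictly greater than the threshold.

```python
from statistics import mean


def summarise_support(values, threshold):
    """Return (mean support, proportion of nodes strictly above threshold)."""
    values = list(values)
    if not values:
        raise ValueError("no support values supplied")
    prop = sum(v > threshold for v in values) / len(values)
    return mean(values), prop


# Hypothetical support values for the internal nodes of one gene tree.
ufboot = [100, 95, 88, 72, 91]
shalrt = [99.1, 85.0, 80.0, 60.5, 92.3]

# Thresholds follow the column descriptions: 90 for UFBoot, 80 for SH-aLRT.
mean_ufboot, prop_ufboot = summarise_support(ufboot, threshold=90)
mean_shalrt, prop_shalrt = summarise_support(shalrt, threshold=80)
```

Under the strict-inequality assumption, a node with SH-aLRT of exactly 80 does not count towards prop_SHaLRT.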
4.1.2: alignment_summary_LMH.csv contains summary statistics for the sampling density, length, and phylogenetic information in each cleaned alignment set.
Columns are as follows:
- locus = the corresponding locus for each alignment
- n_seq = number of sequences in alignment
- n_sites = length of the alignment
- PIS_frac = fraction of sites with parsimony informative characters
- PIS_abs = absolute number of parsimony informative sites
- n_gaps = total number of missing data across all sequences
- prop_gaps = proportion of total alignment that is missing data
- mean_gaps = mean number of gaps per locus
- n_amb = total number of ambiguously called bases across all sequences
- prop_amb = proportion of total alignment that is ambiguously called bases
- mean_amb = mean number of ambiguously called bases per locus
- n_invariant = number of invariant sites in each locus
- n_partial_invariant = number of partially invariant sites in each locus
- prop_invariant = proportion of invariant sites in each locus
- prop_partial_invariant = proportion of partially invariant sites in each locus
- version = the version of the cleaned alignment set (Low [L], Medium [M], and High [H])
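As a rough illustration of how several of these columns relate to an alignment, the Python sketch below computes them for a toy alignment. It is not the authors' R code. The function name, the toy sequences, and the treatment of gaps and ambiguity codes are assumptions: "-" and "?" count as missing data, only A/C/G/T define site patterns, and a site is parsimony informative when at least two bases each occur in at least two sequences.

```python
from collections import Counter

GAP_CHARS = set("-?")  # assumption: these characters count as missing data
BASES = set("ACGT")    # assumption: only unambiguous bases define site patterns


def summarise_alignment(seqs):
    """Summarise a dict of equal-length aligned sequences (name -> string)."""
    rows = [s.upper() for s in seqs.values()]
    n_seq, n_sites = len(rows), len(rows[0])
    if any(len(r) != n_sites for r in rows):
        raise ValueError("sequences must be aligned to the same length")
    n_gaps = sum(ch in GAP_CHARS for r in rows for ch in r)
    pis = invariant = 0
    for column in zip(*rows):
        counts = Counter(ch for ch in column if ch in BASES)
        # Parsimony informative: at least two bases, each seen at least twice.
        if sum(c >= 2 for c in counts.values()) >= 2:
            pis += 1
        # Invariant: exactly one base observed among the called characters.
        if len(counts) == 1:
            invariant += 1
    return {
        "n_seq": n_seq,
        "n_sites": n_sites,
        "PIS_abs": pis,
        "PIS_frac": pis / n_sites,
        "n_gaps": n_gaps,
        "prop_gaps": n_gaps / (n_seq * n_sites),
        "n_invariant": invariant,
        "prop_invariant": invariant / n_sites,
    }


# Hypothetical four-taxon alignment, eight sites long.
toy = {
    "taxon_A": "ACGTAC-A",
    "taxon_B": "ACGTTC-A",
    "taxon_C": "ATGTTCGA",
    "taxon_D": "ATGAACGA",
}
stats = summarise_alignment(toy)
```

The partially invariant and ambiguity columns are left out because their exact definitions are not given here.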
4.2: output/gene_trees
Description: 14 alternative sets of gene trees for the Hakeinae and outgroups estimated in IQ-Tree with different data preparation methods. Phylogenetic trees are in Newick format.
4.3: output/species_trees
Description: 12 species trees for the Hakeinae and outgroups. Three species trees were estimated with concatenated loci under different data cleaning strategies in IQ-Tree (concatenated) and nine species trees were estimated from the gene tree sets with ASTRAL-III (coalescent).
4.4: output/cvR2T
Description: Coefficient of variation of root-to-tip distances for each gene tree estimated from three alternative alignment sets (Low [L], Medium [M], and High [H]). The data are organised as a separate .rds file containing a list, with each list element corresponding to the tree ID (e.g., placement of tree in the gene_trees and the "tree" column of the alignment_summary.csv)
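For readers unfamiliar with the metric, the Python sketch below shows one way to compute a coefficient of variation of root-to-tip distances (standard deviation divided by mean). It is not the authors' R code. The nested-list tree encoding, the function names, and the use of the sample standard deviation are all assumptions.

```python
from statistics import mean, stdev


def root_to_tip(node, depth=0.0):
    """Yield (tip, distance from root); a node is a tip name or a list of (child, length)."""
    if isinstance(node, str):
        yield node, depth
    else:
        for child, length in node:
            yield from root_to_tip(child, depth + length)


def cv_root_to_tip(tree):
    """Coefficient of variation (sample sd / mean) of root-to-tip distances."""
    distances = [d for _, d in root_to_tip(tree)]
    return stdev(distances) / mean(distances)


# Hypothetical tree, equivalent to the Newick string ((A:1,B:1):1,C:2,D:3);
toy_tree = [([("A", 1.0), ("B", 1.0)], 1.0), ("C", 2.0), ("D", 3.0)]
cv = cv_root_to_tip(toy_tree)

# A tree whose tips are all equidistant from the root has a CV of zero.
clocklike_tree = [([("A", 1.0), ("B", 1.0)], 1.0), ("C", 2.0)]
cv_clocklike = cv_root_to_tip(clocklike_tree)
```

Larger values indicate greater departure from clock-like branch lengths in that gene tree.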
4.5: output/gCF
Description: Gene concordance factors estimated with IQ-Tree (--gcf) and associated standard outputs.
4.6: output/sCF
Description: Site concordance factors estimated with IQ-Tree (--scf) and associated standard outputs.
4.7: output/TCF
Description: Tip Concordance Factors (TCF) null model results. This folder contains a table which shows the TCF value for each species based on each species tree with alternative phylogenetic inference methods (IQ-Tree = con), data cleaning (Low [L], Medium [M], and High [H]), and loci filtering (FL0, FL1, FL2).
4.8: output/ABBA_BABA
Description: ABBA-BABA test results on cleaning and filtering treatments. The results for the ABBA-BABA test based on each of the 12 species-tree topologies are contained in separate .rds files. Each file contains a list, with each element representing an ABBA-BABA test on a different triad and standard output from the method of Rancilhac et al., 2021 (see main text for details). The folder also contains two further .rds files which show the results from 10,000 ABBA-BABA tests for deeper divergences across the tree for the H-FL2 and M-FL1 treatments. These files have a suffix "_10K.rds".
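The logic behind a tree-based ABBA-BABA style test on a triad can be sketched in Python as follows. This is a generic illustration, not the implementation of Rancilhac et al. (2021) and not the authors' R code. It relies on the general expectation that, under incomplete lineage sorting alone, the two discordant topologies for a triad occur at equal frequency, so a strong excess of one of them is consistent with introgression. The function name, the counts, and the choice of an exact two-sided binomial test are assumptions.

```python
from math import comb


def triad_asymmetry(n_disc1, n_disc2):
    """Return (D-like asymmetry, exact two-sided binomial p-value at p = 0.5)."""
    total = n_disc1 + n_disc2
    if total == 0:
        raise ValueError("no discordant gene trees for this triad")
    d = (n_disc1 - n_disc2) / total
    k = max(n_disc1, n_disc2)
    # Upper tail P(X >= k) for X ~ Binomial(total, 0.5); doubling it gives the
    # two-sided p-value because the distribution is symmetric at p = 0.5.
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return d, min(1.0, 2 * tail)


# Hypothetical counts of gene trees supporting each discordant topology.
d_stat, p_value = triad_asymmetry(30, 10)
d_null, p_null = triad_asymmetry(12, 12)
```

A value near zero with a large p-value is what incomplete lineage sorting alone would predict. The sign of the statistic shows which discordant topology is in excess.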
5. Figure_S5
Cladograms of Hakeinae with alternative phylogenetic inference methods (ASTRAL-III and IQ-Tree), data cleaning (Low [L], Medium [M], and High [H]), and loci filtering (FL0, FL1, FL2)
6. Version Changes
Dec 2024: Added additional accession numbers in Table S1 from backlogged samples
In short, this study used 551 sequences for 482 Proteaceae taxa, of which 186 sequences were obtained from an earlier study (Cardillo et al. 2017). These 186 sequences include 151 species of Hakea, four species of Grevillea (G. dimorpha, G. evanescens, G. hookeriana, and G. batrachioides), and Opisthiolepis heterophylla of the Hakeinae, together with eight Proteaceae outgroup taxa: Lomatia silaifolia, Stenocarpus davallioides, Alloxylon pinnatum, A. flammeum, Telopea speciosissima, Banksia paludosa, B. rufa, and Lambertia formosa. In addition, 368 samples were collected for novel DNA extraction and sequencing, sourced from fresh plant tissue obtained from wild or cultivated plants, or from dried herbarium specimens (see Table S1 for a list of samples). These included 320 species of Grevillea and Finschia chloroxantha from the Hakeinae, and additional Proteaceae outgroups: Stenocarpus milnei, Oreocallis mucronata, Macadamia integrifolia, Adenanthos glabrescens, Lambertia formosa, and Isopogon formosus. Samples were sequenced using Anchored Hybrid Enrichment.