Data from: Sources of gene tree discordance and their implications for systematics and evolution of a megadiverse Australian plant radiation (subtribe Hakeinae, Proteaceae)

Name: Sources of gene tree discordance and their implications for systematics and evolution of a megadiverse Australian plant radiation (subtribe Hakeinae, Proteaceae)
Creator: Alexander Skeels

Published Oct 22, 2024; Updated Dec 20, 2024 on Dryad. https://doi.org/10.5061/dryad.6t1g1jx6b

Data files

Oct 22, 2024 version files: 65.76 MB

BSSB_data_and_code.zip: 65.75 MB
README.md: 8.95 KB

Dec 20, 2024 version files: 65.46 MB

BSSB_data_and_code.zip: 65.45 MB
README.md: 9.06 KB

Abstract

Resolving phylogenetic relationships in the presence of conflicting signal across genes is one of the major challenges of the phylogenomic era. Conflicting signals can emerge from biological processes, such as incomplete lineage sorting or introgression, or have technical origins, such as from misaligned sequences. As such, decisions made in the process of estimating species trees may result in alternative tree topologies and large variation in branch support values with important systematic consequences. Here we compare alternative strategies for alignment cleaning, loci filtering, and phylogenetic estimation in 551 taxa in the Proteaceae subtribe Hakeinae, to explore how these methodological choices affect the estimation of relationships. We found that node support values across gene trees were low and gene discordance was high in the Hakeinae, particularly in lineages from the Temperate Forests biome of southeastern Australia. Higher stringency of alignment cleaning tended to decrease node support and filtering for desirable loci tended to increase gene concordance. Cleaning, filtering, and phylogenetic estimation method (short-cut coalescent or concatenation) have significant effects on tree topologies, with distinct clusters of similar tree topologies detected in tree space. Of note, when using concatenated approaches, the two largest Hakeinae genera, Hakea and Grevillea, were reciprocally monophyletic. However, using coalescent approaches, we regularly found that Hakea was nested within Grevillea. Our results suggest that widespread gene discordance may be the result of rapid radiation and incomplete lineage sorting, demonstrating the importance of assessing the drivers of discordance to understand phylogenetic relationships.

This README.txt file was generated 22.10.2024 by Alexander Skeels

GENERAL INFORMATION

1. Title of Dataset: Data from: 'Sources of gene tree discordance and their implications for systematics and evolution of a megadiverse Australian plant radiation (subtribe Hakeinae, Proteaceae)'

2. Author Information

Corresponding Investigator
Name: Dr Alexander Skeels
Institution: Research School of Biology, Australian National University, 46 Sullivans Creek Rd, Acton, ACT 0200, Australia
Email: alexander.skeels@gmail.com

Co-investigator 1
Name: Hervé Sauquet
Institution: National Herbarium of NSW, Botanic Gardens of Sydney, Mount Annan, NSW, Australia / Evolution and Ecology Research Centre, School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, NSW, 2052 Australia

Co-investigator 2
Name: Austin Mast
Institution: Department of Biological Science, Florida State University, 319 Stadium Drive, Tallahassee 32306, Florida, USA

Co-investigator 3
Name: Peter H. Weston
Institution: National Herbarium of NSW, Botanic Gardens of Sydney, Mount Annan, NSW, Australia

Co-investigator 4
Name: Peter M. Olde
Institution: National Herbarium of NSW, Botanic Gardens of Sydney, Mount Annan, NSW, Australia

Co-investigator 5
Name: Zoe K.M. Reynolds
Institution: Research School of Biology, Australian National University, 46 Sullivans Creek Rd, Acton, ACT 0200, Australia

Co-investigator 6
Name: Jéssica Fenker
Institution: Research School of Biology, Australian National University, 46 Sullivans Creek Rd, Acton, ACT 0200, Australia / Sciences Department, Museums Victoria, 11 Nicholson Street, Carlton, VIC 3053, Australia

Co-investigator 7
Name: Alan Lemmon
Institution: Department of Scientific Computing, Florida State University, Tallahassee, Florida, USA

Co-investigator 8
Name: Marcel Cardillo
Institution: Research School of Biology, Australian National University, 46 Sullivans Creek Rd, Acton, ACT 0200, Australia

3. Date of data collation: 2015-2024

4. Geographic location of data collection: Australia

5. Recommended citation for this dataset: Skeels et al. (2024), Data from: Sources of gene tree discordance and their implications for systematics and evolution of a megadiverse Australian plant radiation (subtribe Hakeinae, Proteaceae)

DATA & FILE OVERVIEW

Description of dataset

These data and R scripts were produced to infer the phylogenetic relationships among Proteaceae taxa in the subtribe Hakeinae and to investigate variation in node support and gene concordance.

Files:

1. Table_S1_R1.csv: Sequence labels for Hakeinae taxa

2. Data folder: includes all data used in this study

2.1: data/AHE_March2022_TrimmedAlignments
Description: 216 Anchored Hybrid Enrichment loci for 551 Hakeinae taxa and outgroups in fasta format

2.2: data/Hakeinae_ALA_records_cleaned.csv
Description: Point occurrence records for Hakeinae taxa from the Atlas of Living Australia

2.3: data/Hakeinae_biome_pam.csv
Description: Presence/absence matrix across biomes in biogeographic realms for Hakeinae taxa.

3. Scripts folder: Contains R scripts used to replicate analyses

3.1: scripts/script_0_discordance_ms_functions.R
Description: Custom functions used throughout R scripts.

3.2: scripts/script_1a_AHE_cleaning_H.R
Description: Cleaning AHE alignments with the "heavy" treatment

3.3: scripts/script_1a_AHE_cleaning_M.R
Description: Cleaning AHE alignments with the "moderate" treatment

3.4: scripts/script_1c_AHE_cleaning_summary.R
Description: Summarising cleaned AHE alignments with different statistics

3.5: scripts/script_2a_node_support_gene_trees.R
Description: Getting node support values from gene trees

3.6: scripts/script_2b_node_support_concatenated.R
Description: Getting node support values from concatenated trees

3.7: scripts/script_3a_filtering_gene_trees.R
Description: Filtering gene trees

3.8: scripts/script_4a_concordance_factors.R
Description: Exploring gene and site concordance factors

3.9: scripts/script_5a_tree_space.R
Description: Estimating and visualising topological differences in tree space

3.10: scripts/Script_6a_tip_concordance.R
Description: Tip concordance factors across phylogeny and geography

3.11: scripts/script_7a_introgression_tests.R
Description: ABBA-BABA tests for introgression

4. Outputs folder:

Contains 8 subfolders. Each subfolder is generated and populated using the R scripts and data.

4.1: output/alignment_summaries
Description: Two tables with summary statistics from multiple sequence alignment treatments.

4.1.1: node_support_LMH_raw.csv contains summary metrics of node support values estimated in IQ-Tree.
Columns are as follows:

tree = the tree ID corresponding to the position of that tree in the gene_trees set for the L, M, and H alignment cleaning strategies (below)
locus = the corresponding locus for each gene tree
mean_UFBoot = the mean ultrafast bootstrap value across all nodes in the tree, estimated in IQ-Tree
mean_SHaLRT = the mean SH-like approximate likelihood ratio test value across all nodes in the tree, estimated in IQ-Tree
prop_UFBoot = the proportion of nodes in the tree with an ultrafast bootstrap value greater than 90, estimated in IQ-Tree
prop_SHaLRT = the proportion of nodes in the tree with an SH-like approximate likelihood ratio test value greater than 80, estimated in IQ-Tree
all_set = the version of the cleaned alignment set (Low [L], Medium [M], and High [H])
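
As a rough illustration of how the four support metrics above could be derived from a single gene tree, the Python sketch below reads internal node labels from a Newick string. It assumes the labels take the form SH-aLRT/UFBoot, which is how IQ-Tree writes them when both tests are run together. The function name, cutoffs as parameters, and the example tree are hypothetical; the archived R scripts remain the reference implementation.

```python
import re

# Internal node labels of the assumed form ")SH-aLRT/UFBoot", e.g. ")85.3/97".
LABEL = re.compile(r"\)(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")


def node_support_summary(newick, ufboot_cutoff=90.0, shalrt_cutoff=80.0):
    """Summarise node support for one gene tree given as a Newick string."""
    pairs = [(float(a), float(b)) for a, b in LABEL.findall(newick)]
    if not pairs:
        raise ValueError("no SH-aLRT/UFBoot labels found")
    shalrt = [a for a, _ in pairs]
    ufboot = [b for _, b in pairs]
    n = len(pairs)
    return {
        "mean_UFBoot": sum(ufboot) / n,
        "mean_SHaLRT": sum(shalrt) / n,
        # Proportions are expressed on a 0-1 scale (an assumption).
        "prop_UFBoot": sum(v > ufboot_cutoff for v in ufboot) / n,
        "prop_SHaLRT": sum(v > shalrt_cutoff for v in shalrt) / n,
    }


# Hypothetical five-taxon tree with two labelled internal nodes.
example_tree = "((A:0.1,B:0.1)90.5/95:0.2,(C:0.1,D:0.1)70/80:0.2,E:0.3);"
summary = node_support_summary(example_tree)
```

Applied to every gene tree in a set, this would yield one row per tree, matching the layout described for node_support_LMH_raw.csv.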

4.1.2: alignment_summary_LMH.csv contains summary statistics for the sampling density, length, and phylogenetic information in each cleaned alignment set.

Columns are as follows:

locus = the corresponding locus for each alignment
n_seq = number of sequences in the alignment
n_sites = length of the alignment
PIS_frac = fraction of sites with parsimony-informative characters
PIS_abs = absolute number of parsimony-informative sites
n_gaps = total amount of missing data across all sequences
prop_gaps = proportion of the total alignment that is missing data
mean_gaps = mean number of gaps per locus
n_amb = total number of ambiguously called bases across all sequences
prop_amb = proportion of the total alignment that is ambiguously called bases
mean_amb = mean number of ambiguously called bases per locus
n_invariant = number of invariant sites in each locus
n_partial_invariant = number of partially invariant sites in each locus
prop_invariant = proportion of invariant sites in each locus
prop_partial_invariant = proportion of partially invariant sites in each locus
version = the version of the cleaned alignment set (Low [L], Medium [M], and High [H])
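
To make a few of these columns concrete, the following Python sketch computes n_seq, n_sites, PIS_abs, PIS_frac, n_gaps, and prop_gaps for one alignment. It is a minimal sketch under stated assumptions: a site is counted as parsimony-informative when at least two different bases each occur in at least two sequences, only "-" is counted as missing data, and the function name and toy alignment are hypothetical. The archived R scripts may treat gaps and ambiguity codes differently.

```python
from collections import Counter

BASES = set("ACGT")


def alignment_summary(seqs):
    """Compute a few per-locus statistics from {label: aligned sequence}."""
    rows = [s.upper() for s in seqs.values()]
    n_seq, n_sites = len(rows), len(rows[0])
    if any(len(r) != n_sites for r in rows):
        raise ValueError("sequences must be aligned to the same length")
    pis = 0
    for column in zip(*rows):
        counts = Counter(c for c in column if c in BASES)
        # Informative site: at least two bases, each seen in >= 2 sequences.
        if sum(1 for k in counts.values() if k >= 2) >= 2:
            pis += 1
    n_gaps = sum(r.count("-") for r in rows)  # assumption: "-" only
    return {
        "n_seq": n_seq,
        "n_sites": n_sites,
        "PIS_abs": pis,
        "PIS_frac": pis / n_sites,
        "n_gaps": n_gaps,
        "prop_gaps": n_gaps / (n_seq * n_sites),
    }


# Hypothetical four-taxon, five-site alignment.
toy = {"t1": "AAGT-", "t2": "AAGTA", "t3": "ACCTA", "t4": "ACCAA"}
stats = alignment_summary(toy)
```

In the toy alignment only the second and third columns are informative; the fourth has a base that appears once, so it does not count.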

4.2: output/gene_trees

Description: 14 alternative sets of gene trees for the Hakeinae and outgroups estimated in IQ-Tree with different data preparation methods. Phylogenetic trees are in Newick format.

4.3: output/species_trees

Description: 12 species trees for the Hakeinae and outgroups. Three species trees were estimated with concatenated loci under different data cleaning strategies in IQ-Tree (concatenated) and nine species trees were estimated from the gene tree sets with ASTRAL-III (coalescent).

4.4: output/cvR2T

Description: Coefficient of variation of root-to-tip distances for each gene tree estimated from three alternative alignment sets (Low [L], Medium [M], and High [H]). The data are organised as a separate .rds file containing a list, with each list element corresponding to the tree ID (e.g., the position of the tree in gene_trees and the "tree" column of the alignment_summary.csv).
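
The quantity stored here can be sketched in a few lines of Python. The example below parses a rooted Newick string with branch lengths, sums branch lengths from the root to each tip, and divides the standard deviation of those distances by their mean. The parser is deliberately minimal and the function names are hypothetical; using the sample standard deviation (n - 1 denominator) is an assumption, and the archived R scripts may differ in rooting and in how missing branch lengths are handled.

```python
import re
import statistics

DELIMS = {"(", ")", ",", ";"}


def _clade(tokens, pos):
    """Parse one clade; return ({tip: distance to clade's parent}, next pos)."""
    if tokens[pos] == "(":
        tips, pos = {}, pos + 1
        while True:
            child, pos = _clade(tokens, pos)
            tips.update(child)
            closing = tokens[pos] == ")"
            pos += 1  # step over "," or ")"
            if closing:
                break
        label = ""
        if pos < len(tokens) and tokens[pos] not in DELIMS:
            label, pos = tokens[pos], pos + 1
    else:
        label, pos = tokens[pos], pos + 1
        tips = {label.split(":")[0]: 0.0}
    # Branch length follows ":"; a missing length is treated as zero.
    length = float(label.split(":")[1]) if ":" in label else 0.0
    return {tip: dist + length for tip, dist in tips.items()}, pos


def root_to_tip(newick):
    """Return {tip: root-to-tip distance} for a rooted Newick string."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    distances, _ = _clade(tokens, 0)
    return distances


def cv_root_to_tip(newick):
    """Coefficient of variation (sample sd / mean) of root-to-tip distances."""
    values = list(root_to_tip(newick).values())
    return statistics.stdev(values) / statistics.mean(values)


distances = root_to_tip("((A:1,B:3):1,C:2);")
cv = cv_root_to_tip("((A:1,B:3):1,C:2);")
```

A perfectly clock-like (ultrametric) tree gives a value of zero, so larger values indicate stronger rate heterogeneity among lineages for that locus.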

4.5: output/gCF

Description: Gene concordance factors estimated with IQ-Tree (--gcf) and associated standard outputs.

4.6: output/sCF

Description: Site concordance factors estimated with IQ-Tree (--scf) and associated standard outputs.

4.7: output/TCF

Description: Tip Concordance Factors (TCF) null model results. This folder contains a table which shows the TCF value for each species based on each species tree with alternative phylogenetic inference methods (IQ-Tree = con), data cleaning (Low [L], Medium [M], and High [H]), and loci filtering (FL0, FL1, FL2).

4.8: output/ABBA_BABA

Description: ABBA-BABA test results on cleaning and filtering treatments. The results for the ABBA-BABA tests based on each of the 12 species-tree topologies are contained in separate .rds files. Each file contains a list, with each element representing an ABBA-BABA test on a different triad and the standard output from the method of Rancilhac et al., 2021 (see main text for details). The folder also contains two further .rds files which show the results from 10,000 ABBA-BABA tests for deeper divergences across the tree for the H-FL2 and M-FL1 treatments. These files have the suffix "_10K.rds".
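
For readers unfamiliar with the idea behind these tests, the Python sketch below shows the classic site-pattern form of the D-statistic, D = (nABBA - nBABA) / (nABBA + nBABA), for a triad P1, P2, P3 plus an outgroup O. This is a generic textbook formulation offered for orientation only; the files in this folder were produced with the method of Rancilhac et al., 2021, which may differ in its inputs and test statistic. The function name and the toy sequences are hypothetical.

```python
BASES = set("ACGT")


def d_statistic(p1, p2, p3, outgroup):
    """Classic ABBA-BABA D for four aligned sequences of equal length."""
    if len({len(p1), len(p2), len(p3), len(outgroup)}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    abba = baba = 0
    for a, b, c, o in zip(p1.upper(), p2.upper(), p3.upper(), outgroup.upper()):
        if not {a, b, c, o} <= BASES or c == o:
            continue  # skip gaps/ambiguities and sites where P3 is ancestral
        if a == o and b == c:
            abba += 1  # P2 shares the derived allele with P3
        elif a == c and b == o:
            baba += 1  # P1 shares the derived allele with P3
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites found")
    return abba, baba, (abba - baba) / (abba + baba)


# Hypothetical five-site alignment: three ABBA sites and one BABA site.
n_abba, n_baba, d = d_statistic("AGAAA", "GAGAC", "GGGAC", "AAAAA")
```

Under incomplete lineage sorting alone the two patterns are expected in roughly equal numbers, so D near zero is the null expectation; a consistent excess of one pattern is the usual signal of introgression.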

5. Figure_S5

Cladograms of Hakeinae with alternative phylogenetic inference methods (ASTRAL-III and IQ-Tree), data cleaning (Low [L], Medium [M], and High [H]), and loci filtering (FL0, FL1, FL2)

6. Version Changes

Dec 2024: Added additional accession numbers in Table S1 from backlogged samples