The complex interplay between evolutionary flexibility and diversification in a family of spiders
Data files
Oct 27, 2025 version files 685.91 MB
-
data_and_scripts.zip
661.12 MB
-
README.md
61.37 KB
-
README.pdf
325.46 KB
-
Supplementary_Data_2.zip
1.26 MB
-
Supplementary_Tables.zip
415.42 KB
-
Supplementary_Text_and_Figures.pdf
22.73 MB
Abstract
Understanding the mechanisms of how morphological traits drive speciation and contribute to species richness is pivotal in evolutionary biology. In this context, the evolutionary flexibility of morphological traits may play a significant role. Using the diverse daddy long-legs spiders, Pholcidae, which currently includes almost 2,000 described species, we explore the interplay between speciation rate, trait evolution rate, microhabitat shift rate, species richness, interspecific variability of body size, leg length, relative leg length, and leg proportions. We apply a combination of large-scale genomic and taxonomic sampling, and phylogenetic and comparative analyses to assess the dynamics of diversification and evolutionary flexibility (measured as either the standard variance or disparity of traits), as well as their interactions. We found that increased evolutionary flexibility is accompanied by accelerated rates in speciation and trait evolution, and by higher species richness. We also observed near-isometry of leg length and body size in the species-rich lineages, suggesting stabilizing selection, while positive allometry in the species-poor lineages indicates directional selection. Additionally, we found a positive correlation between trait evolution rates and microhabitat shifts. We argue that environmental heterogeneity and frequent microhabitat shifts may contribute to the origin and maintenance of evolutionary flexibility, which in turn influences the organisms’ ability to exploit new resources and habitats. On the other hand, our study suggests that a lack of evolutionary flexibility, e.g., due to dwarfism or gigantism, could be an evolutionary dead end. Our findings also support the idea that taxa do not always follow Cope’s rule, which postulates that body size increases with lineage age. This study enhances our understanding of the mechanisms of diversification, demonstrating that the evolutionary flexibility of morphological traits may vary among closely related taxa and is likely to have contributed to the uneven distribution of biodiversity across the tree of life.
URL: https://doi.org/10.5061/dryad.r2280gbnh
Authors:
Guanliang Meng1, Lars Podsiadlowski1, Dimitar Dimitrov2, and Bernhard A. Huber1
1 Zoological Research Museum Alexander Koenig, LIB, Bonn, Germany
2 University Museum of Bergen, University of Bergen, Bergen, Norway
Corresponding author: Guanliang Meng (G.Meng@leibniz-lib.de); Dimitar Dimitrov (dimitard.gwu@gmail.com)
ORCiD:
Guanliang Meng: https://orcid.org/0000-0002-6488-1527
Lars Podsiadlowski: https://orcid.org/0000-0001-7786-8930
Dimitar Dimitrov: https://orcid.org/0000-0001-5830-5702
Bernhard A. Huber: https://orcid.org/0000-0002-7566-5424
1. SUPPLEMENTARY MATERIAL
1. Supplementary Text and Figures: Supplementary_Text_and_Figures.pdf
2. Supplementary Tables: Supplementary_Tables.zip
3. Data (e.g., sequences, alignments) and scripts for target capture sequencing and comparative analysis: data_and_scripts.zip
4. UCE+Sanger-data-based trees: Supplementary_Data_2.zip
5. PDF version of the README file: README.pdf.
2. Further details on the file data_and_scripts.zip
This file contains the data, scripts, and commands used by the present study.
Folder structure:
.
├── 1_script
│ └── FASconCAT-G_v1.05.pl
├── 2_phyluce
│ ├── 1_raw_data
│ ├── 2_clean_data
│ ├── 3_assembly
│ ├── 4_detect_cross_contamination
│ ├── 5_match_contigs_to_probes
│ └── 6_extract_uce_loci
├── 3_UCE_phylogeny
│ ├── 1_trimmed_msa
│ ├── 2_tree_construction
│ └── 3_dating
├── 4_long-name-short-name-rela
│ ├── June-30-2024.six-genes-samples.short2long.tsv
│ └── June-30-2024.UCE-samples.short2long.tsv
├── 5_six_gene_phylogeny
│ ├── 1_trimmed_msa
│ ├── 2_tree_construction
│ └── 3_dating
├── 6_comparative_analysis
│ ├── 0.dated_trees
│ ├── 1.diversification.BAMM
│ ├── 2.trait_evolution.BAMM.subfam-level-sampling-fraction
│ ├── 3.model_fitting
│ ├── 4.phylogenetic_signal
│ ├── 5.ancestral_state
│ ├── 6.relationships
│ ├── 7.disparity.28-05-2024
│ ├── 8.linage_through_time
│ ├── readme.txt
│ └── Trait data for Meng et al 2025.xlsx
├── 7_UCE_and_six-gene_phylogeny
│ ├── 1_unpartitioned
│ ├── 2_partitioned
│ ├── April-9-2024.code.subfam.tsv
│ ├── combined_UCE+Sanger_trees.pdf
│ └── combined.code.subfam.tsv
└── readme.txt
2.1 Folder 1_script/
This folder contains the Perl script FASconCAT-G_v1.05.pl, which was from the publication:
Kück, P., Longo, G.C. FASconCAT-G: extensive functions for multiple sequence alignment preparations concerning phylogenetic studies. Front Zool 11, 81 (2014). https://doi.org/10.1186/s12983-014-0081-x
2.2 Folder 2_phyluce/
This folder contains the commands for generating ultra-conserved elements (UCEs) from raw sequencing data using the Phyluce pipeline (Faircloth 2016).
This folder has the following subfolders:
├──1_raw_data
├──2_clean_data
├──3_assembly
├──4_detect_cross_contamination
├──5_match_contigs_to_probes
└──6_extract_uce_loci
2.2.1 Subfolder 1_raw_data/
Please download the raw data from the NCBI SRA database (BioProject No: PRJNA851037), see Supplementary Table S1 for more details.
2.2.2 Subfolder 2_clean_data/
Task: Raw data filtering.
This folder contains the command for running fastp (v0.23.1) (Chen et al. 2018), which is to filter raw data from the folder 1_raw_data and generate clean data.
Briefly, reads would be filtered out if: (1) one read contained more than ten Ns; (2) more than 40% of bases were unqualified; (3) reads were duplicates; (4) reads were shorter than 40 bp.
2.2.3 Subfolder 3_assembly/
Task: UCE contig assembly.
Description: This folder contains the command for running the executable phyluce_assembly_assemblo_spades from the Phyluce pipeline (Faircloth 2016). The result is a contig file for each sample.
Default parameters were used except for --cores 16 --memory 100 (i.e., 16 CPUs and a maximum of 100 GB RAM would be applied)
The file assembly.conf contains the "code" (a short name for each sample) and the path to the folder containing fastq 1 and fastq 2 files of each sample. Users should adjust this accordingly.
When the assembly for all samples is finished, we may copy all assembled contig files into the same directory (e.g., spades-assemblies/contigs)
2.2.4 Subfolder 4_detect_cross_contamination/
Task: Filter out potential cross-sample contaminants using Blast.
To this end, we firstly make each sequence ID unique by adding the sample "code" to the sequence ID (see the folder 1_add_sampleID_to_seq-title/), before merging all sequences of all samples into a single file (2_all-blast-all/all_assemblies.fa).
Next, the 2_all-blast-all/all_assemblies.fa file was blasted against itself using the blastn command with default parameters. Blast results with "query" and "subject" coming from the same sample are then excluded. Blast results with alignment similarity >= 99% and alignment length >= 100 bp are considered as potential contamination. However, they could also be highly conserved elements such as rDNA genes.
Finally, we used the following criteria to keep or remove potential contaminants (see the folder 3_filter_based_on_abundance/): if the abundance difference (output by SPAdes) between the query and subject sequences was less than 10X, both sequences were marked as 'removed'. If both sequences had abundances ≥ 20X and their abundance difference exceeded 10X, the sequence with lower abundance was marked as 'removed'. Any sequences marked as 'removed' were excluded from the assembly.
All the clean sequences are presented in the folder 4_clean_assemblies/.
2.2.5 Subfolder 5_match_contigs_to_probes/
Task: Match the assembled contigs to the probe set (see the fileSpiderUCE_probe_set_men13099-sup-0002-Supinfo.fasta) using the command phyluce_assembly_match_contigs_to_probes from the Phyluce pipeline.
The probe set is provided by the following publication:
Kulkarni, S., Wood, H., Lloyd, M., and Hormiga, G. (2020), Spider-specific probe set for ultraconserved elements offers new perspectives on the evolutionary history of spiders (Arachnida, Araneae), Mol. Ecol. Resour., 20 (1), 185-203.
2.2.6 Subfolder 6_extract_uce_loci/
Task: Extact homologous UCE among species using the command phyluce_assembly_get_fastas_from_match_counts from the Phyluce pipeline.
Matrices 1 to 10 with various gene and taxon coverages are created, with the help of PhyloMAd, BMGE, and RY-coding. They are placed in the folder ../3_UCE_phylogeny/1_trimmed_msa/.
2.3 Folder 3_UCE_phylogeny/
Task: Phylogenetic inference using UCE datasets.
This folder has the following subfolders:
├──1_trimmed_msa
├──2_tree_construction
└──3_dating
2.3.1 Subfolder 1_trimmed_msa/
This folder contains the sequences of Matrices 1 to 10. Each matrix has its own folder.
├── matrix_1
├── matrix_2
├── matrix_3
├── matrix_4
├── matrix_5
├── matrix_6
├── matrix_7
├── matrix_8
├── matrix_9
└── matrix_10
To create a concatenated alignment for each matrix, users should enter each directory and run the following command, for instance,
# concatenate the alignments to Phylip format
cd 3_UCE_phylogeny/1_trimmed_msa/matrix_3
perl ../../../1_script/FASconCAT-G_v1.05.pl -s -p -p -n -l
The files named FcC_supermatrix.phy and FcC_supermatrix_partition.txt will then be created, which can be used for subsequent tree construction.
2.3.2 Subfolder 2_tree_construction/
This folder contains tree construction commands and the resulting trees for each matrix. Each matrix has its own folder:
.
├── matrix_1
│ └── 1_Partitioned_ML.IQ-TREE
│ ├── iqtree.sh
│ └── Matrix-1.gblocks.IQ-TREE.partitioned.tre
├── matrix_10
│ └── 1_unPartitioned_ML.IQ-TREE
│ ├── iqtree.sh
│ └── Matrix-10.IQ-TREE.unpartitioned.tre
├── matrix_2
│ └── 1_Partitioned_ML.IQ-TREE
│ ├── iqtree.sh
│ └── Matrix-2.gblocks.IQ-TREE.partitioned.tre
├── matrix_3
│ ├── 1_Partitioned_ML.IQ-TREE
│ │ ├── iqtree.sh
│ │ └── Matrix-3.TrimAl.IQ-TREE.partitioned.clean-support.tre
│ ├── 2_Partitioned_ML.RAxML-NG
│ │ ├── Matrix-3.TrimAl.RAxML-NG.partitioned.tre
│ │ └── raxml.sh
│ ├── 3_PhyloBayes-MPI
│ │ ├── check
│ │ │ └── record.sh
│ │ ├── Matrix-3.TrimAl.phylogbayes-MPI.unpartitioned.tre
│ │ ├── run1
│ │ │ ├── b.sh
│ │ │ └── sub.sh
│ │ └── run2
│ │ ├── b.sh
│ │ └── sub.sh
│ ├── 4_ASTRAL
│ │ ├── 1.locus_tree
│ │ │ └── batch_tree.sh
│ │ └── 2_species_tree
│ │ ├── astral.sh
│ │ └── Matrix-3.TrimAl.Astral.UFBS30.clean-support.tre
│ ├── 5_SVDquartets
│ │ ├── 1_individual_msa
│ │ ├── Matrix-3.TrimAl.SVDquartets.unpartitioned.tre
│ │ ├── record.sh
│ │ └── run1
│ │ ├── cmd1.txt
│ │ └── run1.sh
│ └── 6_Constrained_partitioned
│ ├── monophyletic_other_Arteminae_plus_Artema
│ │ ├── Arteminae.tre
│ │ └── multi-partition.iqtree.sh
│ └── Ninetinae_sister_to_all_other_pholcids
│ ├── multi-partition.iqtree.sh
│ └── Ninetaine_sister_to_all_others.tre
├── matrix_4
│ └── 1_Partitioned_ML.IQ-TREE
│ ├── iqtree.sh
│ └── Matrix-4.TrimAl.IQ-TREE.partitioned.tre
├── matrix_5
│ └── 1_Partitioned_ML.IQ-TREE
│ ├── iqtree.sh
│ └── Matrix-5.TrimAl.IQ-TREE.partitioned.tre
├── matrix_6
│ └── 1_Partitioned_ML.IQ-TREE
│ ├── iqtree.sh
│ └── Matrix-6.TrimAl.IQ-TREE.partitioned.tre
├── matrix_7
│ └── 1_Partitioned_ML.IQ-TREE
│ ├── iqtree.sh
│ └── Matrix-7.IQ-TREE.partitioned.tre
├── matrix_8
│ └── 1_Partitioned_ML.IQ-TREE
│ ├── iqtree.sh
│ └── Matrix-8.IQ-TREE.partitioned.tre
└── matrix_9
└── 1_unPartitioned_ML.IQ-TREE
├── iqtree.sh
└── Matrix-9.IQ-TREE.unpartitioned.tre
Example
I will explain the tree construction steps using matrix_3/ as an example.
The folder matrix_3/ has the following content:
├── 1_Partitioned_ML.IQ-TREE
├── 2_Partitioned_ML.RAxML-NG
├── 3_PhyloBayes-MPI
├── 4_ASTRAL
├── 5_SVDquartets
└── 6_Constrained_partitioned
IQ-TREE
In the folder 1_Partitioned_ML.IQ-TREE/, we perform partitioned analysis using the maximum likelihood (ML) method IQ-TREE:
# Partitioned ML analysis with IQ-TREE
aln=../../../1_trimmed_msa/matrix_3/FcC_supermatrix.phy
partition=../../../1_trimmed_msa/matrix_3/FcC_supermatrix_partition.txt
/home/gmeng/soft/iqtree-2.2.0-Linux/bin/iqtree2 \
-s $aln \
-p $partition \
--prefix m3_partitioned \
-m MFP+MERGE \
--sampling GENESITE \
-pers 0.2 -nstop 500 -B 2000 -bnni --alrt 2000 -T 24 --runs 10 >i.log 2>i.err
Best-fit models of molecular evolution were determined by ModelFinder (Kalyaanamoorthy et al. 2017) in IQ-TREE automatically (-m MFP+MERGE). To overcome local optima during heuristics, we performed ten independent runs on each dataset, with a smaller perturbation strength (-pers 0.2) and a larger number of stop iterations (-nstop 500) (Nguyen et al. 2015; Liu et al. 2024). Branch supports were evaluated with 2000 ultrafast bootstrap (UFBoot) (Minh et al. 2013) considering potential model violations (-B 2000 -bnni --sampling GENESITE). The SH-aLRT branch test (Guindon et al. 2010) was performed using 2,000 bootstrap replicates.
RAxML-NG
In the folder 2_Partitioned_ML.RAxML-NG/, we perform ML analysis using RAxML-NG.
aln=../../../1_trimmed_msa/matrix_3/FcC_supermatrix.phy
partition=../../../1_trimmed_msa/matrix_3/FcC_supermatrix_partition.txt
sed 's#DNA#GTR+R4+I+FO#g' $partition > partition.txt
# RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.
/home/gmeng/.conda/envs/mybase/bin/raxml-ng \
--all \
--threads 8 \
--msa $aln \
--model partition.txt \
--brlen scaled \
--bs-trees autoMRE \
--tree pars{25},rand{25} \
--seed 2 \
--prefix T1
During each RAxML-NG run, the GTR+R4+I+FO model, and 25 parsimony starting trees + 25 random starting trees were used, and the number of bootstrap replicates was automatically determined (--bs-trees autoMRE) (Pattengale et al. 2009).
PhyloBayes-MPI
In the folder 3_PhyloBayes-MPI/, to model both rate-heterogeneity across sites and pattern-heterogeneity across sites (i.e., across-site compositional heterogeneity), we applied the CAT-GTR+G4 model (Lartillot and Philippe 2004) in PhyloBayes-MPI. It is less sensitive to systematic errors than classical empirical models (Lartillot et al. 2013).
module load phylobayes_mpi/1.8
export OMPI_MCA_opal_cuda_support=false
aln=../../../../1_trimmed_msa/matrix_3/FcC_supermatrix.phy
mpirun -np $NSLOTS pb_mpi -d $aln -cat -gtr -dgam 4 T1
Note: At least two runs (T1, T2) should be performed, and MCMC chain convergence must be examined:
module load phylobayes_mpi/1.8
export OMPI_MCA_opal_cuda_support=false
tracecomp -x 20000 T1.trace T2.trace > tracecomp.log 2>&1
bpcomp -x 20000 T1.treelist T2.treelist > bpcomp.log
Please check the manual of PhyloBayes-MPI to learn how to determine the convergence of MCMC chains.
ASTRAL
Unlike the previous concatenation-based method, to account for incomplete lineage sorting (ILS) and mitigate the impact of missing data, we perform two-step multi-species coalescence (MSC) tree inference using ASTRAL in the folder 4_ASTRAL/. During the process, gene trees are first inferred using IQ-TREE:
cd 3_UCE_phylogeny/2_tree_construction/matrix_3/4_ASTRAL/1.locus_tree
ls ../../../../1_trimmed_msa/matrix_3/*fasta | while read f ; do a=$(basename $f .fasta) ; mkdir -p $a ; echo "/home/gmeng/soft/iqtree-2.2.0-Linux/bin/iqtree2 -s ../$f --prefix $a -pers 0.2 -nstop 500 -B 2000 -bnni -T 2 >i.log 2>i.err" >$a/loci.iqtree.sh ; cd $a ; sh loci.iqtree.sh ; cd ../ ; done
The internal nodes with low ultra-fast bootstrap supports (<= 30) of gene trees are collapsed, and after which, a species tree is summarized:
cd 3_UCE_phylogeny/2_tree_construction/matrix_3/4_ASTRAL/2_species_tree
# cat all genes
cat ../1.locus_tree/*/*.treefile >gene_trees.txt
nw_ed gene_trees.txt 'i & b<=30' o > gene_trees.UFBS30.txt
java_lib=/home/gmeng/soft/ASTRAL-single-thread-v5.7.8/Astral/
jar=/home/gmeng/soft/ASTRAL-single-thread-v5.7.8/Astral/astral.5.7.8.jar
java \
-Djava.library.path=$java_lib \
-jar $jar \
-i gene_trees.UFBS30.txt \
-o species.UFBS30.tre \
2>species.UFBS30.tre.err
SVDquartets
The ASTRAL method is sensitive to gene tree error, for instance, due to the few informative sites in the short alignments. Therefore, we also tested for the SVDquartets, which infers a species tree based on site patterns based on MSC.
The parameters file cmd1.txt:
log file=q1.log;
svdq evalQuartets=all seed=135454568 treeInf=QFM bootstrap=multilocus loci=combined nreps=200 nthreads=24 treemodel=mscoalescent treefile=concat.SVDquartets.bs.tre replace=yes ambigs=missing plus2=yes ;
saveTrees file=SVDquartets.run3.species.tre supportValues=nodeLabels ;
Run the analysis:
/home/gmeng/soft/bin/paup4a168_centos64 ../concat_msa_nexus/concat_msa_nexus.loci-renamed.nexus < cmd1.txt
You may test different seeds and check if the topologies of the species trees are consistent.
Constrained tree search and topology test
We tested two hypotheses: (1) the traditional Arteminae (i.e., ‘other Arteminae’ + Artema) is a monophyletic group. (2) Ninetinae is sister to all other pholcids.
.
├── monophyletic_other_Arteminae_plus_Artema
│ ├── Arteminae.tre
│ └── multi-partition.iqtree.sh
└── Ninetinae_sister_to_all_other_pholcids
├── multi-partition.iqtree.sh
└── Ninetaine_sister_to_all_others.tre
2.3.3 Subfolder 3_UCE_phylogeny/3_dating/
Task: Divergence time calibration using MCMCTREE.
We followed Magalhaes et al. (2020) for the justifications of the fossil assignments; see Supplementary Table S3 for details.
The oldest representative fossil of each clade was used to calibrate its divergence from the sister group (stem-dating), following Benton and Donoghue (2006) and Magalhaes et al. (2020). The soft-bound strategy was employed. We set the age bounds of the root (Synspermiata + Hypochilidae) as follows: the minimum bound (164 million years ago, Ma) reflects the placement of the genus Eoplectreurys as a stem-Synspermiata, based on the oldest known fossil species, E. gertschi Selden & Huang (Gao and Ren 2006; Magalhaes et al. 2020); the conservative maximum bound (400 Ma) was chosen because no Araneae stem fossils older than 383 Ma are known (Magalhaes et al. 2020).
Two independent runs are performed, and the convergence of MCMC chains is examined.
For learning how to perform divergence time calibration using MCMCTREE, please refer to:
- Dos Reis M, Yang Z. Bayesian Molecular Clock Dating Using Genome-Scale Datasets. Methods Mol Biol. 2019;1910:309-330. doi: 10.1007/978-1-4939-9074-0_10. PMID: 31278669.
- https://github.com/mariodosreis/divtime
2.4 Folder 4_long-name-short-nam/
Description: These two files below indicate the relationships between the sample "code" and long sample names.
.
├── June-30-2024.six-genes-samples.short2long.tsv
└── June-30-2024.UCE-samples.short2long.tsv
2.5 Folder 5_six_gene_phylogeny/
Task: Phylogenetic tree inference using the six-gene matrices + divergence time calibration.
This folder has the following content:
.
├── 1_trimmed_msa
├── 2_tree_construction
└── 3_dating
2.5.1 Subfolder 1_trimmed_msa/
This folder contains alignments derived from 10 different trimming strategies:
.
├── trimmed_with_ClipKIT_gappy
├── trimmed_with_ClipKIT_kpi
├── trimmed_with_ClipKIT_kpi-gappy
├── trimmed_with_ClipKIT_kpi-smart-gap
├── trimmed_with_ClipKIT_kpic
├── trimmed_with_ClipKIT_kpic-gappy
├── trimmed_with_ClipKIT_kpic-smart-gap
├── trimmed_with_ClipKIT_smart-gap
├── trimmed_with_Gblocks
└── trimmed_with_TrimAl
2.5.2 Subfolder 2_tree_construction/
Task: To perform phylogenetic tree inference using the six-gene matrices.
As a constraint (the file clean.constraint.tre), we used the tree from the IQ-TREE partitioned ML analysis of matrix 3.
This folder contains the tree construction commands and the resulting trees:
.
├── trimmed_with_ClipKIT_gappy
│ ├── clean.constraint.tre
│ ├── clean.input.ultrametric.tree
│ └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpi
│ ├── clean.constraint.tre
│ ├── July-9-2023-v1.six-gene.ClipKIT_kpi.partitioned.IQ-TREE.rooted.treePL-dated.tre
│ └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpi-gappy
│ ├── clean.constraint.tre
│ ├── July-9-2023-v1.six-gene.ClipKIT_kpi-gappy.partitioned.IQ-TREE.rooted.treePL-dated.tre
│ └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpi-smart-gap
│ ├── clean.constraint.tre
│ ├── July-9-2023-v1.six-gene.ClipKIT_kpi-smart-gap.partitioned.IQ-TREE.rooted.treePL-dated.tre
│ └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpic
│ ├── clean.constraint.tre
│ ├── July-9-2023-v1.six-gene.ClipKIT_kpic.partitioned.IQ-TREE.rooted.treePL-dated.tre
│ └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpic-gappy
│ ├── clean.constraint.tre
│ ├── July-9-2023-v1.six-gene.ClipKIT_kpic-gappy.partitioned.IQ-TREE.rooted.treePL-dated.tre
│ └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpic-smart-gap
│ ├── clean.constraint.tre
│ ├── clean.input.ultrametric.tree
│ └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_smart-gap
│ ├── clean.constraint.tre
│ ├── July-9-2023-v1.six-gene.ClipKIT_smart-gap.partitioned.IQ-TREE.rooted.treePL-dated.no-outgroup-and-duplicate.no-outgroup-and-duplicate.binary.ultrametric.tre
│ └── multi-partition.iqtree.sh
├── trimmed_with_Gblocks
│ ├── clean.constraint.tre
│ ├── July-9-2023-v1.six-gene.Gblocks.partitioned.IQ-TREE.rooted.treePL-dated.tre
│ └── multi-partition.iqtree.sh
└── trimmed_with_TrimAl
├── clean.constraint.tre
├── July-9-2023-v1.six-gene.TrimAl.partitioned.IQ-TREE.rooted.treePL-dated.tre
└── multi-partition.iqtree.sh
2.5.3 Subfolder 3_dating/
Task: Calibrating divergence times in the six-gene trees based on the dated UCE-matrix3-IQ-TREE phylogeny.
Divergence times of the six-gene trees are estimated using TreePL (v1.0) (Smith and O'Meara 2012) and the R function congruify.phylo() in geiger (v2.0.11) (Pennell et al. 2014), with our dated UCE-matrix 3 tree (see the section Divergence time calibration) used as the reference tree in all analyses.
Since our reference tree is a subset of each of the target trees in terms of tip taxa and topology, the concordant nodes between the reference tree and a target tree can be easily determined by the congruify.phylo() function (Eastman et al. 2013). Thus, the divergence times of all nodes in the reference tree serve as calibration points for the target tree, which are passed to TreePL via congruify.phylo() to calibrate the node ages of the target tree (Eastman et al. 2013).
setwd('./')
library(geiger)
# the input six-gene tree
target_tree_file=
out_tree=
target_tree=read.tree(target_tree_file)
# the dated UCE-matrix3-IQ-TREE tree
ref_tree <- read.tree('../../../reference_tree.mean-time-only.with-CH-codes.tre')
res <- congruify.phylo(ref_tree, target_tree, scale='treePL', ncores = 2)
write.tree(res$phy, file=out_tree)
2.6 Folder 6_comparative_analysis/
Tasks:
- Modeling speciation/extinction dynamics as well as trait evolution
- Testing for correlations between speciation/extinction and trait evolution
We focus on four traits:
- body size, using male carapace width as a proxy;
- leg length, using the combined length of male tibia 1 + metatarsus 1 as a proxy, because these can be measured easily and precisely, and they are available from the taxonomic literature;
- relative leg length, which is the ratio of leg length to body size, and
- leg proportion, which is the ratio of male metatarsus 1 to tibia 1; it can be used as a proxy for microhabitat preference (i.e., “space”, “leaf”, and “ground”), in which leaf and space living pholcids tend to have larger leg proportion values (Eberle et al. 2018).
Note: These trait data (please refer to the file Trait data for Meng et al 2025.xlsx) have also been submitted to the MorphoBank (http://morphobank.org/permalink/?P6008). During the review process of the manuscript, reviewers may access the data by visiting https://morphobank.org/index.php/LoginReg/form, then enter the project number P6008 in the email address field and the reviewer login password (ClipKIT_smart-gap_2025). Once the manuscript is accepted, the dataset and this repository (https://doi.org/10.5061/dryad.r2280gbnh) will be released and made publicly accessible.
Input trees:
To explicitly account for the impact of topological uncertainty on our results while ensuring analytical tractability, we selected three representative trees for subsequent comparative analyses:
- a primary tree based on the ClipKIT smart-gap trimmed-MSA;
- an alternative tree based on the ClipKIT gappy trimmed-MSAs;
- another alternative tree based on the kpic-smart-gap trimmed-MSAs.
These trees were chosen based on two criteria:
- minimum number of species obviously misplaced at the subfamily level; and
- monophyly of uncontested genera.
This folder contains the following content:
.
├── 0.dated_trees
├── 1.diversification.BAMM
├── 2.trait_evolution.BAMM.subfam-level-sampling-fraction
├── 3.model_fitting
├── 4.phylogenetic_signal
├── 5.ancestral_state
├── 6.relationships
├── 7.disparity.28-05-2024
├── 8.linage_through_time
├── Trait data for Meng et al 2025.xlsx
└── readme.txt
The following books are a good start to learn comparative analysis:
- Phylogenetic Comparative Methods (by Luke J. Harmon; https://lukejharmon.github.io/pcm/pdf/phylogeneticComparativeMethods.pdf
- Phylogenetic Comparative Methods in R (by Liam J. Revell and Luke J. Harmon; http://www.phytools.org/Rbook/)
2.6.0 Subfolder 0.dated_trees/
Description: Tree files and Trait files.
For instance, for the ClipKIT_smart-gap six-gene trees:
Filename explanations:
*.body_size_and_leg_length.*: relative leg length, which is the ratio of leg length to body size.*.leg_length.*: leg length (using the combined length of male tibia 1 + metatarsus 1 as a proxy)*.m_carapace_width.*: body size (using male carapace width as a proxy)*.met_to_tibia_ratio.*: leg proportion (the ratio of male metatarsus 1 to tibia 1; it can be used as a proxy for microhabitat preference (i.e., “space”, “leaf”, and “ground”), in which leaf and space living pholcids tend to have larger leg proportion values (Eberle et al. 2018))*.microhabitat.*: the raw microhabitat data (only available for a few species)
Tree files:
$ ls 0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/*.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.body_size_and_leg_length.final.codes.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.leg_length.final.codes.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.m_carapace_width.final.codes.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.met_to_tibia_ratio.final.codes.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.microhabitat.final.codes.ultrametric.tre
Trait files:
$ ls 0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/*.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.body_size_and_leg_length.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.leg_length.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.m_carapace_width.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.met_to_tibia_ratio.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.microhabitat.final.tsv
2.6.1 Subfolder 1.diversification.BAMM/
Task: Modeling speciation and extinction rate dynamics using BAMM with estimated species richness and currently known species richness. Genus-specific sampling fractions are applied.
2.6.1.1 Folder May-14-2024.Estimated/2.partitioned/treePL/
Task: Modeling speciation and extinction rate dynamics using BAMM with estimated species richness.
The analyses are run on three selected six-gene trees:
.
├── ClipKIT_gappy/
├── ClipKIT_kpic-smart-gap/
└── ClipKIT_smart-gap/
For instance, in the ClipKIT_smart-gap/ folder, there are the following folders:
.
├── 1.get_prior
├── 2.run1
└── 3.run2
We calculate the priors of related parameters (lambdaInitPrior, lambdaShiftPrior, and muInitPrior ) in the 1.get_prior/ folder:
$ cat get_prior.R
library(BAMMtools)
setBAMMpriors(read.tree("clean.input.ultrametric.tree"))
And then run the analysis in 2.run1/ and 3.run2/ after modify the parameters in the corresponding control_diversification.txt file:
$ cd ../2.run1
$ cat b.sh
/home/gmeng/soft/bamm-v2.5.0/build/bamm -c control_diversification.txt
Note that, in the control_diversification.txt file, we use:
modeltype = speciationextinction
# Specify "speciationextinction" or "trait" analysis
treefile = ../1.get_prior/clean.input.ultrametric.tree
# File name of the phylogenetic tree to be analyzed
sampleProbsFilename = ../1.get_prior/EstimatedNum.tip.group.frac.tsv
# File name containing species-specific sampling fractions
expectedNumberOfShifts = 50
# prior on the number of shifts in diversification
# Suggested values:
# expectedNumberOfShifts = 1.0 for small trees (< 500 tips)
# expectedNumberOfShifts = 10 or even 50 for large trees (> 5000 tips)
The ../1.get_prior/EstimatedNum.tip.group.frac.tsv file specifies the genus-specific sampling fractions.
Now, we have:
$ ls *nb
step2_Analysis_of_rate_shifts.ipynb
step3_rate_through_time_by_groups.ipynb
In the Jupyter Notebook file step2_Analysis_of_rate_shifts.ipynb, we check if the MCMC chains converge, draw mean phylorate plots (i.e., speciation, extinction, net diversification rates), calculate clade-specific evolutionary rates, etc., following http://bamm-project.org/documentation.html.
Furthermore, we draw rate-through-time plots in the Jupyter Notebook file step3_rate_through_time_by_groups.ipynb.
2.6.1.2 Folder May-14-2024.Current/2.partitioned/treePL/
Task: Modeling speciation and extinction rate dynamics using BAMM with currently known species richness.
Similar to May-14-2024.Estimated/2.partitioned/treePL/.
2.6.2 Subfolder 2.trait_evolution.BAMM.subfam-level-sampling-fraction/
Task: Similar to 1.diversification.BAMM/, but with subfamily-specific sampling fractions.
2.6.3 Subfolder 3.model_fitting/
Task: Perform model fitting analyses for the four traits.
.
├── 1.body_size/
├── 2.leg_length/
├── 3.leg_length.to.body_size.ratio/
└── 5.met_to_tibia.ratio/
For instance, for the relative leg length (3.leg_length.to.body_size.ratio/), we run
Rscript -e "rmarkdown::render('trait_evo_model_fitting.Rmd')"
In the trait_evo_model_fitting.Rmd file, we test multiple trait evolution models and find the best-fitted model:
EvoModels <- c("BM","OU","EB","rate_trend","lambda","kappa","delta","mean_trend","white")
fitEvoModels <- lapply(EvoModels,
FUN = function(x) fitContinuous(tree,
trait_vector, SE=NA, model = x, ncores = 8,
control = list(niter=30000, method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent", "subplex"))))
AICs <- sapply(fitEvoModels, FUN = function(x) x$opt$aicc)
names(AICs) <- EvoModels
AICtable <- aicw(AICs)
AICtable
2.6.4 Subfolder 4.phylogenetic_signal/
Task: Calculating phylogenetic signals of the four traits with Lambda and Blomberg’s K.
For instance, check the Jupyter Notebook file: 3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/phylogenetic_signal.ipynb
2.6.5 Subfolder 5.ancestral_state/
Task: Perform ancestral state reconstruction for the four traits.
.
├── 1.body_size
├── 2.leg_length
├── 3.leg_length.to.body_size.ratio
└── 5.met_to_tibia.ratio
Let's use relative leg length and the ClipKIT_smart-gap tree as an example (5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/)::
The following R scripts are used for constructing the ancestral states of the trait based on different trait evolution models (BM, EB, OU, trend):
$ ls -l ace.*.r | awk '{print $NF}'
ace.BM.r
ace.EB.r
ace.OU.r
ace.trend.r
For instance, here we test the BM model:
tree_file <- "input.tre"
trait_file <- "input.trait.txt"
tree <- read.tree(file=tree_file)
length(tree$tip.label)
df <- read.table(trait_file)
head(df)
trait_vector0 <- as.vector(df[,2])
head(trait_vector0)
names(trait_vector0) <- df[,1]
head(trait_vector0)
matched_tree_trait_vector <- match.phylo.data(tree, trait_vector0)
trait_vector <- matched_tree_trait_vector$data
fit <- ace(trait_vector, tree, type="continuous", method="REML", CI=TRUE, model="BM")
save(fit, file='fit.BM.Rdata')
Phenograms
After obtaining the ancestral states, we draw the phenograms for all species and subfamilies, respectively. For the subfamily phenogram, we color each subfamily in the whole tree (grey background), so we can see how the trait of each subfamily evolved:
# group information specified by the file 'code.subfam.tsv'
subfam_group <- read.table('code.subfam.tsv', header = TRUE)
subfam_group_2 <- list(
Artema=subfam_group[subfam_group$subfam == "Artema", ]$code,
Priscula = subfam_group[subfam_group$subfam == "Priscula", ]$code,
Arteminae = subfam_group[subfam_group$subfam == "Arteminae", ]$code,
Modisiminae = subfam_group[subfam_group$subfam == "Modisiminae", ]$code,
Ninetinae = subfam_group[subfam_group$subfam == "Ninetinae", ]$code,
Pholcinae = subfam_group[subfam_group$subfam == "Pholcinae", ]$code,
Smeringopinae = subfam_group[subfam_group$subfam == "Smeringopinae", ]$code)
tree4 <- groupOTU(tree3, subfam_group_2, 'group')
# create a 'group' item.
pdf(file = "trait.ace.BM.phenogram.color-by-subfam.pdf",
width = 8, height = 12)
p <- ggtree(tree4, aes(color=group), yscale = "trait", alpha=.9)
revts(p) + theme_minimal()
dev.off()
However, the ancestral state reconstruction step can take a long time to finish, and I want to test different visualization methods. One solution is to first save the ancestral states into a file (e.g., the R code save(fit, file='fit.BM.Rdata') in the ace.BM.r file). In the end, we use the R script 5.ancestral_state/4.m_tibia_1.to.body_size.ratio/2.partitioned/treePL/ClipKIT_kpic-smart-gap/2nd_phenogram-highlight/2nd_draw_phenogram.highlight.r to achieve better visualization:
Linear Phenograms
$ cd 5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/2nd_phenogram-highlight
$ Rscript /path/to/5.ancestral_state/4.m_tibia_1.to.body_size.ratio/2.partitioned/treePL/ClipKIT_kpic-smart-gap/2nd_phenogram-highlight/2nd_draw_phenogram.highlight.r -f ../fit.BM.Rdata -p BM
Circular Phenograms
Using the R script 5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/recoloring_phenogram.circular/recoloring_phenogram.circular.R, we can draw circular phenograms, coloring the tip labels by subfamily :
$ cd 5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/recoloring_phenogram.circular
$ Rscript $script_file -f ../fit.BM.Rdata -p BM
Furthermore, we also separate the ancestral states for different subfamilies using the Jupyter Notebook 5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/trait_std_slop_by_group.ipynb, which generates a file named BM.internal-trait_by_group.csv.
$ head BM.internal-trait_by_group.csv
"trait","clade"
9.4,"Pholcinae"
12.7692307692308,"Pholcinae"
14.6923076923077,"Pholcinae"
15.8888888888889,"Pholcinae"
20.2307692307692,"Pholcinae"
19.9166666666667,"Pholcinae"
9.4,"Pholcinae"
7.35849056603773,"Pholcinae"
10.5454545454545,"Pholcinae"
Trait density plot
With the file BM.internal-trait_by_group.csv, we can have a look at the overall distribution of the traits. We can do this by drawing a density plot (see trait_density_plot/draw.trait_density_plot.R).
trait_file <- "../BM.internal-trait_by_group.csv"
ylab <- "Clade"
xlab <- expression(ratio == frac(leg~length, body~size)) # Customize x-axis title with ratio
# Define the colors for each clade (color-blind friendly palette)
clade_colors <- c(
"Smeringopinae" = "#0072B2", # Blue
"Pholcinae" = "#E69F00", # Orange
"Arteminae" = "#F0E442", # Yellow
"Modisiminae" = "#009E73", # Green
"Priscula" = "#D55E00", # Red
"Artema" = "#CC79A7", # Pink
"Ninetinae" = "#56B4E9", # Sky Blue
"Pholcidae" = "#984EA3" # Purple
)
## 3. trait data
trait_df <- read.csv(trait_file)
head(trait_df)
# Set the order of clades
trait_df$clade <- factor(trait_df$clade, levels = rev(c("Smeringopinae", "Pholcinae", "Arteminae", "Modisiminae", "Ninetinae", "Artema", "Priscula", "Pholcidae"))) # Specify the desired order
p <- ggplot(trait_df, aes(x = trait, y = clade, fill = clade)) +
geom_density_ridges(alpha = 0.8) +
scale_fill_manual(values = clade_colors) + # Set custom colors
theme_ridges() +
theme(
legend.position = "none", # Remove the legend
axis.title.x = element_text(size = 18, hjust = 0.5), # Set x-axis title size and center it
axis.title.y = element_text(size = 18, hjust = 0.5), # Set y-axis title size and center it
axis.text.x = element_text(size = 17), # Set x-axis text size
axis.text.y = element_text(size = 17), # Set y-axis text size
) +
labs(
x = xlab, # Customize x-axis title
y = ylab # Customize y-axis title
)
ggsave(filename = "trait_density_plot.pdf",
width = 6, height = 5,
plot = p)
Correlations between disparity and diversification rates
Task: Test for the correlations between trait disparity (measured as the standard variance of the trait values) and speciation/extinction/net diversification rates.
The 'rate' files have two versions: (1) Genus-specific sampling fractions are determined by estimated species richness; (2) Genus-specific sampling fractions are determined by currently known species. Please refer to the section 2.6.1 Subfolder 1.diversification.BAMM/. Therefore, we have the following R script files:
$ ls *r | grep trait1-trait2
Current.trait1-trait2-regression.diversification.r
Current.trait1-trait2-regression.extinction.r
Current.trait1-trait2-regression.speciation.r
Estimated.trait1-trait2-regression.diversification.r
Estimated.trait1-trait2-regression.extinction.r
Estimated.trait1-trait2-regression.speciation.r
Relationship between trait disparity and speciation rate
For instance, in the Estimated.trait1-trait2-regression.speciation.r file:
tree_file <- 'input.tre.monophyletic_subfamily.collapsed.tre'
rate_file <- '../../../../../1.diversification.BAMM/May-14-2024.Estimated/2.partitioned/treePL/ClipKIT_smart-gap/all_subfamilies_df.csv'
trait_file <- "BM.internal-trait_by_group.csv"
ylab <- "Standardized Standard Variance"
xlab <- "Standardized Speciation Rate"
# Define the colors for each clade (color-blind friendly palette)
clade_colors <- c(
"Smeringopinae" = "#0072B2", # Blue
"Pholcinae" = "#E69F00", # Orange
"Arteminae" = "#F0E442", # Yellow
"Modisiminae" = "#009E73", # Green
"Priscula" = "#D55E00", # Red
"Artema" = "#CC79A7", # Pink
"Ninetinae" = "#56B4E9", # Sky Blue
"Pholcidae" = "#984EA3" # Purple
)
# read the subfamily-level tree
subfam_tree <- read.tree(tree_file)
# read the rate file
rate_df <- read.csv(rate_file)
# it has below columns:
#'speciation_rate''extinction_rate''clade''diversification_rate'
# Calculate the mean speciation rate for each clade
rate_df_2 <- aggregate(speciation_rate ~ clade, data = rate_df, mean)
# read the trait file
trait_df <- read.csv(trait_file)
# Calculate the variance of traits within each clade
trait_variance_df <- aggregate(trait ~ clade, data = trait_df, var)
names(trait_variance_df)[2] <- "trait_variance"
trait_variance_df$standard_variance <- sqrt(trait_variance_df$trait_variance)
# now 'trait_variance_df' has three columns:
# clade trait_variance standard_variance
# Combine the trait variance and speciation rate dataframes by 'clade'
combined_df <- merge(trait_variance_df, rate_df_2, by = "clade")
# prepare trait data for correlation test
trait_1 <- combined_df[['standard_variance']]
trait_1 <- setNames(trait_1, combined_df[['clade']])
trait_1
spp <- names(trait_1)
trait_2 <- combined_df[['speciation_rate']]
trait_2 <- setNames(trait_2, combined_df[['clade']])
trait_2
# Standardize variables
trait_1 <- scale(trait_1)
trait_2 <- scale(trait_2)
# PGLS test
# BM model
corBM <- corBrownian(phy=subfam_tree,form=~spp)
pgls.fit.BM <- gls(trait_1~trait_2,correlation=corBM, method='ML')
pgls.fit.BM.summary <- summary(pgls.fit.BM)
# Extract coefficient and p-value for trait_2
coefficient_trait_2 <- pgls.fit.BM.summary$tTable[trait_2_row_index, "Value"]
p_value_trait_2 <- pgls.fit.BM.summary$tTable[trait_2_row_index, "p-value"]
...
## draw BM
pdf("Estimated.trait1_trait2.scatter.reg.PGLS.BM.std.speciation.pdf")
## set margins
par(mar = c(6.1, 6.1, 1.1, 1.1), mgp = c(4, 1, 0))
## graph scatterplot of contrasts
plot(trait_1~trait_2,
ylab=ylab,
xlab=xlab,
pch=21,
col="black", # Outline color
bg=clade_colors[spp], # Fill color based on clades
cex=2,
las=1,
cex.axis=1.5,
cex.lab=2,
bty="n")
text(trait_2, trait_1, labels = spp, cex = 1.5, pos = text_positions, xpd = FALSE)
# Add text for coefficient_trait_2 and p_value_trait_2
text(x = 0.02, y = -0.1, cex = 1.5, labels = paste("Slop :", round(coefficient_trait_2, 3), ", p-value:", round(p_value_trait_2, 3)), pos = 4, col = "#984EA3")
## add gridlines to the plot
abline(h=0,lty="dotted")
abline(v=0,lty="dotted")
## reset graphing limits of the plot to the
## x/y range of our PICs
clip(min(trait_2),max(trait_2), min(trait_1),max(trait_1))
# clip(x_min, x_max, y_min, y_max)
## graph our fitted line
# a, b : single values specifying the intercept and the slope of the line
abline(a = coef(pgls.fit.BM)[1], b = coef(pgls.fit.BM)[2],lwd=2,col="#984EA3")
dev.off()
A resulting file Estimated.trait1_trait2.scatter.reg.PGLS.BM.std.speciation.pdf will be generated, showing the relationship between speciation rate and standard variance ('disparity').
2.6.6 Subfolder 6.relationships/
Task: Using the QuaSSE/ES-Sim/PGLS method to test for the correlations between trait evolution and species diversification, among others.
This folder contains the following subfolders:
.
├── 2.trait-trait.correlation
└── 3.trait-dependent-diversification
2.6.6.1 2.trait-trait.correlation/
Task: Test for correlations between different traits using PGLS, and get trait distribution over microhabitats (leaf, space, and ground)
.
├── 1.body_size.vs.leg_length
├── 2.body_size.vs.leg_length.to.body_size.ratio
├── 3.body_size.vs.m_tibia_1.to.body_size.ratio
├── 4.leg_length.vs.m_tibia_1.to.body_size.ratio
├── 5.body_size.vs.raw-microhabitat
├── 6.body_size.vs.ratio-microhabitat
├── 7.leg_length.vs.raw-microhabitat
├── 8.leg_length.vs.ratio-microhabitat
├── 9.leg_length.to.body_size.ratio.vs.raw-microhabitat
├── 10.leg_length.to.body_size.ratio.vs.ratio-microhabitat
├── 11.m_tibia_1.to.body_size.ratio.vs.raw-microhabitat
├── 12.m_tibia_1.to.body_size.ratio.vs.ratio-microhabitat
├── 13.raw-microhabitat.vs.ratio-microhabitat
└── 14.species-richness_vs_trait_std
Here are several examples.
Allometric relationship between body size and leg length
$ cd 6.relationships/2.trait-trait.correlation/1.two-traits/1.body_size.vs.leg_length/2.partitioned/treePL/ClipKIT_smart-gap
The script file two-traits-corelation.R:
treefile <- "input.tre"
trait_file <- 'clean.trait1_trait2.txt'
ylab <- "log2(body size)"
xlab <- "log2(leg length)"
tree <- read.tree(file=treefile)
length(tree$tip.label)
tree
trait_df <- read.table(trait_file)
head(trait_df)
# prepare trait data
trait_1 <- trait_df[['V2']]
trait_1 <- setNames(trait_1, trait_df[['V1']])
head(trait_1)
trait_2 <- trait_df[['V3']]
trait_2 <- setNames(trait_2, trait_df[['V1']])
head(trait_2)
# ...
# ...
spp <- names(trait_1)
# BM model
corBM <- corBrownian(phy=tree,form=~spp)
corBM
pgls.fit.BM <- gls(trait_1~trait_2,correlation=corBM, method='ML')
pgls.fit.BM.summary <- summary(pgls.fit.BM)
pgls.fit.BM.summary
coef(pgls.fit.BM)
abs(coef(pic.fit)[1]-coef(pgls.fit.BM)[2])
summary(pic.fit)$coefficients[1,4]
summary(pgls.fit.BM)$tTable[2,4]
coef(pgls.fit.BM)
## draw BM
pdf("trait1_trait2.scatter.reg.PGLS.BM.pdf")
## set margins
par(mar=c(5.1,5.1,1.1,1.1))
## graph scatterplot of contrasts
plot(trait_1~trait_2,
ylab=ylab,
xlab=xlab,
pch=21,bg="gray",cex=1.2,las=1,
cex.axis=0.7,cex.lab=0.9,bty="n")
## add gridlines to the plot
abline(h=0,lty="dotted")
abline(v=0,lty="dotted")
## reset graphing limits of the plot to the
## x/y range of our PICs
clip(min(trait_2),max(trait_2), min(trait_1),max(trait_1))
## graph our fitted line
# a, b : single values specifying the intercept and the slope of the line
abline(a = coef(pgls.fit.BM)[1], b = coef(pgls.fit.BM)[2],lwd=2,col="red")
dev.off()
For better visualization, we can color the data points by subfamily, which can be done with the script 6.relationships/2.trait-trait.correlation/1.two-traits/1.body_size.vs.leg_length/2.partitioned/treePL/ClipKIT_smart-gap/coloring_data_points/PGLS_by_subfam.v2.1.R.
Relationship between trait disparity and species richness
$ cd 6.relationships/2.trait-trait.correlation/1.two-traits/14.species-richness_vs_trait_std/3.leg_length.to.body_size.ratio/ClipKIT_smart-gap/
We test the relationship with both estimated species richness and currently known species richness:
.
├── 1.Estimate
└── 2.Current-And-Undescribed
In the file, for instance, we have:
# smart-gap
tree_file <- "/path/to/5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/input.tre.monophyletic_subfamily.collapsed.tre"
trait_1_file <- "/path/to/5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/BM.internal-trait_by_group.csv"
trait_2_file <- "../Estimated.subfam.count.tsv"
ylab <- "Standardized Standard Variance"
xlab <- "Standardized Species richness"
# Define the colors for each clade (color-blind friendly palette)
clade_colors <- c(
"Smeringopinae" = "#0072B2", # Blue
"Pholcinae" = "#E69F00", # Orange
"Arteminae" = "#F0E442", # Yellow
"Modisiminae" = "#009E73", # Green
"Priscula" = "#D55E00", # Red
"Artema" = "#CC79A7", # Pink
"Ninetinae" = "#56B4E9", # Sky Blue
"Pholcidae" = "#984EA3" # Purple
)
## 1. subfamily-level tree
subfam_tree <- read.tree(tree_file)
subfam_tree
str(subfam_tree)
## 3. trait data
trait_1_df <- read.csv(trait_1_file)
head(trait_1_df)
## 3.1 Calculate the variance of traits within each clade
trait_1_variance_df <- aggregate(trait ~ clade, data = trait_1_df, var)
names(trait_1_variance_df)[2] <- "trait_variance"
trait_1_variance_df$standard_variance <- sqrt(trait_1_variance_df$trait_variance)
trait_1_variance_df
# now it has three columns:
# clade trait_variance standard_variance
trait_2_df <- read.table(trait_2_file, col.names = c("clade", "species_num"))
trait_2_df
combined_df <- merge(trait_1_variance_df, trait_2_df, by = "clade")
combined_df
### prepare trait data
trait_1 <- combined_df[['standard_variance']]
trait_1 <- setNames(trait_1, combined_df[['clade']])
trait_1
spp <- names(trait_1)
trait_2 <- combined_df[['species_num']]
trait_2 <- setNames(trait_2, trait_2_df[['clade']])
trait_2
# Standardize variables
trait_1 <- scale(trait_1)
trait_2 <- scale(trait_2)
# BM model
corBM <- corBrownian(phy=subfam_tree,form=~spp)
corBM
pgls.fit.BM <- gls(trait_1~trait_2,correlation=corBM, method='ML')
pgls.fit.BM.summary <- summary(pgls.fit.BM)
pgls.fit.BM.summary
# Find the row index for "trait_2"
trait_2_row_index <- which(rownames(pgls.fit.BM.summary$tTable) == "trait_2")
# Extract coefficient and p-value for trait_2
coefficient_trait_2 <- pgls.fit.BM.summary$tTable[trait_2_row_index, "Value"]
coefficient_trait_2
p_value_trait_2 <- pgls.fit.BM.summary$tTable[trait_2_row_index, "p-value"]
p_value_trait_2
# Example vector specifying text positions for each point
# 1 = below, 2 = left, 3 = above, 4 = right
text_positions <- rep(4, length(trait_1)) # Initialize all positions to 4 (right)
# Adjust positions as needed based on data
# Here you can customize this vector based on specific conditions
for (i in 1:length(trait_1)) {
if (spp[i] == "Modisiminae" || spp[i] == "Pholcinae") {
text_positions[i] = 2
}
}
## draw BM
pdf("Estimated.trait1_trait2.scatter.reg.PGLS.BM.std.pdf")
## set margins
par(mar = c(6.1, 6.1, 1.1, 1.1), mgp = c(4, 1, 0))
## graph scatterplot of contrasts
plot(trait_1~trait_2,
ylab=ylab,
xlab=xlab,
pch=21,
col="black", # Outline color
bg=clade_colors[spp], # Fill color based on clades
cex=2,
las=1,
cex.axis=1.5,
cex.lab=2,
bty="n")
text(trait_2, trait_1, labels = spp, cex = 1.5, pos = text_positions, xpd = FALSE)
# Add text for coefficient_trait_2 and p_value_trait_2
text(x = 0.02, y = -0.1, cex = 1.5, labels = paste("Slop :", round(coefficient_trait_2, 3), ", p-value:", round(p_value_trait_2, 3)), pos = 4, col = "#984EA3")
## add gridlines to the plot
abline(h=0,lty="dotted")
abline(v=0,lty="dotted")
## reset graphing limits of the plot to the
## x/y range of our PICs
clip(min(trait_2),max(trait_2), min(trait_1),max(trait_1))
# clip(x_min, x_max, y_min, y_max)
## graph our fitted line
# a, b : single values specifying the intercept and the slope of the line
abline(a = coef(pgls.fit.BM)[1], b = coef(pgls.fit.BM)[2],lwd=2,col="#984EA3")
dev.off()
Leg length distribution over microhabitats:
$ cd 6.relationships/2.trait-trait.correlation/1.two-traits/7.leg_length.vs.raw-microhabitat/2.partitioned/treePL/ClipKIT_smart-gap
pdf("trait-distribution-by-microhabitat.pdf")
cols <- c("#F76D5E", "#FFFFBF", "#72D8FF")
ggplot() +
geom_density(data=trait_df, aes(x = trait_1), colour = "red", linetype = "dashed") +
geom_density(data=trait_df, aes(x = trait_1, fill = microhabitat), alpha = 0.6) +
scale_fill_manual(values = cols) +
xlab(trait_1_name)
dev.off()
2.6.6.2 3.trait-dependent-diversification/
Task: Test for correlations between trait evolution and species diversification.
I will explain this with relative leg length-dependent diversification as an example:
$ cd 6.relationships/3.trait-dependent-diversification/1.one-trait-dependent/5.leg_length.to.body_size.ratio.vs.divrsification/2.partitioned/treePL/ClipKIT_smart-gap
QuaSSE
The analysis here basically follows the following tutorial
- Analysing diversification with diversitree (Rich FitzJohn, with GeoSSE by Emma Goldberg; https://www.zoology.ubc.ca/prog/diversitree/doc/diversitree-tutorial.pdf
The QuaSSE (Quantitative State Speciation and Extinction) model in the R package diversitree (v0.9-20) (FitzJohn 2010), which simultaneously analyzes the relationship between changes in continuous traits and diversification and can suffer from model misspecification and phylogenetic pseudoreplication (Harvey and Rabosky 2018).
A global sampling fraction was applied, since the model does not support clade-specific sampling fractions.
We fitted the speciation and extinction rates as constant (1.nodrift.mle.c.R), linear (*.mle.l.R), sigmoidal (*.mle.s.R), and hump-shaped (*.mle.h.R) functions of the traits, respectively.
We also tested the models without the drift parameter (i.e., assuming a constant mean of trait values through time).
Finally, we performed a likelihood ratio test to compare the model fits.
$ ls -l *R | awk '{print $NF}'
1.nodrift.mle.c.R
2.1.mle.s.R
2.2.mle.l.R
2.3.mle.h.R
3.1.mle.s.R
3.2.mle.l.R
3.3.mle.h.R
4.1.mle.s.R
4.2.mle.l.R
4.3.mle.h.R
The 1.nodrift.mle.c.R assumes speciation ('lambda') and extinction ('mu') rates are a constant function of trait (f.c <- make.spiders(constant.x, constant.x)).
The 2.1.mle.s.R assumes the speciation rate is a sigmoidal function of a trait, while the extinction rate is a constant function of a trait (f.s <- make.spiders(sigmoid.x, constant.x)), combined with the presence or absence of the drift parameter:
mle.s <- find.mle(nodrift(f.s), p.s, control=control, verbose=0)
mle.d.s <- find.mle(f.s, coef(mle.s, TRUE), control=control, verbose=0)
The 3.1.mle.s.R assumes the extinction rate is a sigmoidal function of a trait, while the speciation rate is a constant function of a trait (f2.s <- make.spiders(constant.x, sigmoid.x)), combined with the presence or absence of the drift parameter:
mle2.s <- find.mle(nodrift(f2.s), p2.s, control=control, verbose=0)
mle2.d.s <- find.mle(f2.s, coef(mle2.s, TRUE), control=control, verbose=0)
The 4.1.mle.s.R assumes both speciation rate and extinction rate are a sigmoidal function of a trait (f3.s <- make.spiders(sigmoid.x, sigmoid.x)), combined with the presence or absence of the drift parameter:
mle3.s <- find.mle(nodrift(f3.s), p3.s, control=control, verbose=0)
mle3.d.s <- find.mle(f3.s, coef(mle3.s, TRUE), control=control, verbose=0)
Note that the estimated parameter values from a previous step serve as the initial values of the subsequent steps (i.e., 1.nodrift.mle.c.R -> 2.1.mle.s.R -> 3.1.mle.s.R -> 4.1.mle.s.R). Each step can take a very long time to run. This is why I save the results from the previous step into an Rdata file (e.g., save.image(file='until.nodrift.mle.c.Rdata')) and load them again at the subsequent step (e.g., load('until.nodrift.mle.c.Rdata')).
Finally, model comparison is performed with the anova() function in R. Please refer to the Jupyter Notebook single_partition.annova.ipynb for details.
ES-Sim test
The tip-rate correlation ES-sim test, which does not require a fully parameterized model of diversification and trait evolution, and directly examines the correlation between trait variation across the tips of a phylogeny and tip-specific speciation rate estimates, and is thus believed to have rare false inferences (Harvey and Rabosky 2018).
We used 20,000 simulations to build the null distribution of trait-speciation associations for significance testing:
es_result <- essim(tree, trait_vector, nsim = 20000)
Please refer to https://github.com/mgharvey/ES-sim for more details.
2.6.7 Subfolder 7.disparity.28-05-2024/
Task: Draw disparity through time plots with the function dtt() for each subfamily and each trait.
For instance,
$ cd 7.disparity.28-05-2024/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap
Rscript ../../../../subfamily_disparity.r
For better visualization, I also save the related data into a file (see the file subfamily_disparity.r):
save.image(file='my_subfamily_disparity.RData')
With the data file my_subfamily_disparity.RData, we redraw the disparity through time plot using the Jupyter Notebook file redraw-subfam-disparity.ipynb. The resulting files are all_dtt_plots.colored.pdf and all_dtt_plots.r2.colored.pdf, where different subfamilies are colored with different colors.
2.6.8 Subfolder 8.linage_through_time/
Task: Draw a Lineage through time plot for the Pholcidae family using the function ltt.plot().
For instance, use the Jupyter Notebook file 6_comparative_analysis/8.linage_through_time/2.partitioned/treePL/ClipKIT_smart-gap/llt_plot.ipynb.
2.7 Folder 7_UCE_and_six-gene_phylogeny/
Task: Perform phylogenetic analysis with UCE+Sanger data.
Result file: combined_UCE+Sanger_trees.pdf
Main conclusion: The phylogenetic trees from UCE+Sanger data are highly consistent with our six-gene trees, and all misplacements of taxa are found to be represented by CO1 only; thus, our conclusions on the correlations among evolutionary flexibility and other variables are robust.
In this section, we performed:
- unpartitioned analysis based on ALL UCE loci + Sanger data, using IQ-TREE with the optimal model determined by ModelFinder (Kalyaanamoorthy et al. 2017) in IQ-TREE automatically.
- partitioned anlaysis based on 55 UCE loci + Sanger data. The 55 UCEs were selected from matrix 3, and they have the most parsimonious sites. The tree was constructed with RAxML-NG and the model GTR+R4+I+FO.
You can find the commands and alignments, resulting tree files (*.treefile) at the corresponding folders.
./
├── 1_unpartitioned
│ ├── matrix_3.gappy
│ │ ├── FcC_supermatrix.phy
│ │ ├── iqtree.sh
│ │ ├── m3_unpartitioned_gappy.longName.tre
│ │ └── m3_unpartitioned_gappy.treefile
│ ├── matrix_3.kpic-smart-gap
│ │ ├── FcC_supermatrix.phy
│ │ ├── iqtree.sh
│ │ ├── m3_unpartitioned_kpic-smart-gap.longName.tre
│ │ └── m3_unpartitioned_kpic-smart-gap.treefile
│ └── matrix_3.smart-gap
│ ├── FcC_supermatrix.phy
│ ├── iqtree.sh
│ ├── m3_unpartitioned_smart-gap.longName.tre
│ └── m3_unpartitioned_smart-gap.treefile
├── 2_partitioned
│ ├── matrix_3.55UCEs.gappy
│ │ ├── 55UCE-Sanger.partitioned.RAxML-NG.gappy.longName.tre
│ │ ├── 55UCE-Sanger.partitioned.RAxML-NG.gappy.tree
│ │ ├── merged.m3.six-gene.FcC_supermatrix_partition.txt
│ │ ├── merged.m3.six-gene.FcC_supermatrix.phy
│ │ └── raxml.sh
│ ├── matrix_3.55UCEs.kpic-smart-gap
│ │ ├── 55UCE-Sanger.partitioned.RAxML-NG.kpic-smart-gap.longName.tre
│ │ ├── 55UCE-Sanger.partitioned.RAxML-NG.kpic-smart-gap.tree
│ │ ├── merged.m3.six-gene.FcC_supermatrix_partition.txt
│ │ ├── merged.m3.six-gene.FcC_supermatrix.phy
│ │ └── raxml.sh
│ └── matrix_3.55UCEs.smart-gap
│ ├── 55UCE-Sanger.partitioned.RAxML-NG.smart-gap.longName.tre
│ ├── 55UCE-Sanger.partitioned.RAxML-NG.smart-gap.tree
│ ├── merged.m3.six-gene.FcC_supermatrix_partition.txt
│ ├── merged.m3.six-gene.FcC_supermatrix.phy
│ └── raxml.sh
├── April-9-2024.code.subfam.tsv
├── combined_UCE+Sanger_trees.pdf
└── combined.code.subfam.tsv
April-9-2024.code.subfam.tsv: relationships between the sample codes and long namescombined.code.subfam.tsv: indicating to which subfamily a sample code belongs.
