The complex interplay between evolutionary flexibility and diversification in a family of spiders

Meng, Guanliang 1 ; Podsiadlowski, Lars1; Dimitrov, Dimitar2; Huber, Bernhard A.1

Published Oct 27, 2025 on Dryad. https://doi.org/10.5061/dryad.r2280gbnh

Data files

Oct 27, 2025 version files 685.91 MB

data_and_scripts.zip

661.12 MB
README.md

61.37 KB
README.pdf

325.46 KB
Supplementary_Data_2.zip

1.26 MB
Supplementary_Tables.zip

415.42 KB
Supplementary_Text_and_Figures.pdf

22.73 MB

Abstract

Understanding the mechanisms of how morphological traits drive speciation and contribute to species richness is pivotal in evolutionary biology. In this context, the evolutionary flexibility of morphological traits may play a significant role. Using the diverse daddy long-legs spiders, Pholcidae, which currently includes almost 2,000 described species, we explore the interplay between speciation rate, trait evolution rate, microhabitat shift rate, species richness, interspecific variability of body size, leg length, relative leg length, and leg proportions. We apply a combination of large-scale genomic and taxonomic sampling, and phylogenetic and comparative analyses to assess the dynamics of diversification and evolutionary flexibility (measured as either the standard variance or disparity of traits), as well as their interactions. We found that increased evolutionary flexibility is accompanied by accelerated rates in speciation and trait evolution, and by higher species richness. We also observed near-isometry of leg length and body size in the species-rich lineages, suggesting stabilizing selection, while positive allometry in the species-poor lineages indicates directional selection. Additionally, we found a positive correlation between trait evolution rates and microhabitat shifts. We argue that environmental heterogeneity and frequent microhabitat shifts may contribute to the origin and maintenance of evolutionary flexibility, which in turn influences the organisms’ ability to exploit new resources and habitats. On the other hand, our study suggests that a lack of evolutionary flexibility, e.g., due to dwarfism or gigantism, could be an evolutionary dead end. Our findings also support the idea that taxa do not always follow Cope’s rule, which postulates that body size increases with lineage age. This study enhances our understanding of the mechanisms of diversification, demonstrating that the evolutionary flexibility of morphological traits may vary among closely related taxa and is likely to have contributed to the uneven distribution of biodiversity across the tree of life.

URL: https://doi.org/10.5061/dryad.r2280gbnh

Authors:

Guanliang Meng¹, Lars Podsiadlowski¹, Dimitar Dimitrov², and Bernhard A. Huber¹

1 Zoological Research Museum Alexander Koenig, LIB, Bonn, Germany

2 University Museum of Bergen, University of Bergen, Bergen, Norway

Corresponding author: Guanliang Meng (G.Meng@leibniz-lib.de); Dimitar Dimitrov (dimitard.gwu@gmail.com)

ORCiD:

Guanliang Meng: https://orcid.org/0000-0002-6488-1527

Lars Podsiadlowski: https://orcid.org/0000-0001-7786-8930

Dimitar Dimitrov: https://orcid.org/0000-0001-5830-5702

Bernhard A. Huber: https://orcid.org/0000-0002-7566-5424

1. SUPPLEMENTARY MATERIAL

1. Supplementary Text and Figures: Supplementary_Text_and_Figures.pdf

2. Supplementary Tables: Supplementary_Tables.zip

3. Data (e.g., sequences, alignments) and scripts for target capture sequencing and comparative analysis: data_and_scripts.zip

4. UCE+Sanger-data-based trees: Supplementary_Data_2.zip

5. PDF version of the README file: README.pdf.

2. Further details on the file `data_and_scripts.zip`

This file contains the data, scripts, and commands used by the present study.

Folder structure:

.
├── 1_script
│   └── FASconCAT-G_v1.05.pl
├── 2_phyluce
│   ├── 1_raw_data
│   ├── 2_clean_data
│   ├── 3_assembly
│   ├── 4_detect_cross_contamination
│   ├── 5_match_contigs_to_probes
│   └── 6_extract_uce_loci
├── 3_UCE_phylogeny
│   ├── 1_trimmed_msa
│   ├── 2_tree_construction
│   └── 3_dating
├── 4_long-name-short-name-rela
│   ├── June-30-2024.six-genes-samples.short2long.tsv
│   └── June-30-2024.UCE-samples.short2long.tsv
├── 5_six_gene_phylogeny
│   ├── 1_trimmed_msa
│   ├── 2_tree_construction
│   └── 3_dating
├── 6_comparative_analysis
│   ├── 0.dated_trees
│   ├── 1.diversification.BAMM
│   ├── 2.trait_evolution.BAMM.subfam-level-sampling-fraction
│   ├── 3.model_fitting
│   ├── 4.phylogenetic_signal
│   ├── 5.ancestral_state
│   ├── 6.relationships
│   ├── 7.disparity.28-05-2024
│   ├── 8.linage_through_time
│   ├── readme.txt
│   └── Trait data for Meng et al 2025.xlsx
├── 7_UCE_and_six-gene_phylogeny
│   ├── 1_unpartitioned
│   ├── 2_partitioned
│   ├── April-9-2024.code.subfam.tsv
│   ├── combined_UCE+Sanger_trees.pdf
│   └── combined.code.subfam.tsv
└── readme.txt

2.1 Folder `1_script/`

This folder contains the Perl script FASconCAT-G_v1.05.pl, which was from the publication:

Kück, P., Longo, G.C. FASconCAT-G: extensive functions for multiple sequence alignment preparations concerning phylogenetic studies. Front Zool 11, 81 (2014). https://doi.org/10.1186/s12983-014-0081-x

2.2 Folder `2_phyluce/`

This folder contains the commands for generating ultra-conserved elements (UCEs) from raw sequencing data using the Phyluce pipeline (Faircloth 2016).

This folder has the following subfolders:

├──1_raw_data
├──2_clean_data
├──3_assembly
├──4_detect_cross_contamination
├──5_match_contigs_to_probes
└──6_extract_uce_loci

2.2.1 Subfolder `1_raw_data/`

Please download the raw data from the NCBI SRA database (BioProject No: PRJNA851037), see Supplementary Table S1 for more details.

2.2.2 Subfolder `2_clean_data/`

Task: Raw data filtering.

This folder contains the command for running fastp (v0.23.1) (Chen et al. 2018), which is to filter raw data from the folder 1_raw_data and generate clean data.

Briefly, reads would be filtered out if: (1) one read contained more than ten Ns; (2) more than 40% of bases were unqualified; (3) reads were duplicates; (4) reads were shorter than 40 bp.

2.2.3 Subfolder `3_assembly/`

Task: UCE contig assembly.

Description: This folder contains the command for running the executable phyluce_assembly_assemblo_spades from the Phyluce pipeline (Faircloth 2016). The result is a contig file for each sample.

Default parameters were used except for --cores 16 --memory 100 (i.e., 16 CPUs and a maximum of 100 GB RAM would be applied)

The file assembly.conf contains the "code" (a short name for each sample) and the path to the folder containing fastq 1 and fastq 2 files of each sample. Users should adjust this accordingly.

When the assembly for all samples is finished, we may copy all assembled contig files into the same directory (e.g., spades-assemblies/contigs)

2.2.4 Subfolder `4_detect_cross_contamination/`

Task: Filter out potential cross-sample contaminants using Blast.

To this end, we firstly make each sequence ID unique by adding the sample "code" to the sequence ID (see the folder 1_add_sampleID_to_seq-title/), before merging all sequences of all samples into a single file (2_all-blast-all/all_assemblies.fa).

Next, the 2_all-blast-all/all_assemblies.fa file was blasted against itself using the blastn command with default parameters. Blast results with "query" and "subject" coming from the same sample are then excluded. Blast results with alignment similarity >= 99% and alignment length >= 100 bp are considered as potential contamination. However, they could also be highly conserved elements such as rDNA genes.

Finally, we used the following criteria to keep or remove potential contaminants (see the folder 3_filter_based_on_abundance/): if the abundance difference (output by SPAdes) between the query and subject sequences was less than 10X, both sequences were marked as 'removed'. If both sequences had abundances ≥ 20X and their abundance difference exceeded 10X, the sequence with lower abundance was marked as 'removed'. Any sequences marked as 'removed' were excluded from the assembly.

All the clean sequences are presented in the folder 4_clean_assemblies/.

2.2.5 Subfolder `5_match_contigs_to_probes/`

Task: Match the assembled contigs to the probe set (see the fileSpiderUCE_probe_set_men13099-sup-0002-Supinfo.fasta) using the command phyluce_assembly_match_contigs_to_probes from the Phyluce pipeline.

The probe set is provided by the following publication:

Kulkarni, S., Wood, H., Lloyd, M., and Hormiga, G. (2020), Spider-specific probe set for ultraconserved elements offers new perspectives on the evolutionary history of spiders (Arachnida, Araneae), Mol. Ecol. Resour., 20 (1), 185-203.

2.2.6 Subfolder `6_extract_uce_loci/`

Task: Extact homologous UCE among species using the command phyluce_assembly_get_fastas_from_match_counts from the Phyluce pipeline.

Matrices 1 to 10 with various gene and taxon coverages are created, with the help of PhyloMAd, BMGE, and RY-coding. They are placed in the folder ../3_UCE_phylogeny/1_trimmed_msa/.

2.3 Folder `3_UCE_phylogeny/`

Task: Phylogenetic inference using UCE datasets.

This folder has the following subfolders:

├──1_trimmed_msa
├──2_tree_construction
└──3_dating

2.3.1 Subfolder `1_trimmed_msa/`

This folder contains the sequences of Matrices 1 to 10. Each matrix has its own folder.

├── matrix_1
├── matrix_2
├── matrix_3
├── matrix_4
├── matrix_5
├── matrix_6
├── matrix_7
├── matrix_8
├── matrix_9
└── matrix_10

To create a concatenated alignment for each matrix, users should enter each directory and run the following command, for instance,

# concatenate the alignments to Phylip format
cd 3_UCE_phylogeny/1_trimmed_msa/matrix_3
perl ../../../1_script/FASconCAT-G_v1.05.pl -s -p -p -n -l

The files named FcC_supermatrix.phy and FcC_supermatrix_partition.txt will then be created, which can be used for subsequent tree construction.

2.3.2 Subfolder `2_tree_construction/`

This folder contains tree construction commands and the resulting trees for each matrix. Each matrix has its own folder:

.
├── matrix_1
│   └── 1_Partitioned_ML.IQ-TREE
│       ├── iqtree.sh
│       └── Matrix-1.gblocks.IQ-TREE.partitioned.tre
├── matrix_10
│   └── 1_unPartitioned_ML.IQ-TREE
│       ├── iqtree.sh
│       └── Matrix-10.IQ-TREE.unpartitioned.tre
├── matrix_2
│   └── 1_Partitioned_ML.IQ-TREE
│       ├── iqtree.sh
│       └── Matrix-2.gblocks.IQ-TREE.partitioned.tre
├── matrix_3
│   ├── 1_Partitioned_ML.IQ-TREE
│   │   ├── iqtree.sh
│   │   └── Matrix-3.TrimAl.IQ-TREE.partitioned.clean-support.tre
│   ├── 2_Partitioned_ML.RAxML-NG
│   │   ├── Matrix-3.TrimAl.RAxML-NG.partitioned.tre
│   │   └── raxml.sh
│   ├── 3_PhyloBayes-MPI
│   │   ├── check
│   │   │   └── record.sh
│   │   ├── Matrix-3.TrimAl.phylogbayes-MPI.unpartitioned.tre
│   │   ├── run1
│   │   │   ├── b.sh
│   │   │   └── sub.sh
│   │   └── run2
│   │       ├── b.sh
│   │       └── sub.sh
│   ├── 4_ASTRAL
│   │   ├── 1.locus_tree
│   │   │   └── batch_tree.sh
│   │   └── 2_species_tree
│   │       ├── astral.sh
│   │       └── Matrix-3.TrimAl.Astral.UFBS30.clean-support.tre
│   ├── 5_SVDquartets
│   │   ├── 1_individual_msa
│   │   ├── Matrix-3.TrimAl.SVDquartets.unpartitioned.tre
│   │   ├── record.sh
│   │   └── run1
│   │       ├── cmd1.txt
│   │       └── run1.sh
│   └── 6_Constrained_partitioned
│       ├── monophyletic_other_Arteminae_plus_Artema
│       │   ├── Arteminae.tre
│       │   └── multi-partition.iqtree.sh
│       └── Ninetinae_sister_to_all_other_pholcids
│           ├── multi-partition.iqtree.sh
│           └── Ninetaine_sister_to_all_others.tre
├── matrix_4
│   └── 1_Partitioned_ML.IQ-TREE
│       ├── iqtree.sh
│       └── Matrix-4.TrimAl.IQ-TREE.partitioned.tre
├── matrix_5
│   └── 1_Partitioned_ML.IQ-TREE
│       ├── iqtree.sh
│       └── Matrix-5.TrimAl.IQ-TREE.partitioned.tre
├── matrix_6
│   └── 1_Partitioned_ML.IQ-TREE
│       ├── iqtree.sh
│       └── Matrix-6.TrimAl.IQ-TREE.partitioned.tre
├── matrix_7
│   └── 1_Partitioned_ML.IQ-TREE
│       ├── iqtree.sh
│       └── Matrix-7.IQ-TREE.partitioned.tre
├── matrix_8
│   └── 1_Partitioned_ML.IQ-TREE
│       ├── iqtree.sh
│       └── Matrix-8.IQ-TREE.partitioned.tre
└── matrix_9
    └── 1_unPartitioned_ML.IQ-TREE
        ├── iqtree.sh
        └── Matrix-9.IQ-TREE.unpartitioned.tre

Example

I will explain the tree construction steps using matrix_3/ as an example.

The folder matrix_3/ has the following content:

├── 1_Partitioned_ML.IQ-TREE			
├── 2_Partitioned_ML.RAxML-NG
├── 3_PhyloBayes-MPI
├── 4_ASTRAL
├── 5_SVDquartets
└── 6_Constrained_partitioned

IQ-TREE

In the folder 1_Partitioned_ML.IQ-TREE/, we perform partitioned analysis using the maximum likelihood (ML) method IQ-TREE:

# Partitioned ML analysis with IQ-TREE

aln=../../../1_trimmed_msa/matrix_3/FcC_supermatrix.phy
partition=../../../1_trimmed_msa/matrix_3/FcC_supermatrix_partition.txt

/home/gmeng/soft/iqtree-2.2.0-Linux/bin/iqtree2 \
-s $aln \
-p $partition \
--prefix m3_partitioned \
-m MFP+MERGE \
--sampling GENESITE \
-pers 0.2 -nstop 500 -B 2000 -bnni --alrt 2000  -T 24 --runs 10 >i.log 2>i.err

Best-fit models of molecular evolution were determined by ModelFinder (Kalyaanamoorthy et al. 2017) in IQ-TREE automatically (-m MFP+MERGE). To overcome local optima during heuristics, we performed ten independent runs on each dataset, with a smaller perturbation strength (-pers 0.2) and a larger number of stop iterations (-nstop 500) (Nguyen et al. 2015; Liu et al. 2024). Branch supports were evaluated with 2000 ultrafast bootstrap (UFBoot) (Minh et al. 2013) considering potential model violations (-B 2000 -bnni --sampling GENESITE). The SH-aLRT branch test (Guindon et al. 2010) was performed using 2,000 bootstrap replicates.

RAxML-NG

In the folder 2_Partitioned_ML.RAxML-NG/, we perform ML analysis using RAxML-NG.

aln=../../../1_trimmed_msa/matrix_3/FcC_supermatrix.phy
partition=../../../1_trimmed_msa/matrix_3/FcC_supermatrix_partition.txt

sed 's#DNA#GTR+R4+I+FO#g' $partition > partition.txt

# RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.

/home/gmeng/.conda/envs/mybase/bin/raxml-ng \
--all \
--threads 8 \
--msa $aln \
--model partition.txt \
--brlen scaled \
--bs-trees autoMRE \
--tree pars{25},rand{25} \
--seed 2 \
--prefix T1

During each RAxML-NG run, the GTR+R4+I+FO model, and 25 parsimony starting trees + 25 random starting trees were used, and the number of bootstrap replicates was automatically determined (--bs-trees autoMRE) (Pattengale et al. 2009).

PhyloBayes-MPI

In the folder 3_PhyloBayes-MPI/, to model both rate-heterogeneity across sites and pattern-heterogeneity across sites (i.e., across-site compositional heterogeneity), we applied the CAT-GTR+G4 model (Lartillot and Philippe 2004) in PhyloBayes-MPI. It is less sensitive to systematic errors than classical empirical models (Lartillot et al. 2013).

module load phylobayes_mpi/1.8

export OMPI_MCA_opal_cuda_support=false

aln=../../../../1_trimmed_msa/matrix_3/FcC_supermatrix.phy

mpirun -np $NSLOTS pb_mpi -d $aln -cat -gtr -dgam 4 T1

Note: At least two runs (T1, T2) should be performed, and MCMC chain convergence must be examined:

module load phylobayes_mpi/1.8

export OMPI_MCA_opal_cuda_support=false

tracecomp -x 20000 T1.trace T2.trace > tracecomp.log 2>&1

bpcomp -x 20000 T1.treelist T2.treelist > bpcomp.log

Please check the manual of PhyloBayes-MPI to learn how to determine the convergence of MCMC chains.

ASTRAL

Unlike the previous concatenation-based method, to account for incomplete lineage sorting (ILS) and mitigate the impact of missing data, we perform two-step multi-species coalescence (MSC) tree inference using ASTRAL in the folder 4_ASTRAL/. During the process, gene trees are first inferred using IQ-TREE:

cd 3_UCE_phylogeny/2_tree_construction/matrix_3/4_ASTRAL/1.locus_tree

ls ../../../../1_trimmed_msa/matrix_3/*fasta | while read f ; do a=$(basename $f .fasta) ; mkdir -p $a ; echo "/home/gmeng/soft/iqtree-2.2.0-Linux/bin/iqtree2 -s ../$f --prefix $a -pers 0.2 -nstop 500 -B 2000 -bnni -T 2  >i.log 2>i.err" >$a/loci.iqtree.sh ; cd $a ; sh loci.iqtree.sh ; cd ../ ; done

The internal nodes with low ultra-fast bootstrap supports (<= 30) of gene trees are collapsed, and after which, a species tree is summarized:

cd 3_UCE_phylogeny/2_tree_construction/matrix_3/4_ASTRAL/2_species_tree

# cat all genes
cat  ../1.locus_tree/*/*.treefile >gene_trees.txt

nw_ed gene_trees.txt 'i & b<=30' o > gene_trees.UFBS30.txt

java_lib=/home/gmeng/soft/ASTRAL-single-thread-v5.7.8/Astral/
jar=/home/gmeng/soft/ASTRAL-single-thread-v5.7.8/Astral/astral.5.7.8.jar

java \
-Djava.library.path=$java_lib  \
-jar $jar \
-i gene_trees.UFBS30.txt \
-o species.UFBS30.tre \
2>species.UFBS30.tre.err

SVDquartets

The ASTRAL method is sensitive to gene tree error, for instance, due to the few informative sites in the short alignments. Therefore, we also tested for the SVDquartets, which infers a species tree based on site patterns based on MSC.

The parameters file cmd1.txt:

log file=q1.log;
svdq evalQuartets=all seed=135454568 treeInf=QFM bootstrap=multilocus loci=combined nreps=200 nthreads=24 treemodel=mscoalescent  treefile=concat.SVDquartets.bs.tre replace=yes ambigs=missing plus2=yes ;
saveTrees file=SVDquartets.run3.species.tre supportValues=nodeLabels ;

Run the analysis:

/home/gmeng/soft/bin/paup4a168_centos64 ../concat_msa_nexus/concat_msa_nexus.loci-renamed.nexus < cmd1.txt

You may test different seeds and check if the topologies of the species trees are consistent.

Constrained tree search and topology test

We tested two hypotheses: (1) the traditional Arteminae (i.e., ‘other Arteminae’ + Artema) is a monophyletic group. (2) Ninetinae is sister to all other pholcids.

.
├── monophyletic_other_Arteminae_plus_Artema
│   ├── Arteminae.tre
│   └── multi-partition.iqtree.sh
└── Ninetinae_sister_to_all_other_pholcids
    ├── multi-partition.iqtree.sh
    └── Ninetaine_sister_to_all_others.tre

2.3.3 Subfolder `3_UCE_phylogeny/3_dating/`

Task: Divergence time calibration using MCMCTREE.

We followed Magalhaes et al. (2020) for the justifications of the fossil assignments; see Supplementary Table S3 for details.

The oldest representative fossil of each clade was used to calibrate its divergence from the sister group (stem-dating), following Benton and Donoghue (2006) and Magalhaes et al. (2020). The soft-bound strategy was employed. We set the age bounds of the root (Synspermiata + Hypochilidae) as follows: the minimum bound (164 million years ago, Ma) reflects the placement of the genus Eoplectreurys as a stem-Synspermiata, based on the oldest known fossil species, E. gertschi Selden & Huang (Gao and Ren 2006; Magalhaes et al. 2020); the conservative maximum bound (400 Ma) was chosen because no Araneae stem fossils older than 383 Ma are known (Magalhaes et al. 2020).

Two independent runs are performed, and the convergence of MCMC chains is examined.

For learning how to perform divergence time calibration using MCMCTREE, please refer to:

Dos Reis M, Yang Z. Bayesian Molecular Clock Dating Using Genome-Scale Datasets. Methods Mol Biol. 2019;1910:309-330. doi: 10.1007/978-1-4939-9074-0_10. PMID: 31278669.
https://github.com/mariodosreis/divtime

2.4 Folder `4_long-name-short-nam/`

Description: These two files below indicate the relationships between the sample "code" and long sample names.

.
├── June-30-2024.six-genes-samples.short2long.tsv
└── June-30-2024.UCE-samples.short2long.tsv

2.5 Folder `5_six_gene_phylogeny/`

Task: Phylogenetic tree inference using the six-gene matrices + divergence time calibration.

This folder has the following content:

.
├── 1_trimmed_msa
├── 2_tree_construction
└── 3_dating

2.5.1 Subfolder `1_trimmed_msa`/

This folder contains alignments derived from 10 different trimming strategies:

.
├── trimmed_with_ClipKIT_gappy
├── trimmed_with_ClipKIT_kpi
├── trimmed_with_ClipKIT_kpi-gappy
├── trimmed_with_ClipKIT_kpi-smart-gap
├── trimmed_with_ClipKIT_kpic
├── trimmed_with_ClipKIT_kpic-gappy
├── trimmed_with_ClipKIT_kpic-smart-gap
├── trimmed_with_ClipKIT_smart-gap
├── trimmed_with_Gblocks
└── trimmed_with_TrimAl

2.5.2 Subfolder `2_tree_construction/`

Task: To perform phylogenetic tree inference using the six-gene matrices.

As a constraint (the file clean.constraint.tre), we used the tree from the IQ-TREE partitioned ML analysis of matrix 3.

This folder contains the tree construction commands and the resulting trees:

.
├── trimmed_with_ClipKIT_gappy
│   ├── clean.constraint.tre
│   ├── clean.input.ultrametric.tree
│   └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpi
│   ├── clean.constraint.tre
│   ├── July-9-2023-v1.six-gene.ClipKIT_kpi.partitioned.IQ-TREE.rooted.treePL-dated.tre
│   └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpi-gappy
│   ├── clean.constraint.tre
│   ├── July-9-2023-v1.six-gene.ClipKIT_kpi-gappy.partitioned.IQ-TREE.rooted.treePL-dated.tre
│   └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpi-smart-gap
│   ├── clean.constraint.tre
│   ├── July-9-2023-v1.six-gene.ClipKIT_kpi-smart-gap.partitioned.IQ-TREE.rooted.treePL-dated.tre
│   └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpic
│   ├── clean.constraint.tre
│   ├── July-9-2023-v1.six-gene.ClipKIT_kpic.partitioned.IQ-TREE.rooted.treePL-dated.tre
│   └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpic-gappy
│   ├── clean.constraint.tre
│   ├── July-9-2023-v1.six-gene.ClipKIT_kpic-gappy.partitioned.IQ-TREE.rooted.treePL-dated.tre
│   └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_kpic-smart-gap
│   ├── clean.constraint.tre
│   ├── clean.input.ultrametric.tree
│   └── multi-partition.iqtree.sh
├── trimmed_with_ClipKIT_smart-gap
│   ├── clean.constraint.tre
│   ├── July-9-2023-v1.six-gene.ClipKIT_smart-gap.partitioned.IQ-TREE.rooted.treePL-dated.no-outgroup-and-duplicate.no-outgroup-and-duplicate.binary.ultrametric.tre
│   └── multi-partition.iqtree.sh
├── trimmed_with_Gblocks
│   ├── clean.constraint.tre
│   ├── July-9-2023-v1.six-gene.Gblocks.partitioned.IQ-TREE.rooted.treePL-dated.tre
│   └── multi-partition.iqtree.sh
└── trimmed_with_TrimAl
    ├── clean.constraint.tre
    ├── July-9-2023-v1.six-gene.TrimAl.partitioned.IQ-TREE.rooted.treePL-dated.tre
    └── multi-partition.iqtree.sh

2.5.3 Subfolder `3_dating/`

Task: Calibrating divergence times in the six-gene trees based on the dated UCE-matrix3-IQ-TREE phylogeny.

Divergence times of the six-gene trees are estimated using TreePL (v1.0) (Smith and O'Meara 2012) and the R function congruify.phylo() in geiger (v2.0.11) (Pennell et al. 2014), with our dated UCE-matrix 3 tree (see the section Divergence time calibration) used as the reference tree in all analyses.

Since our reference tree is a subset of each of the target trees in terms of tip taxa and topology, the concordant nodes between the reference tree and a target tree can be easily determined by the congruify.phylo() function (Eastman et al. 2013). Thus, the divergence times of all nodes in the reference tree serve as calibration points for the target tree, which are passed to TreePL via congruify.phylo() to calibrate the node ages of the target tree (Eastman et al. 2013).

setwd('./')

library(geiger)

# the input six-gene tree
target_tree_file=
out_tree=

target_tree=read.tree(target_tree_file)

# the dated UCE-matrix3-IQ-TREE tree
ref_tree <- read.tree('../../../reference_tree.mean-time-only.with-CH-codes.tre')

res <- congruify.phylo(ref_tree, target_tree, scale='treePL', ncores = 2)
write.tree(res$phy, file=out_tree)

2.6 Folder `6_comparative_analysis/`

Tasks:

Modeling speciation/extinction dynamics as well as trait evolution
Testing for correlations between speciation/extinction and trait evolution

We focus on four traits:

body size, using male carapace width as a proxy;
leg length, using the combined length of male tibia 1 + metatarsus 1 as a proxy, because these can be measured easily and precisely, and they are available from the taxonomic literature;
relative leg length, which is the ratio of leg length to body size, and
leg proportion, which is the ratio of male metatarsus 1 to tibia 1; it can be used as a proxy for microhabitat preference (i.e., “space”, “leaf”, and “ground”), in which leaf and space living pholcids tend to have larger leg proportion values (Eberle et al. 2018).

Note: These trait data (please refer to the file Trait data for Meng et al 2025.xlsx) have also been submitted to the MorphoBank (http://morphobank.org/permalink/?P6008). During the review process of the manuscript, reviewers may access the data by visiting https://morphobank.org/index.php/LoginReg/form, then enter the project number P6008 in the email address field and the reviewer login password (ClipKIT_smart-gap_2025). Once the manuscript is accepted, the dataset and this repository (https://doi.org/10.5061/dryad.r2280gbnh) will be released and made publicly accessible.

Input trees:
To explicitly account for the impact of topological uncertainty on our results while ensuring analytical tractability, we selected three representative trees for subsequent comparative analyses:

a primary tree based on the ClipKIT smart-gap trimmed-MSA;
an alternative tree based on the ClipKIT gappy trimmed-MSAs;
another alternative tree based on the kpic-smart-gap trimmed-MSAs.

These trees were chosen based on two criteria:

minimum number of species obviously misplaced at the subfamily level; and
monophyly of uncontested genera.

This folder contains the following content:

.
├── 0.dated_trees
├── 1.diversification.BAMM
├── 2.trait_evolution.BAMM.subfam-level-sampling-fraction
├── 3.model_fitting
├── 4.phylogenetic_signal
├── 5.ancestral_state
├── 6.relationships
├── 7.disparity.28-05-2024
├── 8.linage_through_time
├── Trait data for Meng et al 2025.xlsx
└── readme.txt

The following books are a good start to learn comparative analysis:

Phylogenetic Comparative Methods (by Luke J. Harmon; https://lukejharmon.github.io/pcm/pdf/phylogeneticComparativeMethods.pdf
Phylogenetic Comparative Methods in R (by Liam J. Revell and Luke J. Harmon; http://www.phytools.org/Rbook/)

2.6.0 Subfolder `0.dated_trees/`

Description: Tree files and Trait files.

For instance, for the ClipKIT_smart-gap six-gene trees:

Filename explanations:

*.body_size_and_leg_length.*: relative leg length, which is the ratio of leg length to body size.
*.leg_length.*: leg length (using the combined length of male tibia 1 + metatarsus 1 as a proxy)
*.m_carapace_width.*: body size (using male carapace width as a proxy)
*.met_to_tibia_ratio.*: leg proportion (the ratio of male metatarsus 1 to tibia 1; it can be used as a proxy for microhabitat preference (i.e., “space”, “leaf”, and “ground”), in which leaf and space living pholcids tend to have larger leg proportion values (Eberle et al. 2018))
*.microhabitat.*: the raw microhabitat data (only available for a few species)

Tree files:

$ ls 0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/*.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.body_size_and_leg_length.final.codes.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.leg_length.final.codes.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.m_carapace_width.final.codes.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.met_to_tibia_ratio.final.codes.ultrametric.tre
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.microhabitat.final.codes.ultrametric.tre

Trait files:

$ ls 0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/*.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.body_size_and_leg_length.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.leg_length.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.m_carapace_width.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.met_to_tibia_ratio.final.tsv
0.dated_trees/2.partitioned/treePL/step5.maybe.non-monophyletic.only_tips_with_traits/ClipKIT_smart-gap/used.code.sp.microhabitat.final.tsv

2.6.1 Subfolder `1.diversification.BAMM/`

Task: Modeling speciation and extinction rate dynamics using BAMM with estimated species richness and currently known species richness. Genus-specific sampling fractions are applied.

2.6.1.1 Folder `May-14-2024.Estimated/2.partitioned/treePL/`

Task: Modeling speciation and extinction rate dynamics using BAMM with estimated species richness.

The analyses are run on three selected six-gene trees:

.
├── ClipKIT_gappy/
├── ClipKIT_kpic-smart-gap/
└── ClipKIT_smart-gap/

For instance, in the ClipKIT_smart-gap/ folder, there are the following folders:

.
├── 1.get_prior
├── 2.run1
└── 3.run2

We calculate the priors of related parameters (lambdaInitPrior, lambdaShiftPrior, and muInitPrior ) in the 1.get_prior/ folder:

$ cat get_prior.R
library(BAMMtools)
setBAMMpriors(read.tree("clean.input.ultrametric.tree"))

And then run the analysis in 2.run1/ and 3.run2/ after modify the parameters in the corresponding control_diversification.txt file:

$ cd ../2.run1
$ cat b.sh
/home/gmeng/soft/bamm-v2.5.0/build/bamm -c control_diversification.txt

Note that, in the control_diversification.txt file, we use:

modeltype = speciationextinction
# Specify "speciationextinction" or "trait" analysis

treefile = ../1.get_prior/clean.input.ultrametric.tree
# File name of the phylogenetic tree to be analyzed

sampleProbsFilename = ../1.get_prior/EstimatedNum.tip.group.frac.tsv
# File name containing species-specific sampling fractions

expectedNumberOfShifts = 50
# prior on the number of shifts in diversification
# Suggested values:
#     expectedNumberOfShifts = 1.0 for small trees (< 500 tips)
#         expectedNumberOfShifts = 10 or even 50 for large trees (> 5000 tips)

The ../1.get_prior/EstimatedNum.tip.group.frac.tsv file specifies the genus-specific sampling fractions.

Now, we have:

$ ls *nb
step2_Analysis_of_rate_shifts.ipynb	
step3_rate_through_time_by_groups.ipynb

In the Jupyter Notebook file step2_Analysis_of_rate_shifts.ipynb, we check if the MCMC chains converge, draw mean phylorate plots (i.e., speciation, extinction, net diversification rates), calculate clade-specific evolutionary rates, etc., following http://bamm-project.org/documentation.html.

Furthermore, we draw rate-through-time plots in the Jupyter Notebook file step3_rate_through_time_by_groups.ipynb.

2.6.1.2 Folder `May-14-2024.Current/2.partitioned/treePL/`

Task: Modeling speciation and extinction rate dynamics using BAMM with currently known species richness.

Similar to May-14-2024.Estimated/2.partitioned/treePL/.

2.6.2 Subfolder `2.trait_evolution.BAMM.subfam-level-sampling-fraction/`

Task: Similar to 1.diversification.BAMM/, but with subfamily-specific sampling fractions.

2.6.3 Subfolder `3.model_fitting/`

Task: Perform model fitting analyses for the four traits.

.
├── 1.body_size/
├── 2.leg_length/
├── 3.leg_length.to.body_size.ratio/
└── 5.met_to_tibia.ratio/

For instance, for the relative leg length (3.leg_length.to.body_size.ratio/), we run

Rscript -e "rmarkdown::render('trait_evo_model_fitting.Rmd')"

In the trait_evo_model_fitting.Rmd file, we test multiple trait evolution models and find the best-fitted model:

EvoModels <- c("BM","OU","EB","rate_trend","lambda","kappa","delta","mean_trend","white")

fitEvoModels <- lapply(EvoModels,
                       FUN = function(x) fitContinuous(tree,
                                                       trait_vector, SE=NA, model = x, ncores = 8,
                                                       control = list(niter=30000, method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent", "subplex"))))

AICs <- sapply(fitEvoModels, FUN = function(x) x$opt$aicc)
names(AICs) <- EvoModels
AICtable <- aicw(AICs)
AICtable

2.6.4 Subfolder `4.phylogenetic_signal/`

Task: Calculating phylogenetic signals of the four traits with Lambda and Blomberg’s K.

For instance, check the Jupyter Notebook file: 3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/phylogenetic_signal.ipynb

2.6.5 Subfolder `5.ancestral_state/`

Task: Perform ancestral state reconstruction for the four traits.

.
├── 1.body_size
├── 2.leg_length
├── 3.leg_length.to.body_size.ratio
└── 5.met_to_tibia.ratio

Let's use relative leg length and the ClipKIT_smart-gap tree as an example (5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/)::

The following R scripts are used for constructing the ancestral states of the trait based on different trait evolution models (BM, EB, OU, trend):

$ ls -l ace.*.r | awk '{print $NF}'
ace.BM.r
ace.EB.r
ace.OU.r
ace.trend.r

For instance, here we test the BM model:

tree_file <- "input.tre"
trait_file <- "input.trait.txt"

tree <- read.tree(file=tree_file)
length(tree$tip.label)

df <- read.table(trait_file)
head(df)

trait_vector0 <- as.vector(df[,2])
head(trait_vector0)

names(trait_vector0)  <- df[,1]
head(trait_vector0)

matched_tree_trait_vector <- match.phylo.data(tree, trait_vector0)
trait_vector <- matched_tree_trait_vector$data

fit <- ace(trait_vector, tree, type="continuous", method="REML", CI=TRUE, model="BM")

save(fit, file='fit.BM.Rdata')

Phenograms

After obtaining the ancestral states, we draw the phenograms for all species and subfamilies, respectively. For the subfamily phenogram, we color each subfamily in the whole tree (grey background), so we can see how the trait of each subfamily evolved:

# group information specified by the file 'code.subfam.tsv'
subfam_group <- read.table('code.subfam.tsv', header = TRUE)

subfam_group_2 <- list(
  Artema=subfam_group[subfam_group$subfam == "Artema", ]$code,
  Priscula = subfam_group[subfam_group$subfam == "Priscula", ]$code,
  Arteminae = subfam_group[subfam_group$subfam == "Arteminae", ]$code,
  Modisiminae = subfam_group[subfam_group$subfam == "Modisiminae", ]$code,
  Ninetinae = subfam_group[subfam_group$subfam == "Ninetinae", ]$code,
  Pholcinae = subfam_group[subfam_group$subfam == "Pholcinae", ]$code,
  Smeringopinae = subfam_group[subfam_group$subfam == "Smeringopinae", ]$code)

tree4 <- groupOTU(tree3, subfam_group_2, 'group')
# create a 'group' item.

pdf(file = "trait.ace.BM.phenogram.color-by-subfam.pdf",
    width = 8, height = 12)
p <- ggtree(tree4, aes(color=group), yscale = "trait", alpha=.9)
revts(p) + theme_minimal()
dev.off()

However, the ancestral state reconstruction step can take a long time to finish, and I want to test different visualization methods. One solution is to first save the ancestral states into a file (e.g., the R code save(fit, file='fit.BM.Rdata') in the ace.BM.r file). In the end, we use the R script 5.ancestral_state/4.m_tibia_1.to.body_size.ratio/2.partitioned/treePL/ClipKIT_kpic-smart-gap/2nd_phenogram-highlight/2nd_draw_phenogram.highlight.r to achieve better visualization:

Linear Phenograms

$ cd 5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/2nd_phenogram-highlight

$ Rscript /path/to/5.ancestral_state/4.m_tibia_1.to.body_size.ratio/2.partitioned/treePL/ClipKIT_kpic-smart-gap/2nd_phenogram-highlight/2nd_draw_phenogram.highlight.r -f ../fit.BM.Rdata -p BM

Circular Phenograms

Using the R script 5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/recoloring_phenogram.circular/recoloring_phenogram.circular.R, we can draw circular phenograms, coloring the tip labels by subfamily :

$ cd 5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/recoloring_phenogram.circular

$ Rscript $script_file -f ../fit.BM.Rdata -p BM

Furthermore, we also separate the ancestral states for different subfamilies using the Jupyter Notebook 5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/trait_std_slop_by_group.ipynb, which generates a file named BM.internal-trait_by_group.csv.

$ head  BM.internal-trait_by_group.csv
"trait","clade"
9.4,"Pholcinae"
12.7692307692308,"Pholcinae"
14.6923076923077,"Pholcinae"
15.8888888888889,"Pholcinae"
20.2307692307692,"Pholcinae"
19.9166666666667,"Pholcinae"
9.4,"Pholcinae"
7.35849056603773,"Pholcinae"
10.5454545454545,"Pholcinae"

Trait density plot

With the file BM.internal-trait_by_group.csv, we can have a look at the overall distribution of the traits. We can do this by drawing a density plot (see trait_density_plot/draw.trait_density_plot.R).

trait_file <- "../BM.internal-trait_by_group.csv"

ylab <- "Clade"
xlab <- expression(ratio == frac(leg~length, body~size))  # Customize x-axis title with ratio

# Define the colors for each clade (color-blind friendly palette)
clade_colors <- c(
  "Smeringopinae" = "#0072B2",  # Blue
  "Pholcinae" = "#E69F00",      # Orange
  "Arteminae" = "#F0E442",      # Yellow
  "Modisiminae" = "#009E73",    # Green
  "Priscula" = "#D55E00",       # Red
  "Artema" = "#CC79A7",         # Pink
  "Ninetinae" = "#56B4E9",      # Sky Blue
  "Pholcidae" = "#984EA3"       # Purple
)

## 3. trait data
trait_df <- read.csv(trait_file)
head(trait_df)


# Set the order of clades
trait_df$clade <- factor(trait_df$clade, levels = rev(c("Smeringopinae", "Pholcinae", "Arteminae", "Modisiminae", "Ninetinae", "Artema", "Priscula", "Pholcidae"))) # Specify the desired order

p <- ggplot(trait_df, aes(x = trait, y = clade, fill = clade)) +
  geom_density_ridges(alpha = 0.8) +
  scale_fill_manual(values = clade_colors) + # Set custom colors
  theme_ridges() +
  theme(
    legend.position = "none", # Remove the legend
    axis.title.x = element_text(size = 18, hjust = 0.5), # Set x-axis title size and center it
    axis.title.y = element_text(size = 18, hjust = 0.5), # Set y-axis title size and center it
    axis.text.x = element_text(size = 17), # Set x-axis text size
    axis.text.y = element_text(size = 17), # Set y-axis text size
  ) +
  labs(
    x = xlab,  # Customize x-axis title
    y = ylab   # Customize y-axis title
  )

ggsave(filename = "trait_density_plot.pdf",
       width = 6, height = 5,
       plot = p)

Correlations between disparity and diversification rates

Task: Test for the correlations between trait disparity (measured as the standard variance of the trait values) and speciation/extinction/net diversification rates.

The 'rate' files have two versions: (1) Genus-specific sampling fractions are determined by estimated species richness; (2) Genus-specific sampling fractions are determined by currently known species. Please refer to the section 2.6.1 Subfolder 1.diversification.BAMM/. Therefore, we have the following R script files:

$ ls *r | grep trait1-trait2
Current.trait1-trait2-regression.diversification.r
Current.trait1-trait2-regression.extinction.r
Current.trait1-trait2-regression.speciation.r
Estimated.trait1-trait2-regression.diversification.r
Estimated.trait1-trait2-regression.extinction.r
Estimated.trait1-trait2-regression.speciation.r

Relationship between trait disparity and speciation rate
For instance, in the Estimated.trait1-trait2-regression.speciation.r file:

tree_file <- 'input.tre.monophyletic_subfamily.collapsed.tre'
rate_file <- '../../../../../1.diversification.BAMM/May-14-2024.Estimated/2.partitioned/treePL/ClipKIT_smart-gap/all_subfamilies_df.csv'
trait_file <- "BM.internal-trait_by_group.csv"


ylab <- "Standardized Standard Variance"
xlab <- "Standardized Speciation Rate"

# Define the colors for each clade (color-blind friendly palette)
clade_colors <- c(
  "Smeringopinae" = "#0072B2",  # Blue
  "Pholcinae" = "#E69F00",      # Orange
  "Arteminae" = "#F0E442",      # Yellow
  "Modisiminae" = "#009E73",    # Green
  "Priscula" = "#D55E00",       # Red
  "Artema" = "#CC79A7",         # Pink
  "Ninetinae" = "#56B4E9",      # Sky Blue
  "Pholcidae" = "#984EA3"       # Purple
)

# read the subfamily-level tree
subfam_tree <- read.tree(tree_file)

# read the rate file
rate_df <- read.csv(rate_file)
# it has below columns:
#'speciation_rate''extinction_rate''clade''diversification_rate'

# Calculate the mean speciation rate for each clade
rate_df_2 <- aggregate(speciation_rate ~ clade, data = rate_df, mean)

# read the trait file
trait_df <- read.csv(trait_file)


# Calculate the variance of traits within each clade
trait_variance_df <- aggregate(trait ~ clade, data = trait_df, var)
names(trait_variance_df)[2] <- "trait_variance"

trait_variance_df$standard_variance <- sqrt(trait_variance_df$trait_variance)
# now 'trait_variance_df' has three columns:
# clade	trait_variance	standard_variance


# Combine the trait variance and speciation rate dataframes by 'clade'
combined_df <- merge(trait_variance_df, rate_df_2, by = "clade")


# prepare trait data for correlation test
trait_1 <- combined_df[['standard_variance']]
trait_1 <- setNames(trait_1, combined_df[['clade']])
trait_1

spp <- names(trait_1)

trait_2 <- combined_df[['speciation_rate']]
trait_2 <- setNames(trait_2, combined_df[['clade']])
trait_2

# Standardize variables
trait_1 <- scale(trait_1)
trait_2 <- scale(trait_2)


# PGLS test
# BM model
corBM <- corBrownian(phy=subfam_tree,form=~spp)

pgls.fit.BM <- gls(trait_1~trait_2,correlation=corBM, method='ML')
pgls.fit.BM.summary <- summary(pgls.fit.BM)


# Extract coefficient and p-value for trait_2
coefficient_trait_2 <- pgls.fit.BM.summary$tTable[trait_2_row_index, "Value"]
p_value_trait_2 <- pgls.fit.BM.summary$tTable[trait_2_row_index, "p-value"]

...

## draw BM
pdf("Estimated.trait1_trait2.scatter.reg.PGLS.BM.std.speciation.pdf")
## set margins
par(mar = c(6.1, 6.1, 1.1, 1.1), mgp = c(4, 1, 0))
## graph scatterplot of contrasts
plot(trait_1~trait_2,
    ylab=ylab,
    xlab=xlab,
    pch=21,
    col="black",  # Outline color
    bg=clade_colors[spp],  # Fill color based on clades
    cex=2,
    las=1,
    cex.axis=1.5,
    cex.lab=2,
    bty="n")

text(trait_2, trait_1, labels = spp, cex = 1.5, pos = text_positions, xpd = FALSE)

# Add text for coefficient_trait_2 and p_value_trait_2
text(x = 0.02, y = -0.1, cex = 1.5, labels = paste("Slop :", round(coefficient_trait_2, 3), ", p-value:", round(p_value_trait_2, 3)), pos = 4, col = "#984EA3")

## add gridlines to the plot
abline(h=0,lty="dotted")
abline(v=0,lty="dotted")

## reset graphing limits of the plot to the
## x/y range of our PICs
clip(min(trait_2),max(trait_2), min(trait_1),max(trait_1))
# clip(x_min, x_max, y_min, y_max)

## graph our fitted line
# a, b : single values specifying the intercept and the slope of the line
abline(a = coef(pgls.fit.BM)[1], b = coef(pgls.fit.BM)[2],lwd=2,col="#984EA3")

dev.off()

A resulting file Estimated.trait1_trait2.scatter.reg.PGLS.BM.std.speciation.pdf will be generated, showing the relationship between speciation rate and standard variance ('disparity').

2.6.6 Subfolder `6.relationships/`

Task: Using the QuaSSE/ES-Sim/PGLS method to test for the correlations between trait evolution and species diversification, among others.

This folder contains the following subfolders:

.
├── 2.trait-trait.correlation
└── 3.trait-dependent-diversification

2.6.6.1 `2.trait-trait.correlation/`

Task: Test for correlations between different traits using PGLS, and get trait distribution over microhabitats (leaf, space, and ground)

.
├── 1.body_size.vs.leg_length
├── 2.body_size.vs.leg_length.to.body_size.ratio
├── 3.body_size.vs.m_tibia_1.to.body_size.ratio
├── 4.leg_length.vs.m_tibia_1.to.body_size.ratio
├── 5.body_size.vs.raw-microhabitat
├── 6.body_size.vs.ratio-microhabitat
├── 7.leg_length.vs.raw-microhabitat
├── 8.leg_length.vs.ratio-microhabitat
├── 9.leg_length.to.body_size.ratio.vs.raw-microhabitat
├── 10.leg_length.to.body_size.ratio.vs.ratio-microhabitat
├── 11.m_tibia_1.to.body_size.ratio.vs.raw-microhabitat
├── 12.m_tibia_1.to.body_size.ratio.vs.ratio-microhabitat
├── 13.raw-microhabitat.vs.ratio-microhabitat
└── 14.species-richness_vs_trait_std

Here are several examples.

Allometric relationship between body size and leg length

$ cd 6.relationships/2.trait-trait.correlation/1.two-traits/1.body_size.vs.leg_length/2.partitioned/treePL/ClipKIT_smart-gap

The script file two-traits-corelation.R:

treefile <- "input.tre"
trait_file <- 'clean.trait1_trait2.txt'

ylab <- "log2(body size)"
xlab <- "log2(leg length)"

tree <- read.tree(file=treefile)
length(tree$tip.label)
tree

trait_df <- read.table(trait_file)
head(trait_df)

# prepare trait data

trait_1 <- trait_df[['V2']]
trait_1 <- setNames(trait_1, trait_df[['V1']])
head(trait_1)

trait_2 <- trait_df[['V3']]
trait_2 <- setNames(trait_2, trait_df[['V1']])
head(trait_2)

# ...
# ...
spp <- names(trait_1)


# BM model
corBM <- corBrownian(phy=tree,form=~spp)
corBM

pgls.fit.BM <- gls(trait_1~trait_2,correlation=corBM, method='ML')

pgls.fit.BM.summary <- summary(pgls.fit.BM)
pgls.fit.BM.summary

coef(pgls.fit.BM)
abs(coef(pic.fit)[1]-coef(pgls.fit.BM)[2])

summary(pic.fit)$coefficients[1,4]
summary(pgls.fit.BM)$tTable[2,4]
coef(pgls.fit.BM)

## draw BM
pdf("trait1_trait2.scatter.reg.PGLS.BM.pdf")
## set margins
par(mar=c(5.1,5.1,1.1,1.1))
## graph scatterplot of contrasts
plot(trait_1~trait_2,
    ylab=ylab,
    xlab=xlab,
    pch=21,bg="gray",cex=1.2,las=1,
    cex.axis=0.7,cex.lab=0.9,bty="n")

## add gridlines to the plot
abline(h=0,lty="dotted")
abline(v=0,lty="dotted")

## reset graphing limits of the plot to the
## x/y range of our PICs
clip(min(trait_2),max(trait_2), min(trait_1),max(trait_1))

## graph our fitted line
# a, b : single values specifying the intercept and the slope of the line
abline(a = coef(pgls.fit.BM)[1], b = coef(pgls.fit.BM)[2],lwd=2,col="red")

dev.off()

For better visualization, we can color the data points by subfamily, which can be done with the script 6.relationships/2.trait-trait.correlation/1.two-traits/1.body_size.vs.leg_length/2.partitioned/treePL/ClipKIT_smart-gap/coloring_data_points/PGLS_by_subfam.v2.1.R.

Relationship between trait disparity and species richness

$ cd 6.relationships/2.trait-trait.correlation/1.two-traits/14.species-richness_vs_trait_std/3.leg_length.to.body_size.ratio/ClipKIT_smart-gap/

We test the relationship with both estimated species richness and currently known species richness:

.
├── 1.Estimate
└── 2.Current-And-Undescribed

In the file, for instance, we have:

# smart-gap
tree_file <- "/path/to/5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/input.tre.monophyletic_subfamily.collapsed.tre"


trait_1_file <- "/path/to/5.ancestral_state/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap/BM.internal-trait_by_group.csv"

trait_2_file <- "../Estimated.subfam.count.tsv"


ylab <- "Standardized Standard Variance"
xlab <- "Standardized Species richness"

# Define the colors for each clade (color-blind friendly palette)
clade_colors <- c(
  "Smeringopinae" = "#0072B2",  # Blue
  "Pholcinae" = "#E69F00",      # Orange
  "Arteminae" = "#F0E442",      # Yellow
  "Modisiminae" = "#009E73",    # Green
  "Priscula" = "#D55E00",       # Red
  "Artema" = "#CC79A7",         # Pink
  "Ninetinae" = "#56B4E9",      # Sky Blue
  "Pholcidae" = "#984EA3"       # Purple
)


## 1. subfamily-level tree

subfam_tree <- read.tree(tree_file)
subfam_tree

str(subfam_tree)


## 3. trait data
trait_1_df <- read.csv(trait_1_file)
head(trait_1_df)


## 3.1 Calculate the variance of traits within each clade
trait_1_variance_df <- aggregate(trait ~ clade, data = trait_1_df, var)
names(trait_1_variance_df)[2] <- "trait_variance"

trait_1_variance_df$standard_variance <- sqrt(trait_1_variance_df$trait_variance)
trait_1_variance_df
# now it has three columns:
# clade	trait_variance	standard_variance


trait_2_df <- read.table(trait_2_file, col.names = c("clade", "species_num"))
trait_2_df


combined_df <- merge(trait_1_variance_df, trait_2_df, by = "clade")
combined_df


### prepare trait data
trait_1 <- combined_df[['standard_variance']]
trait_1 <- setNames(trait_1, combined_df[['clade']])
trait_1

spp <- names(trait_1)

trait_2 <- combined_df[['species_num']]
trait_2 <- setNames(trait_2, trait_2_df[['clade']])
trait_2

# Standardize variables
trait_1 <- scale(trait_1)
trait_2 <- scale(trait_2)


# BM model
corBM <- corBrownian(phy=subfam_tree,form=~spp)
corBM

pgls.fit.BM <- gls(trait_1~trait_2,correlation=corBM, method='ML')

pgls.fit.BM.summary <- summary(pgls.fit.BM)
pgls.fit.BM.summary


# Find the row index for "trait_2"
trait_2_row_index <- which(rownames(pgls.fit.BM.summary$tTable) == "trait_2")

# Extract coefficient and p-value for trait_2
coefficient_trait_2 <- pgls.fit.BM.summary$tTable[trait_2_row_index, "Value"]
coefficient_trait_2

p_value_trait_2 <- pgls.fit.BM.summary$tTable[trait_2_row_index, "p-value"]
p_value_trait_2


# Example vector specifying text positions for each point
# 1 = below, 2 = left, 3 = above, 4 = right
text_positions <- rep(4, length(trait_1)) # Initialize all positions to 4 (right)

# Adjust positions as needed based on data
# Here you can customize this vector based on specific conditions
for (i in 1:length(trait_1)) {
  if (spp[i] == "Modisiminae" || spp[i] == "Pholcinae") {
    text_positions[i] = 2
  }
}
## draw BM
pdf("Estimated.trait1_trait2.scatter.reg.PGLS.BM.std.pdf")
## set margins
par(mar = c(6.1, 6.1, 1.1, 1.1), mgp = c(4, 1, 0))
## graph scatterplot of contrasts
plot(trait_1~trait_2,
    ylab=ylab,
    xlab=xlab,
    pch=21,
    col="black",  # Outline color
    bg=clade_colors[spp],  # Fill color based on clades
    cex=2,
    las=1,
    cex.axis=1.5,
    cex.lab=2,
    bty="n")

text(trait_2, trait_1, labels = spp, cex = 1.5, pos = text_positions, xpd = FALSE)

# Add text for coefficient_trait_2 and p_value_trait_2
text(x = 0.02, y = -0.1, cex = 1.5, labels = paste("Slop :", round(coefficient_trait_2, 3), ", p-value:", round(p_value_trait_2, 3)), pos = 4, col = "#984EA3")

## add gridlines to the plot
abline(h=0,lty="dotted")
abline(v=0,lty="dotted")

## reset graphing limits of the plot to the
## x/y range of our PICs
clip(min(trait_2),max(trait_2), min(trait_1),max(trait_1))
# clip(x_min, x_max, y_min, y_max)

## graph our fitted line
# a, b : single values specifying the intercept and the slope of the line
abline(a = coef(pgls.fit.BM)[1], b = coef(pgls.fit.BM)[2],lwd=2,col="#984EA3")

dev.off()

Leg length distribution over microhabitats:

$ cd 6.relationships/2.trait-trait.correlation/1.two-traits/7.leg_length.vs.raw-microhabitat/2.partitioned/treePL/ClipKIT_smart-gap

pdf("trait-distribution-by-microhabitat.pdf")

cols <- c("#F76D5E", "#FFFFBF", "#72D8FF")

ggplot() +
geom_density(data=trait_df,  aes(x = trait_1), colour = "red", linetype = "dashed") +
geom_density(data=trait_df,  aes(x = trait_1, fill = microhabitat), alpha = 0.6) +
  scale_fill_manual(values = cols) +
xlab(trait_1_name)

dev.off()

2.6.6.2 `3.trait-dependent-diversification/`

Task: Test for correlations between trait evolution and species diversification.

I will explain this with relative leg length-dependent diversification as an example:

$ cd 6.relationships/3.trait-dependent-diversification/1.one-trait-dependent/5.leg_length.to.body_size.ratio.vs.divrsification/2.partitioned/treePL/ClipKIT_smart-gap

QuaSSE

The analysis here basically follows the following tutorial

Analysing diversification with diversitree (Rich FitzJohn, with GeoSSE by Emma Goldberg; https://www.zoology.ubc.ca/prog/diversitree/doc/diversitree-tutorial.pdf

The QuaSSE (Quantitative State Speciation and Extinction) model in the R package diversitree (v0.9-20) (FitzJohn 2010), which simultaneously analyzes the relationship between changes in continuous traits and diversification and can suffer from model misspecification and phylogenetic pseudoreplication (Harvey and Rabosky 2018).

A global sampling fraction was applied, since the model does not support clade-specific sampling fractions.

We fitted the speciation and extinction rates as constant (1.nodrift.mle.c.R), linear (*.mle.l.R), sigmoidal (*.mle.s.R), and hump-shaped (*.mle.h.R) functions of the traits, respectively.

We also tested the models without the drift parameter (i.e., assuming a constant mean of trait values through time).

Finally, we performed a likelihood ratio test to compare the model fits.

$ ls -l *R | awk '{print $NF}'
1.nodrift.mle.c.R
2.1.mle.s.R
2.2.mle.l.R
2.3.mle.h.R
3.1.mle.s.R
3.2.mle.l.R
3.3.mle.h.R
4.1.mle.s.R
4.2.mle.l.R
4.3.mle.h.R

The 1.nodrift.mle.c.R assumes speciation ('lambda') and extinction ('mu') rates are a constant function of trait (f.c <- make.spiders(constant.x, constant.x)).

The 2.1.mle.s.R assumes the speciation rate is a sigmoidal function of a trait, while the extinction rate is a constant function of a trait (f.s <- make.spiders(sigmoid.x, constant.x)), combined with the presence or absence of the drift parameter:

mle.s <- find.mle(nodrift(f.s), p.s, control=control, verbose=0)
mle.d.s <- find.mle(f.s, coef(mle.s, TRUE), control=control, verbose=0)

The 3.1.mle.s.R assumes the extinction rate is a sigmoidal function of a trait, while the speciation rate is a constant function of a trait (f2.s <- make.spiders(constant.x, sigmoid.x)), combined with the presence or absence of the drift parameter:

mle2.s <- find.mle(nodrift(f2.s), p2.s, control=control, verbose=0)
mle2.d.s <- find.mle(f2.s, coef(mle2.s, TRUE), control=control, verbose=0)

The 4.1.mle.s.R assumes both speciation rate and extinction rate are a sigmoidal function of a trait (f3.s <- make.spiders(sigmoid.x, sigmoid.x)), combined with the presence or absence of the drift parameter:

mle3.s <- find.mle(nodrift(f3.s), p3.s, control=control, verbose=0)
mle3.d.s <- find.mle(f3.s, coef(mle3.s, TRUE), control=control, verbose=0)

Note that the estimated parameter values from a previous step serve as the initial values of the subsequent steps (i.e., 1.nodrift.mle.c.R -> 2.1.mle.s.R -> 3.1.mle.s.R -> 4.1.mle.s.R). Each step can take a very long time to run. This is why I save the results from the previous step into an Rdata file (e.g., save.image(file='until.nodrift.mle.c.Rdata')) and load them again at the subsequent step (e.g., load('until.nodrift.mle.c.Rdata')).

Finally, model comparison is performed with the anova() function in R. Please refer to the Jupyter Notebook single_partition.annova.ipynb for details.

ES-Sim test

The tip-rate correlation ES-sim test, which does not require a fully parameterized model of diversification and trait evolution, and directly examines the correlation between trait variation across the tips of a phylogeny and tip-specific speciation rate estimates, and is thus believed to have rare false inferences (Harvey and Rabosky 2018).

We used 20,000 simulations to build the null distribution of trait-speciation associations for significance testing:

es_result <- essim(tree, trait_vector, nsim = 20000)

Please refer to https://github.com/mgharvey/ES-sim for more details.

2.6.7 Subfolder `7.disparity.28-05-2024/`

Task: Draw disparity through time plots with the function dtt() for each subfamily and each trait.

For instance,

$ cd 7.disparity.28-05-2024/3.leg_length.to.body_size.ratio/2.partitioned/treePL/ClipKIT_smart-gap

Rscript ../../../../subfamily_disparity.r

For better visualization, I also save the related data into a file (see the file subfamily_disparity.r):

save.image(file='my_subfamily_disparity.RData')

With the data file my_subfamily_disparity.RData, we redraw the disparity through time plot using the Jupyter Notebook file redraw-subfam-disparity.ipynb. The resulting files are all_dtt_plots.colored.pdf and all_dtt_plots.r2.colored.pdf, where different subfamilies are colored with different colors.

2.6.8 Subfolder `8.linage_through_time/`

Task: Draw a Lineage through time plot for the Pholcidae family using the function ltt.plot().

For instance, use the Jupyter Notebook file 6_comparative_analysis/8.linage_through_time/2.partitioned/treePL/ClipKIT_smart-gap/llt_plot.ipynb.

2.7 Folder `7_UCE_and_six-gene_phylogeny/`

Task: Perform phylogenetic analysis with UCE+Sanger data.

Result file: combined_UCE+Sanger_trees.pdf

Main conclusion: The phylogenetic trees from UCE+Sanger data are highly consistent with our six-gene trees, and all misplacements of taxa are found to be represented by CO1 only; thus, our conclusions on the correlations among evolutionary flexibility and other variables are robust.

In this section, we performed:

unpartitioned analysis based on ALL UCE loci + Sanger data, using IQ-TREE with the optimal model determined by ModelFinder (Kalyaanamoorthy et al. 2017) in IQ-TREE automatically.
partitioned anlaysis based on 55 UCE loci + Sanger data. The 55 UCEs were selected from matrix 3, and they have the most parsimonious sites. The tree was constructed with RAxML-NG and the model GTR+R4+I+FO.

You can find the commands and alignments, resulting tree files (*.treefile) at the corresponding folders.

./
├── 1_unpartitioned
│   ├── matrix_3.gappy
│   │   ├── FcC_supermatrix.phy
│   │   ├── iqtree.sh
│   │   ├── m3_unpartitioned_gappy.longName.tre
│   │   └── m3_unpartitioned_gappy.treefile
│   ├── matrix_3.kpic-smart-gap
│   │   ├── FcC_supermatrix.phy
│   │   ├── iqtree.sh
│   │   ├── m3_unpartitioned_kpic-smart-gap.longName.tre
│   │   └── m3_unpartitioned_kpic-smart-gap.treefile
│   └── matrix_3.smart-gap
│       ├── FcC_supermatrix.phy
│       ├── iqtree.sh
│       ├── m3_unpartitioned_smart-gap.longName.tre
│       └── m3_unpartitioned_smart-gap.treefile
├── 2_partitioned
│   ├── matrix_3.55UCEs.gappy
│   │   ├── 55UCE-Sanger.partitioned.RAxML-NG.gappy.longName.tre
│   │   ├── 55UCE-Sanger.partitioned.RAxML-NG.gappy.tree
│   │   ├── merged.m3.six-gene.FcC_supermatrix_partition.txt
│   │   ├── merged.m3.six-gene.FcC_supermatrix.phy
│   │   └── raxml.sh
│   ├── matrix_3.55UCEs.kpic-smart-gap
│   │   ├── 55UCE-Sanger.partitioned.RAxML-NG.kpic-smart-gap.longName.tre
│   │   ├── 55UCE-Sanger.partitioned.RAxML-NG.kpic-smart-gap.tree
│   │   ├── merged.m3.six-gene.FcC_supermatrix_partition.txt
│   │   ├── merged.m3.six-gene.FcC_supermatrix.phy
│   │   └── raxml.sh
│   └── matrix_3.55UCEs.smart-gap
│       ├── 55UCE-Sanger.partitioned.RAxML-NG.smart-gap.longName.tre
│       ├── 55UCE-Sanger.partitioned.RAxML-NG.smart-gap.tree
│       ├── merged.m3.six-gene.FcC_supermatrix_partition.txt
│       ├── merged.m3.six-gene.FcC_supermatrix.phy
│       └── raxml.sh
├── April-9-2024.code.subfam.tsv
├── combined_UCE+Sanger_trees.pdf
└── combined.code.subfam.tsv

April-9-2024.code.subfam.tsv: relationships between the sample codes and long names
combined.code.subfam.tsv: indicating to which subfamily a sample code belongs.

The complex interplay between evolutionary flexibility and diversification in a family of spiders

Data files

Abstract

README: The complex interplay between evolutionary flexibility and diversification in a family of spiders

1. SUPPLEMENTARY MATERIAL

2. Further details on the file data_and_scripts.zip

2.1 Folder 1_script/

2.2 Folder 2_phyluce/

2.2.1 Subfolder 1_raw_data/

2.2.2 Subfolder 2_clean_data/

2.2.3 Subfolder 3_assembly/

2.2.4 Subfolder 4_detect_cross_contamination/

2.2.5 Subfolder 5_match_contigs_to_probes/

2.2.6 Subfolder 6_extract_uce_loci/

2.3 Folder 3_UCE_phylogeny/

2.3.1 Subfolder 1_trimmed_msa/

2.3.2 Subfolder 2_tree_construction/

Example

2.3.3 Subfolder 3_UCE_phylogeny/3_dating/

2.4 Folder 4_long-name-short-nam/

2.5 Folder 5_six_gene_phylogeny/

2.5.1 Subfolder 1_trimmed_msa/

2.5.2 Subfolder 2_tree_construction/

2.5.3 Subfolder 3_dating/

2.6 Folder 6_comparative_analysis/

2.6.0 Subfolder 0.dated_trees/

2.6.1 Subfolder 1.diversification.BAMM/

2.6.1.1 Folder May-14-2024.Estimated/2.partitioned/treePL/

2.6.1.2 Folder May-14-2024.Current/2.partitioned/treePL/

2.6.2 Subfolder 2.trait_evolution.BAMM.subfam-level-sampling-fraction/

2.6.3 Subfolder 3.model_fitting/

2.6.4 Subfolder 4.phylogenetic_signal/

2.6.5 Subfolder 5.ancestral_state/