UnFATE: A comprehensive probe set and bioinformatics pipeline for phylogeny reconstruction and multilocus barcoding of filamentous ascomycetes (Ascomycota, Pezizomycotina)
Data files
Jan 23, 2025 version files 242.17 MB
- alignments_Pezizo_pilotTE.tar.xz (4.79 MB)
- Aspergillus_Penicillium_Steenwyk_2019_single_locus_alignments.tar.gz (5.22 MB)
- Aspergillus_Penicillium_Steenwyk_2019_supermatrices.tar.gz (12.43 MB)
- baits.tar.gz (2.21 MB)
- Dothideomycetes_Haridas_2020_single_locus_alignments_gblocked.tar.gz (7.79 MB)
- Dothideomycetes_Haridas_2020_supermatrices.tar.gz (12.98 MB)
- final_trees_Pezizo_pilotTE.tar.xz (6.22 KB)
- Parmeliaceae_Pizarro_2019_single_locus_alignments_gblocked.tar.gz (3.25 MB)
- Parmeliaceae_Pizarro_2019_supermatrices.tar.gz (6.59 MB)
- Pezizomycotina_Li_2021_single_locus_alignments.tar.gz (44.78 MB)
- Pezizomycotina_Li_2021_supermatrices.tar.gz (111.66 MB)
- README.md (4.64 KB)
- README.txt (257 B)
- single_locus_trees_Pezizo_pilotTE.tar.xz (23.16 MB)
- supermatrix_Pezizo_pilotTE.tar.xz (5.98 MB)
- unfate_markers_reference_sequences_DNA.tar.gz (1.31 MB)
Abstract
The subphylum Pezizomycotina (filamentous ascomycetes) is the largest clade within Ascomycota. Despite the importance of this group of fungi, our understanding of their evolution is still limited due to insufficient taxon sampling. Although next-generation sequencing technology allows us to obtain complete genomes for phylogenetic analyses, generating complete genomes of fungal species can be challenging, especially when fungi occur in symbiotic relationships or when the DNA of rare herbarium specimens is degraded or contaminated. Additionally, assembly, annotation, and gene extraction of whole-genome sequencing data require bioinformatics skills and computational power, resulting in a substantial data burden. To overcome these obstacles, we designed a universal target enrichment probe set to reconstruct the phylogenetic relationships of filamentous ascomycetes at different phylogenetic levels. From a pool of single-copy orthologous genes extracted from available Pezizomycotina genomes, we identified the smallest subset of genetic markers that can reliably reconstruct a robust phylogeny. We used a clustering approach to identify a sequence set that could provide an optimal trade-off between potential missing data and probe set cost. We incorporated this probe set into a user-friendly wrapper script named UnFATE (https://github.com/claudioametrano/UnFATE) that allows phylogenomic inferences without requiring expert bioinformatics knowledge. In addition to phylogenetic results, the software provides a powerful multilocus alternative to ITS-based barcoding. Phylogeny and barcoding approaches can be complemented by an integrated, pre-processed, and periodically updated database of all publicly available Pezizomycotina genomes. The UnFATE pipeline, using the 195 selected marker genes, consistently performed well across various phylogenetic depths, generating trees consistent with the reference phylogenomic inferences. 
The topological distance between the reference trees from the literature and the best tree produced by UnFATE ranged between 0.10 and 0.14 (nRF) for phylogenies from family to subphylum level. We also tested the in vitro success of the universal bait set in a target capture approach on 25 herbarium specimens from ten representative classes in Pezizomycotina, which recovered a topology mostly congruent with recent phylogenomic inferences for this group of fungi. The discriminating power of our gene set was also assessed with the multilocus barcoding approach, which outperformed ITS-based barcoding. With these tools, we aim to provide a framework for a collaborative approach to build robust, conclusive phylogenies of this important fungal clade.
README: UnFATE: A Comprehensive Probe Set and Bioinformatics Pipeline for Phylogeny Reconstruction and Multilocus Barcoding of Filamentous Ascomycetes (Ascomycota, Pezizomycotina)
The repository includes the representative sequences of the 195 UnFATE genes and the baits designed from them, as well as the single locus trees, alignments, and final phylogenies for the proof-of-concept Pezizomycotina phylogeny inferred using the universal probe set and the pipeline we developed (files whose names contain "Pezizo_pilotTE"). It also includes the supermatrices and single locus alignments generated by mining the 195 genes of our gene set from the publicly available genomes used in published phylogenomic inferences.
File description: the following tar.gz files contain the reference sequences (unfate_markers_reference_sequences_DNA.tar.gz) obtained from the clustering approach adopted to find the best representative sequences, and the universal bait set built from them (baits.tar.gz).
See Ametrano et al. 2025 (Systematic Biology journal) for comprehensive methods.
File list:
- unfate_markers_reference_sequences_DNA.tar.gz
- baits.tar.gz
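
The archives in this repository use two compression formats (tar.gz and tar.xz). As a convenience, the following is a minimal sketch, using only the Python standard library, of how their contents could be listed or unpacked. The function names `list_archive` and `extract_archive` are illustrative and are not part of the UnFATE pipeline.

```python
import tarfile
from pathlib import Path


def list_archive(archive_path):
    """Return the names of all regular files stored in a tar archive."""
    # Mode "r:*" lets tarfile detect gzip (.tar.gz) or xz (.tar.xz) compression.
    with tarfile.open(archive_path, "r:*") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


def extract_archive(archive_path, destination):
    """Unpack a tar archive into `destination`, creating the folder if needed."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(destination)
    return destination
```

For example, `list_archive("baits.tar.gz")` would return the file names inside the bait archive once it has been downloaded, and `extract_archive("baits.tar.gz", "baits")` would unpack it into a `baits` folder.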
File description: the following tar.gz files contain the analyses performed using the UnFATE pipeline on existing genome assembly datasets from previously published phylogenomic inferences (e.g., the Parmeliaceae metagenome dataset from Pizarro et al., 2019). Both the single locus alignments and the concatenated (block-filtered) alignments are provided.
In the single locus alignment archives, "AA" stands for amino acids and "NT" for nucleotides; if the "-gb" suffix is present, the alignments were block-filtered with Gblocks. "macsed" means the sequences were aligned with the MACSE2 pipeline. The beginning of each file name gives the gene name and the taxon code from OrthoDB (e.g., 1627at4890).
In the supermatrix archives, "FcC" stands for FASconCAT (the tool used to perform concatenation). "AA" stands for amino acids and "NT" for nucleotides; if the "-gb" suffix is present, the alignments were block-filtered with Gblocks.
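
The naming conventions above can be decoded programmatically. The following is a rough sketch that assumes the tags ("AA", "NT", "FcC") are separated from the rest of the file name by non-alphanumeric characters such as "_", "-" or "."; the exact file name layout inside the archives is not specified here, so the example names used with it are hypothetical.

```python
import re


def describe_alignment_file(filename):
    """Decode the README naming conventions from an alignment file name."""

    def has_token(token):
        # Match the token only when it is not embedded in a longer word.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(token) + r"(?![A-Za-z0-9])"
        return re.search(pattern, filename) is not None

    if has_token("AA"):
        data_type = "amino acids"
    elif has_token("NT"):
        data_type = "nucleotides"
    else:
        data_type = "unknown"

    # OrthoDB-style identifier at the start of the name, e.g. 1627at4890.
    locus = re.match(r"\d+at\d+", filename)

    return {
        "locus": locus.group(0) if locus else None,
        "data_type": data_type,
        "gblocks_filtered": "-gb" in filename,
        "macse_aligned": "macsed" in filename,
        "fasconcat_supermatrix": has_token("FcC"),
    }
```

With a hypothetical name such as `1627at4890_macsed_AA-gb.fasta`, the function reports the locus `1627at4890`, amino-acid data, Gblocks filtering, and MACSE alignment.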
File list:
- Aspergillus_Penicillium_Steenwyk_2019_supermatrices.tar.gz
- Aspergillus_Penicillium_Steenwyk_2019_single_locus_alignments.tar.gz
- Dothideomycetes_Haridas_2020_supermatrices.tar.gz
- Dothideomycetes_Haridas_2020_single_locus_alignments_gblocked.tar.gz
- Parmeliaceae_Pizarro_2019_single_locus_alignments_gblocked.tar.gz
- Parmeliaceae_Pizarro_2019_supermatrices.tar.gz
- Pezizomycotina_Li_2021_single_locus_alignments.tar.gz
- Pezizomycotina_Li_2021_supermatrices.tar.gz
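
To illustrate how single locus alignments relate to a supermatrix, the sketch below concatenates per-locus alignments and records the coordinates of each locus, as a partition file would. It is a generic illustration and not a reimplementation of the concatenation tool used for these archives; in particular, padding absent taxa with the `missing` character ("-" by default) is an assumption made for the example.

```python
def concatenate_alignments(loci, missing="-"):
    """Concatenate per-locus alignments into a supermatrix.

    `loci` maps locus name -> {taxon: aligned sequence}. Taxa absent from a
    locus are padded with `missing` so every row has the same length.
    Returns (supermatrix, partitions) with 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for alignment in loci.values() for taxon in alignment})
    supermatrix = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for locus, alignment in loci.items():
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {locus} are not aligned")
        length = lengths.pop()
        for taxon in taxa:
            # Use the real sequence when present, otherwise a run of `missing`.
            supermatrix[taxon].append(alignment.get(taxon, missing * length))
        partitions.append((locus, start, start + length - 1))
        start += length
    joined = {taxon: "".join(parts) for taxon, parts in supermatrix.items()}
    return joined, partitions
```

The returned partitions list gives the column range occupied by each locus, which is the information a partitioned ML analysis needs.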
File description: the following tar.xz files contain the analyses performed using the UnFATE pipeline on the newly generated Pezizomycotina target enrichment ("TE" in the file name) data captured with the universal bait set we developed (see the manuscript). Single locus alignments, supermatrices (DNA and AA), single locus trees, and final trees (ASTRAL DNA and AA phylogenies, maximum-likelihood DNA and AA phylogenies) are provided.
File list:
- supermatrix_Pezizo_pilotTE.tar.xz: the archive contains two supermatrix folders (amino acids and nucleotides). Each holds all the files produced by FASconCAT (supermatrix, partition file, and statistics) and the output of the subsequent IQTREE2 ML run on the supermatrix itself (see the IQTREE2 manual for additional details).
- final_trees_Pezizo_pilotTE.tar.xz: within the archive, species trees produced by the ASTRAL summary method have "astralspeciestree" as a prefix, and species trees from concatenation and ML inference with IQTREE have "FcC_supermatrix*_partition" as a prefix. "blocked" means that the alignments were block-filtered before concatenation or single-tree inference; "AA" or "aa" stands for amino acids, and "dna" or "NT" for nucleotides. "SPECIES_NAME" marks the tree files in which the accession numbers are complemented with species names (if the sample is from an NCBI assembly). Files whose names start with "Accession" contain the taxonomic information for the species names present in the tree files.
- alignments_Pezizo_pilotTE.tar.xz: within the archive, "AA" stands for amino acids and "NT" for nucleotides; "trimal" means the alignments were block-filtered with TrimAl. "macsed" means the sequences were aligned with the MACSE2 pipeline. The beginning of each file name gives the gene name and the taxon code from OrthoDB (e.g., 1627at4890). "Cleaned" appears in the names of FASTA files from which sequences with > 75% missing data (gaps) were deleted.
- single_locus_trees_Pezizo_pilotTE.tar.xz: contains the IQTREE2 output for each single locus inference performed to generate single gene trees (see the IQTREE2 manual for additional details).
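
The "Cleaned" alignments described above had sequences with more than 75% missing data removed. The following is a minimal sketch of that kind of filter, assuming that missing data are encoded as "-" and that the fraction is computed over the full aligned length; the pipeline's own implementation may differ, and the function names are illustrative.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into an ordered {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def drop_gappy_sequences(records, max_missing=0.75, gap_chars="-"):
    """Keep sequences whose fraction of gap characters is at most max_missing."""
    kept = {}
    for name, seq in records.items():
        if not seq:
            continue  # an empty row carries no data
        missing = sum(seq.count(ch) for ch in gap_chars) / len(seq)
        if missing <= max_missing:
            kept[name] = seq
    return kept
```

A sequence with exactly 75% gaps is kept under this reading of the threshold, since only sequences with more than 75% missing data are described as deleted.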