Metazoa-level USCOs as markers in species delimitation and classification
Data files
Dec 26, 2023 version files 689.73 MB
- all_data_astral_trees_descriptions.txt
- all_data_astral_trees.zip
- all_data_concat_alignments_descriptions.txt
- all_data_concat_alignments.zip
- all_data_concat_trees_descriptions.txt
- all_data_concat_trees.zip
- all_data_gene_alignments_descriptions.txt
- all_data_gene_alignments.zip
- all_data_gene_trees_descriptions.txt
- all_data_gene_trees.zip
- all_data_partition_files.zip
- all_data_partitions_descriptions.txt
- astral_trees_descriptions.txt
- astral_trees.zip
- concat_alignments_descriptions.txt
- concat_alignments.zip
- concat_trees_descriptions.txt
- concat_trees.zip
- gene_alignments_descriptions.txt
- gene_alignments.zip
- gene_trees_descriptions.txt
- gene_trees.zip
- harvesting_dryad_readme.txt
- metazoa_busco_aa_alignments.zip
- metazoa_busco_aa_astral.tre
- metazoa_busco_aa_concat.tre
- metazoa_busco_aa_trees.zip
- metazoa_busco_nt_alignments.zip
- metazoa_busco_nt_trees.zip
- metazoa_busco_nt12_astral.tre
- metazoa_busco_nt12_concat.tre
- nmds_plots_descriptions.txt
- nmds_plots.zip
- partition_files_descriptions.txt
- partition_files.zip
- README.md
- snp_alignments_descriptions.txt
- snp_alignments.zip
- snp_nmds_descriptions.txt
- snp_nmds.zip
- species_delimitation_descriptions.txt
- species_delimitation.zip
- structure_plots_descriptions.txt
- structure_plots.zip
- usco_co-occurrence.txt
- usco_diff_histograms_normed.pdf
- usco_diff_histograms.pdf
- usco_position_diffs.zip
- usco_position_tables.zip
- uscos_per_chromosome.zip
Abstract
Metazoa-level Universal Single-Copy Orthologs (USCOs) are universally applicable markers for DNA taxonomy in animals which can replace or supplement single-gene barcoding. While Metazoa-level USCOs from target enrichment data were shown to reliably distinguish species, it remains to be tested whether USCOs are an evenly distributed, representative sample of a given metazoan genome, and hence can facilitate detection of past hybridization events. Besides, unlinked loci are a principal assumption in coalescent-based species delimitation approaches. 239 chromosome-level genomes were analyzed to show that Metazoa-level USCOs are a representative sample of a genome: in terms of distances to each other on a chromosome, but also over the chromosomes, they are almost as evenly distributed as protein-coding genes in general are. We tested the suitability of Metazoa-level USCOs extracted from genomes for species delimitation and phylogeny in four case studies: Anopheles mosquitos, Drosophila fruit flies, Heliconius butterflies, and Darwin’s finches. In almost all instances USCOs allowed delineating species and yielded phylogenies that correspond to those generated from whole genome data. Our results show that USCO genes can be considered as genetically unlinked for practical purposes and representative for an entire metazoan genome. Our phylogenetic analyses demonstrate that USCOs may complement single-gene barcoding and provide more accurate taxonomic inferences. Combining USCOs from sources that used different versions of ortholog reference libraries to infer marker orthology may be challenging and at times impact taxonomic conclusions. However, we expect this problem to become less severe as the size of genome reference libraries and their sampling of organismic lineages is rapidly increasing.
README
Supplementary files for study "From benchmarking to systematics: exploiting metazoan USCOs from whole genome data", L. Dietz et al., submitted
For questions, contact Dirk Ahrens (d.ahrens@leibniz-lib.de) or Lars Dietz (l.dietz@leibniz-lib.de)
Each of these ZIP files contains a folder with several files containing alignments, phylogenetic trees, or other data. For more information on these files, please consult the accompanying description files.
All alignments, including SNP datasets, are in FASTA format and can be opened with standard alignment viewers. Except in the metazoan alignments, ambiguity codes (R, Y, W, S, M, K) stand for positions inferred to be heterozygous.
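As a small illustration of the heterozygosity encoding described above, the following Python sketch expands the two-base IUPAC ambiguity codes into the allele pairs they stand for. The mapping is the standard IUPAC one; the helper name `alleles` is hypothetical and is not part of the deposited scripts.

```python
# Standard IUPAC two-base ambiguity codes and the allele pairs they represent.
IUPAC_HET = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "W": ("A", "T"),
    "S": ("C", "G"),
    "M": ("A", "C"),
    "K": ("G", "T"),
}

UNAMBIGUOUS = {"A", "C", "G", "T"}


def alleles(base):
    """Return the two alleles encoded by one alignment character (hypothetical helper)."""
    base = base.upper()
    if base in IUPAC_HET:
        return IUPAC_HET[base]   # position inferred to be heterozygous
    if base in UNAMBIGUOUS:
        return (base, base)      # homozygous position
    return None                  # gap, N, or any other missing-data symbol
```

For example, `alleles("R")` gives `("A", "G")`, while a gap character gives `None`.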
Phylogenetic trees are in NEWICK format and can be opened in a standard phylogenetic tree viewer such as FigTree. All trees are unrooted. Partition files for concatenated alignments are in NEXUS format for use in IQ-TREE.
SNP datasets recoded for NMDS analysis are tab-delimited text files. Numbers have the following meaning: 0: homozygous for more common allele, 1: heterozygous, 2: homozygous for less common allele. Unknown positions are represented by empty cells.
Histograms of distances between USCOs are in PDF format. Other figures are in SVG format and can be opened e.g. with Inkscape. All other files can be opened with a standard text editor.
Description of the data and file structure
all_data_astral_trees.zip: Coalescent-based trees from ASTRAL analysis of gene trees based on USCO DNA sequences extracted with all three approaches.
(see all_data_astral_trees_descriptions.txt for details)
all_data_concat_alignments.zip: Concatenated alignments of USCO DNA sequences extracted with all three approaches.
(see all_data_concat_alignments_descriptions.txt for details)
all_data_concat_trees.zip: Phylogenetic trees based on concatenated alignments of USCO DNA sequences extracted with all three approaches.
(see all_data_concat_trees_descriptions.txt for details)
all_data_gene_alignments.zip: DNA alignments of individual USCO loci extracted with all three approaches.
(see all_data_gene_alignments_descriptions.txt for details)
all_data_gene_trees.zip: Phylogenetic trees based on DNA alignments of individual USCO loci extracted with all three approaches.
(see all_data_gene_trees_descriptions.txt for details)
all_data_partition_files.zip: Partition files for concatenated alignments of USCO DNA sequences extracted with all three approaches.
(see all_data_partitions_descriptions.txt for details)
astral_trees.zip: Coalescent-based trees from ASTRAL analysis of gene trees based on DNA alignments of individual USCO loci of the four study cases.
(see astral_trees_descriptions.txt for details)
concat_alignments.zip: Concatenated alignments of USCO DNA sequences of the four study cases.
(see concat_alignments_descriptions.txt for details)
concat_trees.zip: Phylogenetic trees based on concatenated alignments of USCO DNA sequences of the four study cases.
(see concat_trees_descriptions.txt for details)
gene_alignments.zip: DNA alignments of individual USCO loci of the four study cases.
(see gene_alignments_descriptions.txt for details)
gene_trees.zip: Phylogenetic trees of individual USCO loci of the four study cases.
(see gene_trees_descriptions.txt for details)
metazoa_busco_aa_alignments.zip: Alignments of protein sequences of individual USCO genes extracted from metazoan genomes with BUSCO, with badly aligned regions removed.
metazoa_busco_aa_trees.zip: Phylogenetic trees based on protein sequences of individual USCO genes extracted from metazoan genomes with BUSCO.
metazoa_busco_nt_alignments.zip: Alignments of DNA sequences (1st and 2nd codon positions) of individual USCO genes extracted from metazoan genomes with BUSCO, with badly aligned regions removed.
metazoa_busco_nt_trees.zip: Phylogenetic trees based on DNA sequences (1st and 2nd codon positions) of individual USCO genes extracted from metazoan genomes with BUSCO.
nmds_plots.zip: NMDS plots based on SNPs from USCO loci of the four study cases.
(see nmds_plots_descriptions.txt for details)
partition_files.zip: Partition files for concatenated alignments of USCO DNA sequences of the four study cases.
(see partition_files_descriptions.txt for details)
snp_alignments.zip: Alignments of informative SNPs from USCO loci of the four study cases.
(see snp_alignments_descriptions.txt for details)
snp_nmds.zip: Tables of biallelic SNPs for NMDS analysis from USCO loci of the four study cases.
(see snp_nmds_descriptions.txt for details)
species_delimitation.zip: Tables of species-level entities inferred with SODA and tr2 based on USCO loci of the four study cases.
(see species_delimitation_descriptions.txt for details)
structure_plots.zip: Results of STRUCTURE clustering based on SNPs from USCO loci of the four study cases.
(see structure_plots_descriptions.txt for details)
usco_position_diffs.zip: Tables in tsv format containing distances between starting points of neighboring USCO genes in metazoan genomes. Columns are, from left to right, OrthoDB ID numbers of the genes, distance between them, logarithm of the distance, distance normalized to genome size, and logarithm of the distance normalized to genome size.
usco_position_tables.zip: Tables in tsv format output by BUSCO containing positions of USCO genes in metazoan genomes. Columns are, from left to right, OrthoDB ID number of the gene, status, NCBI accession number of the chromosome/contig sequence, starting position of the coding sequence, end position, quality score, and length.
uscos_per_chromosome.zip: Tables showing, for each analyzed genome, the name of each chromosome, its length in base pairs, the number of all protein-coding genes it contains, and number of mzl-USCOs it contains.
metazoa_busco_aa_astral.tre: Coalescent-based tree from ASTRAL analysis of gene trees based on protein alignments of USCO genes in metazoan genomes.
metazoa_busco_aa_concat.tre: Maximum-likelihood tree based on concatenated protein alignments of USCO genes in metazoan genomes.
metazoa_busco_nt12_astral.tre: Coalescent-based tree from ASTRAL analysis of gene trees based on DNA alignments (1st and 2nd codon positions) of USCO genes in metazoan genomes.
metazoa_busco_nt12_concat.tre: Maximum-likelihood tree based on concatenated DNA alignments (1st and 2nd codon positions) of USCO genes in metazoan genomes.
usco_co-occurrence.txt: Table of frequencies with which pairs of USCO genes occur on the same chromosome across metazoan taxa. Columns are, from left to right, OrthoDB ID number of first gene, OrthoDB ID number of second gene, and frequency.
usco_diff_histograms.pdf: Histograms of distances between starting points of neighboring USCO genes in metazoan genomes. x-axis shows logarithm of distance, y-axis shows number of distances in the respective size class.
usco_diff_histograms_normed.pdf: Histograms of distances between starting points of neighboring USCO genes in metazoan genomes normalized by genome size. x-axis shows logarithm of normalized distance, y-axis shows number of distances in the respective size class.
Sharing/Access information
Links to other publicly accessible locations of the data:
- none
Data was derived from the following sources:
- Metazoan genome assemblies at NCBI (https://www.ncbi.nlm.nih.gov/), see list in S1 of paper
- Raw reads from genome sequencing projects of Anopheles, Drosophila, Heliconius and Darwin's finches at NCBI (https://www.ncbi.nlm.nih.gov/), see list in S2 of paper
Code/Software
removegaps_d.pl: This script removes all positions from FASTA alignments for which sequence data is available in less than a specified number of taxa. Requires, in that order, the path to the folder containing input alignments, the minimum number of taxa for which sequence information must be available for alignment positions to be kept, and a folder where output alignments are put. Input alignment file names must end in .fas. Example: removegaps_d.pl input_folder/ 5 output_folder/
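The column filter described for removegaps_d.pl can be sketched in Python as follows. This is an illustration only, not a port of the Perl script: it works on an in-memory dictionary rather than a folder of .fas files, the function name `filter_columns` is hypothetical, and which characters count as missing data (`-`, `?`, `N`) is an assumption.

```python
def filter_columns(seqs, min_taxa, missing="-?Nn"):
    """Keep alignment columns with sequence data in at least `min_taxa` taxa.

    `seqs` maps taxon name -> aligned sequence (all of equal length).
    The set of characters treated as missing is an assumption of this sketch.
    """
    names = list(seqs)
    length = len(seqs[names[0]]) if names else 0
    keep = [
        i for i in range(length)
        if sum(seqs[n][i] not in missing for n in names) >= min_taxa
    ]
    return {n: "".join(seqs[n][i] for i in keep) for n in names}
```

With `min_taxa` set to 2, a column in which only one taxon has a base is dropped, while columns covered by two or more taxa are kept.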
removegaps_snp_inf_d.pl: This script removes non-parsimony-informative SNPs from SNP datasets in FASTA format. Requires, in that order, the path to the folder containing input data and a folder where output alignments are put. Input file names must end in .fas. Example: removegaps_snp_inf_d.pl input_folder/ output_folder/
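For orientation, a site is conventionally called parsimony-informative when at least two character states each occur in at least two taxa. The Python sketch below applies that definition to an in-memory alignment. It is an assumption that only unambiguous bases (A, C, G, T) are counted as states; the deposited Perl script may treat heterozygous ambiguity codes differently. The function names are hypothetical.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def is_parsimony_informative(column):
    """True if at least two bases each occur in at least two taxa of the column."""
    counts = Counter(c for c in column.upper() if c in BASES)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(seqs):
    """Return the alignment reduced to its parsimony-informative columns."""
    names = list(seqs)
    length = len(seqs[names[0]]) if names else 0
    keep = [
        i for i in range(length)
        if is_parsimony_informative("".join(seqs[n][i] for n in names))
    ]
    return {n: "".join(seqs[n][i] for i in keep) for n in names}
```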
removegaps_snp_d.pl: This script removes all positions from SNP datasets in FASTA format that are missing in more than a specified number of taxa. It is recommended to initially set this to 0 or some other small number, then delete all empty files from your output folder, and increase by one and repeat until no more empty files are produced. Input file names must end in .fas. Requires, in that order, the path to the folder containing input alignments, the maximum number of taxa containing a gap, and a folder where output alignments are put. Example: removegaps_snp_d.pl input_folder/ 0 output_folder/
snp-pca_d.pl: This script converts an SNP dataset in FASTA format, including only SNPs with at most two alleles, to a format usable for PCA or NMDS. A majority consensus sequence of the input alignment, as a file containing only the sequence, must be created beforehand. Requires, in that order, name of the input, consensus, and output files. Input alignment file names must end in .fas. Example: snp-pca_d.pl input.fas cons.txt output.txt
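The 0/1/2 recoding used for the NMDS tables can be illustrated with a short Python sketch. It assumes that the consensus sequence contains only unambiguous bases, that any of the codes R, Y, W, S, M, K marks a heterozygous genotype, and that every other character is unknown. The function name `recode_snps` and the row-per-sample layout are hypothetical and may differ from the output of snp-pca_d.pl.

```python
HET_CODES = set("RYWSMK")
BASES = {"A", "C", "G", "T"}


def recode_snps(seqs, consensus):
    """Recode biallelic SNPs relative to a majority consensus sequence.

    0 = homozygous for the more common (consensus) allele,
    1 = heterozygous, 2 = homozygous for the less common allele,
    empty string = unknown.
    """
    rows = {}
    for name, seq in seqs.items():
        row = []
        for base, major in zip(seq.upper(), consensus.upper()):
            if base == major:
                row.append("0")
            elif base in HET_CODES:
                row.append("1")
            elif base in BASES:
                row.append("2")
            else:
                row.append("")   # gap or other missing data
        rows[name] = row
    return rows
```

Each row can then be written as a tab-delimited line, for example with `"\t".join([name] + row)`.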
extract_codpos_d.pl: This script removes the third codon position from all nucleotide alignments (FASTA format) in a folder. Requires, in that order, the paths to the folder containing input alignments and an output folder for modified alignments. Names of alignment files must end in .fas. Example: extract_codpos_d.pl input_folder/ output_folder/
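Removing third codon positions amounts to dropping every third alignment column. A minimal Python sketch, assuming each alignment starts in frame at codon position 1 (the function name is hypothetical):

```python
def first_and_second_positions(seq):
    """Drop every third character of an in-frame coding sequence or alignment row."""
    return "".join(c for i, c in enumerate(seq) if i % 3 != 2)
```

Gap characters are treated like any other column, so the rows of an alignment stay the same length after filtering.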
concat_eogs_part_d.pl: This script creates a concatenated alignment FASTA file from all alignments (FASTA format) in a folder. It also creates a partition file in NEXUS format listing each alignment as a partition. Requires, in that order, the path to the folder containing input alignments, a name for the concatenated output alignment, and the partition file. Names of alignment files must end in .fas. Example: concat_eogs_part_d.pl input_folder/ concat.fas partition.nex
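The concatenation step and the resulting partition file can be sketched in Python as below. This is not the Perl script itself: it takes alignments already loaded into dictionaries, pads taxa that are missing from a locus with gaps (an assumption), and writes charsets in the NEXUS `begin sets` form that IQ-TREE reads. The function name `concatenate` is hypothetical.

```python
def concatenate(alignments):
    """Concatenate per-locus alignments and build a NEXUS partition block.

    `alignments` maps locus name -> {taxon: aligned sequence}.
    Returns (concatenated {taxon: sequence}, partition file text).
    """
    taxa = sorted({t for aln in alignments.values() for t in aln})
    parts = {t: [] for t in taxa}
    charsets = []
    start = 1
    for locus, aln in alignments.items():
        length = len(next(iter(aln.values())))
        for t in taxa:
            # Taxa absent from this locus are padded with gaps (assumption).
            parts[t].append(aln.get(t, "-" * length))
        charsets.append((locus, start, start + length - 1))
        start += length
    lines = ["#nexus", "begin sets;"]
    lines += [f"  charset {name} = {a}-{b};" for name, a, b in charsets]
    lines.append("end;")
    concat = {t: "".join(p) for t, p in parts.items()}
    return concat, "\n".join(lines) + "\n"
```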
extract_chromo_d.pl: This script removes all non-chromosome contigs from eukaryotic genomes in FASTA format downloaded from NCBI. Requires, in that order, an input folder containing genome files and an output folder for genome files containing only chromosomes. Genome files must end in .fna. Example: extract_chromo_d.pl input_folder/ output_folder/
genomesize_d.pl: This script creates a table of genome sizes in base pairs from genome files in FASTA format. Requires, in that order, an input folder containing genome files and a name for the genome size table. Genome files must end in .fna. Example: genomesize_d.pl input_folder/ genomesize.txt
busco_diff_d.pl: This script creates tables of absolute and normalized distances and their logarithms between neighboring USCO genes in metazoan genomes based on a BUSCO output table. Requires, in that order, a table of genome sizes created with genomesize_d.pl, an input folder containing the BUSCO output tables for each genome, and an output folder for the tables. Genome files must be in FASTA format and end in .fna. Names of BUSCO output tables must be of the form taxon.tsv, where “taxon” is the taxon name listed in the genome size table. Example: busco_diff_d.pl genomesize.txt input_folder/ output_folder/
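To make the distance columns concrete, the following Python sketch computes, for genes sorted by start position within each chromosome, the distance between neighboring start positions, its logarithm, the distance normalized by genome size, and the logarithm of the normalized distance. The base-10 logarithm, the input tuple layout, and the function name are assumptions; the deposited script may differ in these details.

```python
import math
from collections import defaultdict


def neighbor_distances(genes, genome_size):
    """Distances between start positions of neighboring genes on each sequence.

    `genes` is an iterable of (gene_id, sequence_id, start) tuples.
    Returns rows of (id1, id2, distance, log10 distance,
    normalized distance, log10 normalized distance).
    """
    by_seq = defaultdict(list)
    for gene_id, seq_id, start in genes:
        by_seq[seq_id].append((start, gene_id))
    rows = []
    for seq_id in sorted(by_seq):
        ordered = sorted(by_seq[seq_id])
        for (s1, g1), (s2, g2) in zip(ordered, ordered[1:]):
            dist = s2 - s1
            if dist <= 0:
                continue  # identical start positions have no logarithm
            norm = dist / genome_size
            rows.append((g1, g2, dist, math.log10(dist), norm, math.log10(norm)))
    return rows
```

Only genes on the same chromosome or contig are compared, so a sequence carrying a single gene contributes no rows.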
busco_diff_freq_d.pl: This script creates a table of frequencies of normalized logarithmic size classes of distances between neighboring USCO genes in metazoan genomes. Requires, in that order, an input folder containing a table of distances created with busco_diff_d.pl, and a name for the output file. Example: busco_diff_freq_d.pl input_folder/ sizefreqs.txt
busco_diff_median_d.pl: This script creates a table of medians of distances between neighboring USCO genes in metazoan genomes. Requires, in that order, an input folder containing a table of distances created with busco_diff_d.pl, and a name for the output file. Example: busco_diff_median_d.pl input_folder/ usco_medians.txt
gene_random_diffs_d.pl: This script creates tables containing the median distance between neighboring genes from a random sample of genes from a set of genomes. CDS files of each genome must be in the folder structure as downloaded from NCBI. Requires, in that order, a tab-separated table containing the accession number for each genome and the number of genes to be sampled from the genome, a folder containing all sub-folders with the CDS files of each genome, the number of times the sampling is to be repeated, and an output folder for the files containing median distances for each genome. Example: gene_random_diffs_d.pl sample_table.txt cds_files/ 10000 output_folder/
gene_sample_median_d.pl: This script creates a table containing, in that order, the median distance between neighboring randomly sampled genes, the quotient of the distance between neighboring USCO genes by that number, and the proportion of random samples for which the median distance of randomly sampled genes is lower than that of USCO genes. Requires, in that order, the directory with the tables output by gene_random_diffs_d.pl, the table output by busco_diff_median_d.pl, the number of times the sampling was repeated, and a name for the output file. Example: gene_sample_median_d.pl median_tables/ usco_medians.txt 10000 median_compare.txt
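The comparison between USCOs and randomly sampled genes can be sketched in Python as a simple resampling procedure. Several details here are assumptions rather than documented behavior of the Perl scripts: genes are given as (chromosome, start) pairs, the replicate medians are summarized by their median, and the names `median_neighbor_distance` and `compare_to_random` are hypothetical.

```python
import random
from statistics import median


def median_neighbor_distance(positions):
    """Median distance between neighboring genes; `positions` holds (chromosome, start)."""
    ordered = sorted(positions)
    dists = [b[1] - a[1] for a, b in zip(ordered, ordered[1:]) if a[0] == b[0]]
    return median(dists) if dists else None


def compare_to_random(all_genes, usco_median, n_genes, n_reps, seed=0):
    """Compare a USCO median distance with medians from random gene samples.

    Returns (median of replicate medians, USCO median divided by that value,
    proportion of replicates whose median is lower than the USCO median).
    """
    rng = random.Random(seed)
    medians = []
    for _ in range(n_reps):
        m = median_neighbor_distance(rng.sample(all_genes, n_genes))
        if m is not None:
            medians.append(m)
    overall = median(medians)
    lower = sum(m < usco_median for m in medians) / len(medians)
    return overall, usco_median / overall, lower
```

A quotient close to 1 would indicate that USCOs are spaced about as widely as a random gene sample of the same size.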
usco_chromo_numbers_d.pl: This script creates, for each of a set of genomes, a table containing, in that order, the names of all chromosomes, their lengths in base pairs, the number of protein coding genes in each chromosome, and the number of complete single-copy USCO genes in it. Requires, in that order, a tab-separated table containing the accession numbers of all genomes and the name of the respective organism, a folder containing sub-folders with the sequence reports from NCBI for all genomes, a folder containing sub-folders with the CDS files, a folder containing the tables output by BUSCO listing all USCOs found in the genome, and a folder for the output files. Sequence reports and CDS files of each genome must be in the folder structure as downloaded from NCBI. Names of BUSCO output tables must contain the organism name (e.g. Homo_sapiens.tsv). Example: usco_chromo_numbers_d.pl genome_table.txt sequence_reports/ cds_files/ busco_tables/ output_folder/
usco_evenness_d.pl: This script creates a table listing, for each of a set of genomes, the evenness (defined as e^H/S, with H being the entropy of the distribution and S being number of chromosomes) of, in that order, chromosome length, number of protein-coding genes per chromosome, and number of USCO genes per chromosome. Requires, in that order, the folder containing tables output by usco_chromo_numbers_d.pl, and a name for the output file. Example: usco_evenness_d.pl chromo_numbers/ evenness.txt
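The evenness measure e^H/S defined above can be written out in a few lines of Python. The sketch takes H as the Shannon entropy with natural logarithm, which matches the exponential in the definition, and counts every chromosome in S, including chromosomes with a count of zero (an assumption). The function name is hypothetical.

```python
import math


def evenness(counts):
    """Evenness e^H / S of a distribution of counts over S chromosomes."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in props)
    return math.exp(entropy) / len(counts)
```

A perfectly even distribution gives 1, and a distribution concentrated on one of S chromosomes gives 1/S.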
usco_chi_square_d.pl: This script creates a table listing, for each of a set of genomes, the chi-square value showing the deviation of the actual distribution of USCOs between chromosomes from a distribution proportional to chromosome sizes, the chi-square value showing the deviation from a distribution proportional to the number of protein-coding genes in general, and the number of degrees of freedom. p-values can then be calculated e.g. by Microsoft Excel. Requires, in that order, the folder containing tables output by usco_chromo_numbers_d.pl, and a name for the output file. Example: usco_chi_square_d.pl chromo_numbers/ chi_square.txt
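As a worked illustration of the test statistic, the Python sketch below computes a chi-square value for observed USCO counts per chromosome against expected counts proportional to a set of weights (chromosome lengths or numbers of protein-coding genes). The function name is hypothetical, and the sketch assumes every weight is positive.

```python
def chi_square(observed, weights):
    """Chi-square statistic and degrees of freedom for counts vs. proportional weights.

    `observed` holds USCO counts per chromosome; `weights` holds the
    chromosome lengths or gene numbers that define the expected proportions.
    """
    total_obs = sum(observed)
    total_w = sum(weights)
    stat = 0.0
    for obs, w in zip(observed, weights):
        expected = total_obs * w / total_w
        stat += (obs - expected) ** 2 / expected
    return stat, len(observed) - 1
```

If SciPy is available, the p-value can be obtained from the returned values with `scipy.stats.chi2.sf(stat, df)` instead of a spreadsheet.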
busco_chromo_table_d.pl: This script creates a table of proportions with which any pairs of USCO genes occur on the same chromosome across a number of genomes. Requires, in that order, an input folder containing BUSCO output files and a name for the output table. Names of input files must end in .tsv. Example: busco_chromo_table_d.pl input_folder/ chromo_table.txt
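The pairwise co-occurrence table can be illustrated with the Python sketch below. It assumes that each genome has already been reduced to a mapping from USCO ID to chromosome and that the proportion for a pair is taken over the genomes in which both genes were found. Both points, and the function name, are assumptions rather than documented behavior of the Perl script.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(genomes):
    """Proportion of genomes in which each pair of genes shares a chromosome.

    `genomes` is a list of dicts, one per genome, mapping gene ID -> chromosome.
    """
    same = Counter()
    both = Counter()
    for gene_to_chrom in genomes:
        for a, b in combinations(sorted(gene_to_chrom), 2):
            both[(a, b)] += 1
            if gene_to_chrom[a] == gene_to_chrom[b]:
                same[(a, b)] += 1
    return {pair: same[pair] / n for pair, n in both.items()}
```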
usco_overlap_d.pl: This script produces an upper diagonal matrix with the proportion of overlapping positions between each pair of taxa in an alignment. Input alignment must be in FASTA format, with each sequence in a single line. Requires, in that order, name of a text file listing the taxa in correct order, the input alignment, and the output file. Example: usco_overlap_d.pl taxonlist.txt input.fas output.txt
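One way to compute such an upper-diagonal overlap matrix is sketched below in Python. The definition used here, the number of alignment positions at which both taxa have data divided by the alignment length, is an assumption, as are the characters treated as missing and the function name. The deposited script may use a different denominator.

```python
def overlap_matrix(taxa, seqs, missing="-?Nn"):
    """Upper-diagonal matrix of pairwise overlap proportions.

    `taxa` lists taxon names in the desired order; `seqs` maps taxon -> aligned
    sequence. Row i holds the values for taxon i versus taxa i+1, i+2, ...
    """
    rows = []
    for i, a in enumerate(taxa):
        row = []
        for b in taxa[i + 1:]:
            shared = sum(
                x not in missing and y not in missing
                for x, y in zip(seqs[a], seqs[b])
            )
            row.append(shared / len(seqs[a]))
        rows.append(row)
    return rows
```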