Different orthology inference algorithms generate similar predicted orthogroups among Brassicaceae species
Data files
Sep 11, 2024 version files 1.69 GB
- diploid_newMetrics_noDup_240221.csv (33.99 MB)
- diploid_pairwiseMetrics_240212.csv (58.76 MB)
- diploid_speciesPairsJIvalues_toCorrect_240222.csv (211.33 MB)
- full_newMetrics_noDup_240221.csv (38.34 MB)
- full_pairwiseMetrics_240209.csv (64.81 MB)
- geneOutput_230928.zip (24.76 MB)
- geneOutput_230929.zip (20.93 MB)
- mixed_speciesPairsJIvalues_toCorrect_240226.csv (664.43 MB)
- README.md (5.96 KB)
- spPairs_diploid_newMetrics_noDup_240223.csv (146.32 MB)
- spPairs_mixed_newMetrics_noDup_240227.csv (430.70 MB)
Abstract
Premise – Orthology inference is crucial for comparative genomics, and multiple algorithms have been developed to identify putative orthologs for downstream analyses. Despite the abundance of proposed solutions, including publicly available benchmarks, it is difficult to assess which tool is best for plant species, which commonly have complex genomic histories.
Methods – We explored the performance of four orthology inference algorithms – OrthoFinder, SonicParanoid, Broccoli, and OrthNet – on eight Brassicaceae genomes in two groups: one group comprising only diploids and another set comprising the diploids, two mesopolyploids, and one recent hexaploid genome.
Results – Orthogroup compositions reflect the species’ ploidy and genomic histories. Additionally, the diploid set had a higher proportion of identical orthogroups. While the diploid+higher ploidy set had a lower proportion of orthogroups with identical compositions, the average degree of similarity between the orthogroups was not different from the diploid set.
Discussion – Three algorithms – OrthoFinder, SonicParanoid, and Broccoli – are helpful for initial orthology predictions. Results from OrthNet were generally outliers but could provide detailed information about gene colinearity. With our Brassicaceae dataset, slight discrepancies were found across the orthology inference algorithms, necessitating additional analyses, such as tree inference, to fine-tune results.
README: Different orthology inference algorithms generate similar predicted orthogroups among Brassicaceae species
https://doi.org/10.5061/dryad.8sf7m0cw8
Description of the data and file structure
Primary data, scripts, and outputs are found on GitHub: https://github.com/itliao/OrthologyComparison
The first page outlines the types of files in each of the directories, and each directory has a separate README that walks through the steps of the analyses, important scripts, inputs for the scripts, and outputs generated from the scripts.
Some of the input/output files are too large to host on GitHub, and thus are found in this repository:
Gene_Composition_Comparison_Orthogroups - OUTPUTS
The following files are outputs from making comparisons between the orthogroup gene compositions from two different orthology inference algorithms. The "diploid set" refers to orthogroup inferences made with only the diploid species, and the "diploid+higher ploidy set" refers to orthogroup inferences made with all eight species comprising diploid and polyploid species.
diploid set
- diploid_pairwiseMetrics_240212.csv
  - initial output of the pairwise metrics (RS, ARS, Jaccard) from the script that compares orthogroup gene compositions. This file still contains duplicate orthogroup comparisons that need to be removed and/or combined.
- diploid_newMetrics_noDup_240221.csv
  - output after correcting for the duplicate pairwise metrics (RS, ARS, Jaccard) in orthogroup comparisons. Used for generating summary heatmap figures.
- geneOutput_230929.zip - 26499 separate files
  - zipped folder of tables summarizing the gene composition of the orthogroups, each anchored on an Arabidopsis gene. For instance, to generate and compare the orthogroup compositions anchored by gene AT1G01010, we found the orthogroup that included AT1G01010 in the output of each orthology inference algorithm. We then created a table that merged the information from all the orthology inference programs (except OrthNet), listing every gene found in the corresponding orthogroups that included AT1G01010.
  - table columns include:
    - gene copy found in the orthogroup
    - each orthology inference algorithm: BR (Broccoli), OFb (OrthoFinder-BLAST), OFd (OrthoFinder-DIAMOND), OFm (OrthoFinder-MMseqs), SPd (SonicParanoid-DIAMOND), SPm (SonicParanoid-MMseqs). OrthNet was not included.
  - each cell gives the orthogroup/cluster in which the gene (row name) was found. Empty cells indicate that the gene was missing or not included in any orthogroup/cluster in that specific orthology inference algorithm.
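The anchored table described above can be illustrated with a minimal Python sketch. It is not the script used for this dataset (those are on GitHub); the function name `anchored_table`, the input layout, and every gene or orthogroup ID other than AT1G01010 are hypothetical. The sketch follows one reading of the description: rows are all genes found in any algorithm's orthogroup that contains the anchor gene, and each cell holds the orthogroup that gene was assigned to by that algorithm, or an empty string if it was not assigned.

```python
# Column order follows the abbreviations listed in this README.
ALGORITHMS = ["BR", "OFb", "OFd", "OFm", "SPd", "SPm"]


def anchored_table(assignments, anchor):
    """Build {gene: {algorithm: orthogroup ID or ""}} anchored on one gene.

    `assignments` maps algorithm -> {gene: orthogroup ID} (hypothetical layout).
    """
    rows = set()
    for alg in ALGORITHMS:
        gene_to_og = assignments.get(alg, {})
        anchor_og = gene_to_og.get(anchor)
        if anchor_og is not None:
            # Collect every gene sharing the anchor's orthogroup in this algorithm.
            rows.update(g for g, og in gene_to_og.items() if og == anchor_og)
    # Empty string = gene missing or unassigned in that algorithm's output.
    return {
        gene: {alg: assignments.get(alg, {}).get(gene, "") for alg in ALGORITHMS}
        for gene in sorted(rows)
    }


# Invented example data; only AT1G01010 comes from the README.
assignments = {
    "BR": {"AT1G01010": "BR_1", "geneA": "BR_1", "geneB": "BR_2"},
    "OFb": {"AT1G01010": "OG0001", "geneA": "OG0001", "geneB": "OG0001"},
}
table = anchored_table(assignments, "AT1G01010")
for gene, cells in table.items():
    print(gene, [cells[alg] for alg in ALGORITHMS])
```

In this toy input, `geneB` enters the table through the OrthoFinder-BLAST orthogroup even though Broccoli places it in a different cluster, which is the kind of discrepancy the tables are meant to expose.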
diploid+higher ploidy set
- full_pairwiseMetrics_240209.csv
  - initial output of the pairwise metrics (RS, ARS, Jaccard) from the script that compares orthogroup gene compositions. This file still contains duplicate orthogroup comparisons that need to be removed and/or combined.
- full_newMetrics_noDup_240221.csv
  - output after correcting for the duplicate pairwise metrics (RS, ARS, Jaccard) in orthogroup comparisons. Used for generating summary heatmap figures.
- geneOutput_230928.zip - 26602 separate files
  - zipped folder of tables summarizing the gene composition of the orthogroups, each anchored on an Arabidopsis gene. For instance, to generate and compare the orthogroup compositions anchored by gene AT1G01010, we found the orthogroup that included AT1G01010 in the output of each orthology inference algorithm. We then created a table that merged the information from all the orthology inference programs (except OrthNet), listing every gene found in the corresponding orthogroups that included AT1G01010.
  - table columns include:
    - gene copy found in the orthogroup
    - each orthology inference algorithm: BR (Broccoli), OFb (OrthoFinder-BLAST), OFd (OrthoFinder-DIAMOND), OFm (OrthoFinder-MMseqs), SPd (SonicParanoid-DIAMOND), SPm (SonicParanoid-MMseqs). OrthNet was not included.
  - each cell gives the orthogroup/cluster in which the gene (row name) was found. Empty cells indicate that the gene was missing or not included in any orthogroup/cluster in that specific orthology inference algorithm.
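Of the pairwise metrics reported in the files above, the Jaccard index is the simplest to illustrate. The sketch below assumes each orthogroup is treated as a set of gene IDs; the function name `jaccard_index` and the example genes other than AT1G01010 are hypothetical, and the handling of two empty sets is a convention chosen here, not something stated in this README.

```python
def jaccard_index(genes_a, genes_b):
    """Jaccard index: |intersection| / |union| of two gene sets."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Assumed convention: two empty orthogroups share nothing to compare.
        return 0.0
    return len(a & b) / len(union)


# Invented orthogroup compositions from two algorithms for the same anchor gene.
og_algorithm_1 = {"AT1G01010", "geneA", "geneB"}
og_algorithm_2 = {"AT1G01010", "geneA"}

score = jaccard_index(og_algorithm_1, og_algorithm_2)  # 2 shared genes / 3 genes in total
print(round(score, 3))
```

A value of 1 means the two algorithms produced orthogroups with identical gene compositions, and lower values indicate less overlap.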
Gene_Composition_Comparison_Species_Pairs - OUTPUTS
The following files are outputs from making comparisons between the orthogroup gene compositions from the BASELINE comparison (OrthoFinder-BLAST-MCL) and one of the orthology inference algorithms for each pair of species. The "diploid set" refers to orthogroup inferences made with only the diploid species, and the "diploid+higher ploidy set" refers to orthogroup inferences made with all eight species comprising diploid and polyploid species.
diploid set
- diploid_speciesPairsJIvalues_toCorrect_240222.csv
  - initial output from the script that compares SPECIES PAIRS orthogroup gene compositions. However, there are duplicate comparisons that need to be removed and/or combined. The metric calculated was the Jaccard Index (or JI).
- spPairs_diploid_newMetrics_noDup_240223.csv
  - output after correcting for the duplicate orthogroup comparisons. Used for generating summary heatmap figures.
diploid+higher ploidy set
- mixed_speciesPairsJIvalues_toCorrect_240226.csv
  - initial output from the script that compares SPECIES PAIRS orthogroup gene compositions. However, there are duplicate comparisons that need to be removed and/or combined. The metric calculated was the Jaccard Index (or JI).
- spPairs_mixed_newMetrics_noDup_240227.csv
  - output after correcting for the duplicate orthogroup comparisons. Used for generating summary heatmap figures.
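The species-pair comparison and the duplicate correction can be sketched as follows. This is an illustration under stated assumptions, not the published script: it assumes the Jaccard Index is computed after restricting both orthogroups to genes from the two species in the pair, and that a duplicate is a repeated row for the same algorithm, species pair, and orthogroup pair. The function names, the row keys (`algorithm`, `species_pair`, `baseline_og`, `other_og`, `JI`), and all species and gene labels other than AT1G01010 are hypothetical and may not match the actual CSV columns.

```python
def species_pair_jaccard(baseline_og, other_og, gene_species, pair):
    """Jaccard Index restricted to genes from the two species in `pair`."""
    keep = set(pair)
    a = {g for g in baseline_og if gene_species.get(g) in keep}
    b = {g for g in other_og if gene_species.get(g) in keep}
    union = a | b
    # None marks a pair with no genes from either species (assumed convention).
    return len(a & b) / len(union) if union else None


def drop_duplicate_comparisons(rows):
    """Keep the first row for each (algorithm, species pair, orthogroup pair)."""
    seen, unique = set(), []
    for row in rows:
        key = (row["algorithm"], row["species_pair"], row["baseline_og"], row["other_og"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented gene-to-species map and orthogroup compositions.
gene_species = {"AT1G01010": "A_thaliana", "x1": "species_X", "x2": "species_X", "y1": "species_Y"}
baseline_og = {"AT1G01010", "x1", "y1"}  # baseline (OrthoFinder-BLAST-MCL) orthogroup
other_og = {"AT1G01010", "x1", "x2"}     # orthogroup from another algorithm

ji_x = species_pair_jaccard(baseline_og, other_og, gene_species, ("A_thaliana", "species_X"))
ji_y = species_pair_jaccard(baseline_og, other_og, gene_species, ("A_thaliana", "species_Y"))

rows = [
    {"algorithm": "BR", "species_pair": "A_thaliana-species_X", "baseline_og": "OG1", "other_og": "BR_1", "JI": ji_x},
    {"algorithm": "BR", "species_pair": "A_thaliana-species_X", "baseline_og": "OG1", "other_og": "BR_1", "JI": ji_x},
    {"algorithm": "BR", "species_pair": "A_thaliana-species_Y", "baseline_og": "OG1", "other_og": "BR_1", "JI": ji_y},
]
unique = drop_duplicate_comparisons(rows)
print(len(rows), "->", len(unique))
```

The sketch only removes exact repeats; the README also mentions combining duplicates, which would need a rule for merging rows that this example does not attempt.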
Methods
We tested seven variations of four orthology inference algorithms: OrthoFinder, SonicParanoid, Broccoli, and OrthNet. We used publicly available genomes and associated protein files from eight Brassicaceae species and compared the results from these algorithms on two sets of species: one consisting of five diploid species, and one consisting of all eight species (the five diploids, two mesopolyploids, and one hexaploid). We examined the similarities and differences in the results to understand the performance of these algorithms in inferring orthologs and orthogroups with and without genomically complex species.