Data for Schultz et al. (2023) _Ancient gene linkages support ctenophores as sister to other animals_.
=====================

Brief Description
-----------------

This dataset contains the software, genomes, analyses, and results for the study by Schultz et al. (2023), _Ancient gene linkages support ctenophores as sister to other animals_. This study involved sequencing and assembling chromosome-scale genomes of three unicellular organisms, one ctenophore, and two sponge species. Using these and publicly available genomes, the authors presented evidence that ctenophores are the sister clade to all other animals. The argument rests on the discovery of ancestral chromosomal linkage groups (ALGs) that are conserved between unicellular organisms and animals, the configuration of these linkage groups in animal genomes, and the evolutionary implications of the chromosomal configurations.

Provenance for this README
-----------------------------------

* File name: README\_v20230308.txt
* Authors: Darrin T. Schultz

Dataset Version and Release History
-----------------------------------

* First Published Version:
    * Number: v20230308
    * Date: March 8th, 2023
    * Persistent identifier: DOI: https://doi.org/10.5061/dryad.dncjsxm47

Associated Publication
-----------------------------------

* Article Title: _Ancient gene linkages support ctenophores as sister to other animals_
* Article Authors: Darrin T. Schultz (1,2,3\*), Steven H.D. Haddock (2,4), Jessen V. Bredeson (5), Richard E. Green (3), Oleg Simakov (1\*), Daniel S. Rokhsar (5,6,7\*)
* Publication Date: Not yet published
* Journal: Not yet published
* Issue: Not yet published

(1) Department for Neurosciences and Developmental Biology, University of Vienna, Vienna 1010, Austria.
(2) Monterey Bay Aquarium Research Institute, Moss Landing, California 95039, USA.
(3) Department of Biomolecular Engineering and Bioinformatics, University of California, Santa Cruz, California 95064, USA.
(4) Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, California 95064, USA.
(5) Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA.
(6) Molecular Genetics Unit, Okinawa Institute of Science and Technology Graduate University, 1919-1, Tancha, Onna, Okinawa, 904-0495 Japan.
(7) Chan Zuckerberg Biohub, 499 Illinois St., San Francisco, CA 94158, USA.

\*Corresponding authors:

Corresponding Author Contact Information
----------------------------------------

* Name: Darrin T. Schultz
* Affiliations: University of Vienna
* ORCID ID: https://orcid.org/0000-0003-1190-1122
* Email: darrin.schultz@univie.ac.at
* Alternate Email: dschultz@mbari.org
* Alternate Email 2: dts@ucsc.edu
* Alternative Contact: postdoctoral PI
    * Name: Oleg Simakov
    * Affiliations: University of Vienna
    * ORCID ID: https://orcid.org/0000-0002-3585-4511
    * Email: oleg.simakov@univie.ac.at
* Alternative Contact: senior author
    * Name: Daniel S. Rokhsar
    * Affiliations: University of California, Berkeley
    * Affiliation 2: Okinawa Institute of Science and Technology
    * Affiliation 3: Chan Zuckerberg Biohub
    * ORCID ID: https://orcid.org/0000-0002-8704-2224
    * Email: DSRokhsar@gmail.com
* Contributor ORCID IDs:
    * Steven H.D. Haddock: https://orcid.org/0000-0001-9420-4482
    * Jessen V. Bredeson: https://orcid.org/0000-0001-5489-8512
    * Richard E. Green: https://orcid.org/0000-0003-0516-5827

Dataset Attribution and Usage
-----------------------------

* Dataset Title: Data for Schultz et al. (2023) _Ancient gene linkages support ctenophores as sister to other animals_.
* Persistent Identifier: https://doi.org/10.5061/dryad.dncjsxm47
* Dataset Contributors:
    * Creator: Darrin T. Schultz
    * Contributor: Jessen V.
Bredeson
* Date of Issue: 20230308
* Publisher: Monterey Bay Aquarium Research Institute
* License: Use of these data is covered by the following license:
    * Title: CC0 1.0 Universal (CC0 1.0)
    * Specification: https://creativecommons.org/publicdomain/zero/1.0/
* Suggested Citations:
    * Dataset citation:
        > Schultz, D.T. Data for the article _Ancient gene linkages support ctenophores as sister to other animals_. Dryad, Dataset, https://doi.org/10.5061/dryad.dncjsxm47
    * Corresponding publication:
        > D.T. Schultz, S.H.D. Haddock, J.V. Bredeson, R.E. Green, O. Simakov, D.S. Rokhsar. _Ancient gene linkages support ctenophores as sister to other animals_. Unpublished.
    * Associated software static repository: [https://doi.org/10.5281/zenodo.7707938](https://doi.org/10.5281/zenodo.7707938)
    * Associated software long-term development: [https://github.com/conchoecia/odp](https://github.com/conchoecia/odp)

Funding Sources
---------------

* David and Lucile Packard Foundation
* Monterey Bay Aquarium Research Institute
* National Science Foundation, Award: NSF GRFP DGE 1339067 to D.T.S.
* National Science Foundation, Award: NSF DEB-1542679 to S.H.D.H.
* H2020 European Research Council, Award: 945026 to O.S.
* Okinawa Institute of Science and Technology
* Chan Zuckerberg Initiative
* University of California Berkeley, Award: Marthella Foskett Brown Chair in Biology

- - -

Description of the dataset
==========================

Summary Metrics
---------------

* File count: 3
* Uncompressed file count: 110881
* Total file size: 7.63 Gb, compressed
    * file size `genomes.tar.gz`: 2.4 Gb
    * file size `supplementary_information.tar.gz`: 5.4 Gb
* File formats: `.tar.gz`, `.fasta`, `.chrom`, `.pdf`, `.yaml`, `.rbh`, `.sh`

Table of Contents
-----------------

* README_20230308.md
* genomes.tar.gz
* supplementary_information.tar.gz

Setup
-----

These instructions assume a knowledge of the unix command line.
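
The unpacking step in this section can also be scripted. The following is a minimal Python sketch, assuming both archives have already been downloaded into the same directory. The function name `unpack_archives` is illustrative only and is not part of this dataset or of the odp software. It opens each `.tar.gz` archive in turn and extracts it in place, so that the relative directory structure is preserved.

```python
import tarfile
from pathlib import Path


def unpack_archives(directory="."):
    """Extract every .tar.gz archive found in `directory` into that directory.

    Returns the names of the archives that were extracted, in sorted order.
    This helper is a hypothetical convenience, not part of the dataset.
    """
    directory = Path(directory)
    extracted = []
    for archive in sorted(directory.glob("*.tar.gz")):
        # Each archive is opened and extracted separately.
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extractall(path=directory)
        extracted.append(archive.name)
    return extracted
```

Calling `unpack_archives(".")` from the download directory would extract `genomes.tar.gz` and `supplementary_information.tar.gz` side by side. Only extract archives obtained from a source you trust.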
* Unpacking instructions:
    * Download both `genomes.tar.gz` and `supplementary_information.tar.gz` to the same directory to preserve the directory structure.
    * Navigate to the directory containing the `genomes.tar.gz` and `supplementary_information.tar.gz` files with the `cd` command.
    * Then, extract both archives with the command `unp *.tar.gz`, or with the command `for f in *.tar.gz; do tar -xvzf "$f"; done` (`tar` reads one archive per invocation, so the archives are extracted in a loop).
* Recommended software/tools: [the odp software available here](https://doi.org/10.5281/zenodo.7707938)

Software
--------

There are some associated software files hosted on Zenodo here: [https://doi.org/10.5281/zenodo.7707938](https://doi.org/10.5281/zenodo.7707938). That repository contains stable releases of the software package, [`odp`](https://github.com/conchoecia/odp) (short for Oxford dot plot), developed for this publication. There is a version of the software from the initial submission of this manuscript in May 2022, and a more recent version from February 2023. The `README.md` files in these two directories hosted on Zenodo detail the usage of the software. There are additional python scripts on the Zenodo repository that are necessary to generate the data linked below for Supplementary Information 10 and Supplementary Information 11.

Data General Description
------------------------

There are two main directories in this dataset. The `genomes.tar.gz` file contains a directory holding all of the genomes assembled _de novo_ for this manuscript, as well as the genomes of other organisms that have been formatted specifically for this study. For two species, _Nematostella vectensis_ and _Ephydatia muelleri_, we include bash scripts to download and format the genome assembly and protein files as they are not publicly available from an open-access source such as NCBI or GigaDB. For all other genomes we have included the already-published genomes with chromosome headers modified for ease of analysis, the protein fasta files, and our copies of the annotations.
Currently, this repository is the only host for the genomes of the unicellular organisms published in this study: the genomes of _Capsaspora owczarzaki_, _Creolimax fragrantissima_, and _Salpingoeca rosetta_. The genomes of the ctenophore and two sponge species published with this study are also on NCBI - see the information below.

The `supplementary_information.tar.gz` file contains directories with config files and analysis results from the fifteen supplementary information sections of the manuscript. In some cases, where results are all contained in a figure or extended data figure in the manuscript, the analysis results are omitted, and only the requisite config file is included. The software to complete these analyses is located in the associated [Zenodo repository](https://doi.org/10.5281/zenodo.7707938), specifically the odp software and the scripts necessary to complete the analyses for Supplementary Information 10 and 11.

- - -

File/Folder Details
===================

Details for `genomes.tar.gz`
----------------------------------------------

All files reside in a directory called `for_odp`. This directory is here to maintain the relative file structure used in some of the config files. This directory contains the genome assemblies used in this manuscript, both those assembled for this study, and those previously released. Each directory contains:

* A `.fasta` file: This is the genome assembly. Chromosome-scale scaffolds of the assemblies are named using the organism's three-letter code (defined below) as a prefix, and the chromosome number as a suffix. For example, the scaffold of _Homo sapiens_ chromosome 3 will have `>HSA3` as the header.
* A `.pep` file: This file contains the protein sequences encoded in this organism's genome.
* A `.chrom` file: This file contains the coordinates of the proteins in this organism's genome.
Please see the `README.md` file of [odp v0.3.0](https://zenodo.org/record/7707938/files/odp-0.3.0.zip?download=1) in the [zenodo repository](https://doi.org/10.5281/zenodo.7707938) for a detailed description of this file format.

Please see the manuscript for additional information about these genomes.

Copyright Statement for NCBI: ["NCBI itself places no restrictions on the use or distribution of the data contained therein. Nor do we accept data when the submitter has requested restrictions on reuse or redistribution"](https://www.ncbi.nlm.nih.gov/home/about/policies/) Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.

* `bolinopsis/`
    * `Binomial Name:` _Bolinopsis microptera_
    * `Three-letter Code:` BMI
    * `Classification:` lobate ctenophore
    * `Common Name:` none
    * `Genome Assembly Provenance:` Sequenced, assembled, scaffolded, and manually curated to chromosome-scale for this manuscript
    * `Genome Assembly License:` Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Protein File Provenance:` The annotation is described in this manuscript.
    * `Protein File License:` Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Genome citation:` See above, Schultz et al. (2023).
    * `Associated BioProject:` [PRJNA818620](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818620)
* `branchiostoma/`
    * `Binomial Name:` _Branchiostoma floridae_
    * `Three-letter Code:` BFL
    * `Classification:` non-vertebrate chordate, cephalochordate, amphioxus
    * `Common Name:` Florida lancelet
    * `Genome Assembly Provenance:` [Available on NCBI: GCA_000003815.2](https://www.ncbi.nlm.nih.gov/assembly/GCF_000003815.2/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` [Available on NCBI](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/815/GCF_000003815.2_Bfl_VNyyK/GCF_000003815.2_Bfl_VNyyK_protein.faa.gz)
    * `Protein File License:` See NCBI Copyright Statement above.
    * `Genome Citation:` [Simakov, Oleg, et al. "Deeply conserved synteny resolves early events in vertebrate evolution." Nature Ecology & Evolution 4.6 (2020): 820-830.](http://dx.doi.org/10.1038/s41559-020-1156-z)
    * `Associated BioProject:` [PRJNA818620](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818620)
* `capsaspora/`
    * `Note:` This directory contains three subdirectories for three separate assembly versions for _Capsaspora owczarzaki_. Please see the supplementary information of the manuscript for a description of the differences between these assembly versions.
    * `Note:` The original assembly details are below. For this study we scaffolded the already-published genome to chromosome scale with Hi-C reads we generated. The BioProject for the Hi-C reads generated for this study is located here: [PRJNA818537](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818537). The genome assembly files contained in this directory are the chromosome-scale versions that we generated.
    * `Binomial Name:` _Capsaspora owczarzaki_
    * `Three-letter Code:` COW
    * `Classification:` filasterean amoeba, outgroup of animals
    * `Common Name:` None
    * `Genome Assembly Provenance:` [Available on NCBI: GCF_000151315.2](https://www.ncbi.nlm.nih.gov/assembly/GCF_000151315.2/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` [Available on NCBI](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/151/315/GCF_000151315.2_C_owczarzaki_V2/GCF_000151315.2_C_owczarzaki_V2_protein.faa.gz)
    * `Protein File License:` See NCBI Copyright Statement above.
    * `Genome Citation:` [Denbo, Seitaro, et al. "Revision of the Capsaspora genome using read mating information adjusts the view on premetazoan genome."
Development, Growth & Differentiation 61.1 (2019): 34-42.](https://doi.org/10.1111/dgd.12587)
    * `Associated BioProjects:` [PRJNA193613](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA193613) and [PRJNA20341](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA20341)
* `chondrisia_reniformis/`
    * `Note:` The _Petrosia_ and _Chondrosia_ genomes from the Aquatic Symbiosis Genomics project of the Tree of Life Programme, Wellcome Sanger Institute, were funded by the Gordon and Betty Moore Foundation and the Wellcome Trust.
    * `Binomial Name:` _Chondrosia reniformis_
    * `Three-letter Code:` CRE
    * `Classification:` Demosponge
    * `Common Name:` Kidney sponge
    * `Genome Assembly Provenance:` [Available on NCBI: GCA_947172415.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_947172415.1/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` The annotation is described in the manuscript, but the protein sequences were generated _de novo_ for this manuscript.
    * `Genome Citation:` None
    * `Associated BioProjects:` [PRJEB55902](https://www.ncbi.nlm.nih.gov/bioproject/879045)
* `cladorhizid_v0.6_hapA/`
    * `Binomial Name:` None
    * `Three-letter Code:` CLA
    * `Classification:` bioluminescent cladorhizid demosponge, genome haplotype A
    * `Common Name:` None
    * `Genome Assembly Provenance:` Sequenced, assembled, scaffolded, and manually curated to chromosome-scale for this manuscript
    * `Genome Assembly License:` Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Protein File Provenance:` The annotation is described in this manuscript.
    * `Protein File License:` Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Genome citation:` See above, Schultz et al. (2023).
    * `Associated BioProjects:` [PRJNA818630](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818630)
* `cladorhizid_v0.6_hapB/`
    * `Binomial Name:` None
    * `Three-letter Code:` CLA
    * `Classification:` bioluminescent cladorhizid demosponge, genome haplotype B
    * `Common Name:` None
    * `Genome Assembly Provenance:` Sequenced, assembled, scaffolded, and manually curated to chromosome-scale for this manuscript
    * `Genome Assembly License:` Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Protein File Provenance:` The annotation is described in this manuscript.
    * `Protein File License:` Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Genome citation:` See above, Schultz et al. (2023).
    * `Associated BioProjects:` [PRJNA818630](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818630)
* `creolimax/`
    * `Note:` The original assembly details are below. For this study we scaffolded the already-published genome to chromosome scale with Hi-C reads we generated. The BioProject for the Hi-C reads generated for this study is located here: [PRJNA818537](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818537). The genome assembly files contained in this directory are the chromosome-scale versions that we generated.
    * `Note:` The protein file is not included in this directory because the proteins are available only on FigShare under an [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license. You must run the bash script in this directory to download the protein file.
    * `Binomial Name:` _Creolimax fragrantissima_
    * `Three-letter Code:` CFR
    * `Classification:` ichthyosporean, outgroup of animals
    * `Common Name:` None
    * `Genome Assembly Provenance:` [Available on NCBI: GCA_002024145.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_002024145.1/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` We do not provide a protein file in this Dryad repository.
    * `Protein File License:` We do not provide a protein file in this Dryad repository.
    * `Genome Citation:` [De Mendoza, Alex, et al. "Complex transcriptional regulation and independent evolution of fungal-like traits in a relative of animals." elife 4 (2015): e08904.](http://dx.doi.org/10.7554/eLife.08904)
    * `Associated BioProjects:` [PRJNA377365](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA377365)
* `ephydatia/`
    * `Note:` The protein file and genome assemblies are not included in this directory because they are only available from [EphyBase](https://spaces.facsci.ualberta.ca/ephybase/). You must run the bash script in this directory to download the protein and genome fasta files.
    * `Binomial Name:` _Ephydatia muelleri_
    * `Three-letter Code:` EMU
    * `Classification:` freshwater demosponge
    * `Common Name:` None
    * `Genome Assembly Provenance:` We do not provide a genome assembly in this Dryad repository.
    * `Genome Assembly License:` We do not provide a genome assembly in this Dryad repository.
    * `Protein File Provenance:` We do not provide a protein file in this Dryad repository.
    * `Protein File License:` We do not provide a protein file in this Dryad repository.
    * `Genome Citation:` [Kenny, Nathan J., et al. "Tracing animal genomic evolution with the chromosomal-level assembly of the freshwater sponge Ephydatia muelleri."
Nature communications 11.1 (2020): 3676.](http://dx.doi.org/10.1038/s41467-020-17397-w)
    * `Associated BioProjects:` [PRJNA579531](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA579531/)
* `hexactinellid/`
    * `Binomial Name:` None
    * `Three-letter Code:` HEX
    * `Classification:` tulip hexactinellid
    * `Common Name:` None
    * `Genome Assembly Provenance:` Sequenced, assembled, scaffolded, and manually curated to chromosome-scale for this manuscript
    * `Genome Assembly License:` Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Protein File Provenance:` The annotation is described in this manuscript.
    * `Protein File License:` Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Genome citation:` See above, Schultz et al. (2023).
    * `Associated BioProjects:` [PRJNA903214](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA903214)
* `hormiphora/`
    * `Binomial Name:` _Hormiphora californensis_
    * `Three-letter Code:` HCA
    * `Classification:` pleurobrachiid ctenophore
    * `Common Name:` California sea gooseberry
    * `Genome Assembly Provenance:` [Available on NCBI: GCA_020137815.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_020137815.1/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` [We are the original sequence authors - also available on Zenodo](https://doi.org/10.5281/zenodo.4074309)
    * `Protein File License:` These sequences have not yet been published under a license. Now [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) with the publication of this Dryad repository.
    * `Genome Citation:` [Schultz, D. T., Francis, W. R., McBroome, J. D., Christianson, L. M., Haddock, S. H., & Green, R. E. (2021). A chromosome-scale genome assembly and karyotype of the ctenophore Hormiphora californensis.
G3, 11(11), jkab302.](https://doi.org/10.1093/g3journal/jkab302)
    * `Associated BioProject:` [PRJNA576068](https://www.ncbi.nlm.nih.gov/bioproject/576068)
* `human/`
    * `Binomial Name:` _Homo sapiens_
    * `Three-letter Code:` HSA
    * `Classification:` vertebrate mammal
    * `Common Name:` human
    * `Genome Assembly Provenance:` [Available on NCBI: GRCh37](https://www.ncbi.nlm.nih.gov/grc/human)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` [Available on NCBI](https://www.ncbi.nlm.nih.gov/grc/human)
    * `Protein File License:` See NCBI Copyright Statement above.
    * `Genome Citation:` [International Human Genome Sequencing Consortium. "Initial sequencing and analysis of the human genome." Nature 409.6822 (2001): 860-921.](https://www.nature.com/articles/35057062)
* `hydra/`
    * `Binomial Name:` _Hydra vulgaris_
    * `Three-letter Code:` HVU
    * `Classification:` cnidarian
    * `Common Name:` common _Hydra_
    * `Genome Assembly Provenance:` [Available on NCBI: GCF_022113875.1](https://www.ncbi.nlm.nih.gov/assembly/GCF_022113875.1/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` [Available on NCBI](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/022/113/875/GCF_022113875.1_Hydra_105_v3/GCF_022113875.1_Hydra_105_v3_protein.faa.gz)
    * `Protein File License:` See NCBI Copyright Statement above.
    * `Genome Citation:` [Simakov, Oleg, et al. "Deeply conserved synteny and the evolution of metazoan chromosomes."
Science advances 8.5 (2022): eabi5884.](http://dx.doi.org/10.1126/sciadv.abi5884)
    * `Associated BioProject:` [PRJNA814716](https://www.ncbi.nlm.nih.gov/bioproject/814716)
* `nematostella/`
    * `Note:` We provide no genome assembly or protein fasta file in this repository, as the data are only publicly available on [SIMRbase](https://genomes.stowers.org/starletseaanemone).
    * `Binomial Name:` _Nematostella vectensis_
    * `Three-letter Code:` NVE
    * `Classification:` cnidarian
    * `Common Name:` starlet sea anemone
    * `Genome Assembly Provenance:` We do not provide a genome assembly in this Dryad repository.
    * `Genome Assembly License:` We do not provide a genome assembly in this Dryad repository.
    * `Protein File Provenance:` We do not provide a protein file in this Dryad repository.
    * `Protein File License:` We do not provide a protein file in this Dryad repository.
    * `Genome Citation:` [Zimmermann, Bob, et al. "Sea anemone genomes reveal ancestral metazoan chromosomal macrosynteny." BioRxiv (2020): 2020-10.](https://www.biorxiv.org/content/10.1101/2020.10.30.359448v1.full). The genome is not publicly available on NCBI, but is available here: [https://genomes.stowers.org/starletseaanemone](https://genomes.stowers.org/starletseaanemone)
    * `Associated BioProject:` [PRJNA667495](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA667495)
* `pecten_maximus/`
    * `Binomial Name:` _Pecten maximus_
    * `Three-letter Code:` PMA
    * `Classification:` scallop, mollusca
    * `Common Name:` great scallop
    * `Genome Assembly Provenance:` [Available on NCBI: GCF\_902652985.1](https://www.ncbi.nlm.nih.gov/assembly/GCF_902652985.1/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` [Available on GigaDB](https://www.doi.org/10.5524/100726)
    * `Protein File License:` [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) is used by all [GigaDB](http://gigadb.org/site/term) repositories (with exceptions, not applicable in this case).
    * `Genome Citation:` [Kenny, Nathan J., et al. "The gene-rich genome of the scallop Pecten maximus." Gigascience 9.5 (2020): giaa037.](http://dx.doi.org/10.1093/gigascience/giaa037)
    * `Associated BioProject:` [PRJNA667495](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA667495)
* `Petrosia_ficiformis/`
    * `Note:` The _Petrosia_ and _Chondrosia_ genomes from the Aquatic Symbiosis Genomics project of the Tree of Life Programme, Wellcome Sanger Institute, were funded by the Gordon and Betty Moore Foundation and the Wellcome Trust.
    * `Binomial Name:` _Petrosia ficiformis_
    * `Three-letter Code:` PFA
    * `Classification:` Demosponge
    * `Common Name:` stony sponge
    * `Genome Assembly Provenance:` [Available on NCBI: GCA_947172415.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_947172415.1/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` The annotation is described in the manuscript, but the protein sequences were generated _de novo_ for this manuscript.
    * `Genome Citation:` None
    * `Associated BioProjects:` [PRJEB55902](https://www.ncbi.nlm.nih.gov/bioproject/879045)
* `rhopilema_li/`
    * `Binomial Name:` _Rhopilema esculentum_
    * `Three-letter Code:` RES
    * `Classification:` cnidarian
    * `Common Name:` fire jellyfish
    * `Genome Assembly Provenance:` We reconstructed the chromosome-scale sequences publicly available on [GigaDB](https://www.doi.org/10.5524/100720)
    * `Genome Assembly License:` [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) is used by all [GigaDB](http://gigadb.org/site/term) repositories (with exceptions, not applicable in this case)
    * `Protein File Provenance:` [GigaDB](https://www.doi.org/10.5524/100720)
    * `Protein File License:` [CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en) is used by all [GigaDB](http://gigadb.org/site/term) repositories (with exceptions, not applicable in this case)
    * `Genome Citation:` [Li, Yunfeng, et al. "Chromosome-level reference genome of the jellyfish Rhopilema esculentum." GigaScience 9.4 (2020): giaa036.](http://dx.doi.org/10.1093/gigascience/giaa036)
    * `Associated BioProject:` [PRJNA512552](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA512552)
* `salpingoeca/`
    * `Note:` The original assembly details are below. For this study we scaffolded the already-published genome to chromosome scale with Hi-C reads we generated. The BioProject for the Hi-C reads generated for this study is located here: [PRJNA818537](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818537). The genome assembly files contained in this directory are the chromosome-scale versions that we generated.
    * `Note:` The protein file is not included in this directory because the proteins are available only on FigShare under an [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license. You must run the bash script in this directory to download the protein file.
    * `Binomial Name:` _Salpingoeca rosetta_
    * `Three-letter Code:` SRO
    * `Classification:` choanoflagellate
    * `Common Name:` None
    * `Genome Assembly Provenance:` [Available on NCBI: GCF_000188695.1](https://www.ncbi.nlm.nih.gov/assembly/GCF_000188695.1/)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` [Available on NCBI](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/695/GCF_000188695.1_Proterospongia_sp_ATCC50818/GCF_000188695.1_Proterospongia_sp_ATCC50818_protein.faa.gz)
    * `Protein File License:` See NCBI Copyright Statement above.
    * `Genome Citation:` [Fairclough, Stephen R., et al. "Premetazoan genome evolution and the regulation of cell differentiation in the choanoflagellate Salpingoeca rosetta."
Genome biology 14.2 (2013): 1-15.](http://dx.doi.org/10.1186/gb-2013-14-2-r15)
    * `Associated BioProjects:` [PRJNA193541](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA193541)
* `trichoplax/`
    * `Binomial Name:` _Trichoplax adhaerens_
    * `Three-letter Code:` TAD
    * `Classification:` placozoan
    * `Common Name:` None
    * `Genome Assembly Provenance:` [Available on NCBI: GCF_000150275.1](https://www.ncbi.nlm.nih.gov/assembly/173428)
    * `Genome Assembly License:` See NCBI Copyright Statement above.
    * `Protein File Provenance:` [Available on NCBI](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/150/275/GCF_000150275.1_v1.0/GCF_000150275.1_v1.0_protein.faa.gz)
    * `Protein File License:` See NCBI Copyright Statement above.
    * `Genome Citation:` [Srivastava, Mansi, et al. "The Trichoplax genome and the nature of placozoans." Nature 454.7207 (2008): 955-960.](http://dx.doi.org/10.1038/nature07191)
    * `Associated BioProject:` [PRJNA30931](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA30931) and [PRJNA12874](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA12874)

Details for `supplementary_information.tar.gz`
----------------------------------------------

The `supplementary_information.tar.gz` file contains directories with config files and analysis results from the fifteen supplementary information sections of the manuscript. The titles of the Supplementary Information sections are listed below, each with its corresponding top-level directory.

```
1. Genome sequencing, assembly, and annotation - Bolinopsis microptera
  - SI_section_01_ctenophore_genome/
2. Genome sequencing, assembly, and annotation - Sponge genomes
  - SI_section_02_sponge_genomes/
3. Genome sequencing, assembly, and annotation - Unicellular Outgroup Species
  - SI_section_03_unicell_genomes/
4. Chromosomal tectonic events and their phylogenetic implications
  - SI_section_04_phylogenetic_logic/
5. ODP: software to perform macrosynteny analyses
  - SI_section_05_odp_software/
6. Validating the methodology and ODP software with other clades
  - SI_section_06_OdpTesting/
7. Macrosynteny analyses of animals and their close unicellular relatives
  - SI_section_07_SyntenyAnalyses/
8. Identification of gene groups linked since the ancestor of the Filozoa
  - SI_section_08_UnicellALGs/
9. Extension of gene linkage groups to other metazoan species
  - SI_section_09_ALGextension/
10. OrthoFinder analysis recovers support for the ctenophore-sister hypothesis
  - SI_section_10_Orthofinder/
11. Detecting conserved macrosynteny between highly rearranged genomes
  - SI_section_11_conservation_score/
12. GO enrichment analysis of ALGs conserved in Filozoans
  - SI_section_12_GO_enrichment/
13. Entropy of gene mixing analysis
  - SI_section_13_GeneMixingAnalysis/
14. Analyzing chromosomal tectonic events in a Bayesian phylogenetic framework
  - SI_section_14_BayesianMethodology/
15. Null hypothesis testing of the ctenophore-sister topology
  - SI_section_15_HypothesisTesting/
```

The directory structure, with content descriptions, is below.

* SI\_section\_06\_OdpTesting
    * Contains 1327 files pertaining to Supplementary Information 6 of the manuscript.
    * Section Title: "Validating the methodology and ODP software with other clades"
    * The analyses carried out in this directory are described in the Supplementary Information of the manuscript. The file contents are described briefly here for orientation.
    * The directory titled `step1_4wayrbh` contains a snakemake config file to run `odp_nway_rbh` on the species combinations EMU-RES-PMA-BFL and EMU-RES-HVU-PMA. The final results of this analysis are the 4-way orthologs shared among these species in the directory `SI_section_06_OdpTesting/step1_4wayrbh/odp_nway_rbh/step3-unwrap/`. The other files in the repository are intermediate files.
  * The directory titled `step2_simulations` contains subdirectories with the snakemake config files needed to run `odp_genome_mutation_analysis` on the results of `step1_4wayrbh`, together with the results of those runs. The final results are present in text form and as PDFs. Search for these using `find . -name '*.pdf'`. The other files are intermediate files.

```
.
├── SI_section_01_ctenophore_genome
│   └── README.md # BioProject and genome assembly accession number for Bolinopsis microptera.
│
├── SI_section_02_sponge_genomes
│   └── README.md # BioProject and genome assembly accession numbers for the bioluminescent cladorhizid sponge
│
├── SI_section_03_unicell_genomes
│   └── README.md # Paths of unicellular genome assemblies and the Hi-C BioProject accession number
│
├── SI_section_04_phylogenetic_logic
│   └── README.md # Placeholder file to remind the user that there are no associated files with this section.
│
├── SI_section_05_odp_software
│   └── README.md # Contains URLs to download the odp software (not included in this Dryad repository)
│
├── SI_section_06_OdpTesting
│   ├── README.md # A description of the analyses carried out below
│   ├── step1_4wayrbh # Directory containing the 4-way reciprocal best hit analysis
│   │   ├── config.yaml # Snakemake configuration file for the 4-way reciprocal best hit analysis testing for cnidarian or spiralian fusion-with-mixing events
│   │   └── odp_nway_rbh # The output directory resulting from running the snakemake pipeline with the config.yaml file above
│   │       ├── step1-rbh # Contains the rbh files for the species combinations EMU-RES-PMA-BFL and EMU-RES-HVU-PMA
│   │       │   └── *_*_*_*_reciprocal_best_hits.rbh # The .rbh files for the cnidarian or spiralian analyses are named with this convention
│   │       ├── step2-groupby # After generating the .rbh files, .groupby files are created to group by chromosome combination
│   │       │   ├── *_*_*_*.rbh.filt.groupby # .groupby sets of chromosomes that are "ancestral linkage groups" with a false discovery rate <=0.05
│   │       │   ├── *_*_*_*.rbh.groupby # .groupby files for the cnidarian and spiralian 4-way rbh searches that have not been filtered for noise
│   │       │   └── FDR
│   │       │       ├── *_*_*_*.FDR.tsv # contains the calculated false discovery rates for each four-way species comparison
│   │       │       └── sim
│   │       │           └── *_*_*_*_sim_*.tsv # contains the raw data for the false discovery rate calculations
│   │       └── step3-unwrap
│   │           └── *_*_*_*.filt.unwrapped.rbh # the filtered .groupby files above have been unwrapped back into .rbh files. Final results of the 4-way rbh analyses.
│   │
│   └── step2_simulations
│       ├── step2_simulation_BFL_EMU_PMA_RES # Simulation files for the BFL-EMU-PMA-RES species combination (bilaterian fusion-with-mixing events)
│       │   ├── config.yaml # Snakemake configuration file for odp_genome_rearrangement_simulation
│       │   └── odp_genome_mutation_analysis
│       │       ├── figures
│       │       │   └── EMU-BFL-RES-PMA
│       │       │       ├── EMU-BFL-RES-PMA_randomization_marginals_*-randomizations.pdf # 2-axis marginal distribution heatmaps of the simulation results
│       │       │       ├── EMU-BFL-RES-PMA_randomization_stats_*-randomizations.txt # Text file with summary statistics of the analysis results
│       │       │       └── EMU-BFL-RES-PMA_support_*-randomizations.pdf # Summary results of shuffling the * species' genome
│       │       └── sim_randomization # Contains the raw data for the simulation results
│       │           └── EMU-BFL-RES-PMA
│       │               ├── initialization
│       │               │   ├── EMU-BFL-RES-PMA_df_of_fusionsUnmixed_*.txt # Linkage group fusion events in species * that are on a single chromosome, but are unmixed
│       │               │   ├── EMU-BFL-RES-PMA_df_of_groups_supporting_*_sister.txt # pandas df of linkage groups that support * as sister to the other two non-outgroup species
│       │               │   ├── EMU-BFL-RES-PMA_groups_supporting_*_sister.txt # list of orthologs that support species * as sister to the other two non-outgroup species
│       │               │   ├── EMU-BFL-RES-PMA_random_sim_init_plottable.txt # Parameters that were measured from the input genomes such as number of ALGs, genes, etc.
│       │               │   ├── EMU-BFL-RES-PMA_random_sim_init.txt # A list of measured parameters from the initial genomes, only used by software.
│       │               │   └── EMU-BFL-RES-PMA_rows_annotated.rbh.groupby # Contains annotations of which ALGs support which hypothesis, generated post-analysis.
│       │               ├── plottable
│       │               │   └── EMU-BFL-RES-PMA_random_sim_*_%_plottable.txt # Contains simulation data summary stats for species *, set of trials %.
│       │               ├── plottable_final
│       │               │   └── EMU-BFL-RES-PMA_random_sim_final_plottable_*.txt # Contains the concatenated simulation data from the ../plottable files above. For plotting.
│       │               └── trials
│       │                   └── EMU-BFL-RES-PMA_random_sim_*_%.txt # Contains the raw simulation data for species *, set of trials %.
│       │
│       └── step2_simulation_EMU_RES_HVU_PMA # Simulation files for the EMU-RES-HVU-PMA species combination (cnidarian fusion-with-mixing events)
│           # The data structure and file contents of this directory are the same as those of the BFL-EMU-PMA-RES directory above.
│
├── SI_section_07_SyntenyAnalyses # Contains the data and code for the synteny analyses in SI section 7
│   ├── README.md # Just a note telling the user that the files in this directory are mostly plots
│   │
│   ├── HCA-EMU-RES # This directory contains the EMU-HCA-RES analyses that were used to generate Extended Data Figure 8
│   │   ├── step1_groupby
│   │   │   ├── config.yaml # a snakemake configuration file for odp_rbh_to_groupby for the EMU-HCA-RES analysis
│   │   │   └── odp_rbh_to_groupby
│   │   │       └── output
│   │   │           ├── EMU_HCA_RESLi_reciprocal_best_hits.rbh.groupby # The .groupby file from the EMU-HCA-RES analysis, unfiltered
│   │   │           └── FDR
│   │   │               ├── EMU_HCA_RESLi_FDR.tsv # The summary of the false discovery rate calculations for the EMU-HCA-RES analysis
│   │   │               └── sim
│   │   │                   └── EMU_HCA_RESLi_sim_*.tsv # The raw data for the false discovery rate calculations for the EMU-HCA-RES analysis
│   │   └── step2_unwrap
│   │       ├── 3way_RES_EMU_HCA.filtered.rbh.groupby # The .groupby file that has been filtered to only contain chromosome pairs with a significant false discovery rate.
│   │       ├── config.yaml # Snakemake configuration file for odp_groupby_to_rbh
│   │       └── odp_groupby_to_rbh
│   │           └── output
│   │               └── EMU_HCA_RESLi.unwrapped.rbh # The raw data for the EMU-HCA-RES analysis
│   │
│   └── odp2_dotplots # Oxford dot plots of the pairs of genomes published in this manuscript. Generated with the odp software.
│       ├── synteny_coloredby_BCnS_LGs # Orthologs colored by the BCnS ALGs described in [Simakov et al. 2022](http://dx.doi.org/10.1126/sciadv.abi5884)
│       │   ├── *_%_xy_reciprocal_best_hits.coloredby_BCnS_LGs.plotted.rbh # An annotated .rbh file for plotting the synteny between species * and %
│       │   ├── *_%_xy_synteny_coloredby_BCnS_LGs.pdf # An odp plot with species * on the x-axis and species % on the y-axis
│       │   └── *_%_yx_synteny_coloredby_BCnS_LGs.pdf # An odp plot with species % on the x-axis and species * on the y-axis
│       │
│       ├── synteny_coloredby_UnicellMetazoanLgs # Orthologs colored by the ALGs shared between unicellular organisms and animals. These ALGs are from this manuscript.
│       │   ├── *_%_xy_reciprocal_best_hits.coloredby_UnicellMetazoanLgs.plotted.rbh # An annotated .rbh file for plotting the synteny between species * and %
│       │   ├── *_%_xy_synteny_coloredby_UnicellMetazoanLgs.pdf # An odp plot with species * on the x-axis and species % on the y-axis
│       │   └── *_%_yx_synteny_coloredby_UnicellMetazoanLgs.pdf # An odp plot with species % on the x-axis and species * on the y-axis
│       │
│       └── synteny_coloredby_UnicellMetazoanLgsOrthofinder # colored by ALGs shared between unicellular organisms and animals. Identified using OrthoFinder results.
│           ├── *_%_xy_reciprocal_best_hits.coloredby_UnicellMetazoanLgsOrthofinder.plotted.rbh # An annotated .rbh file for plotting the synteny between species * and %
│           ├── *_%_xy_synteny_coloredby_UnicellMetazoanLgsOrthofinder.pdf # An odp plot with species * on the x-axis and species % on the y-axis
│           └── *_%_yx_synteny_coloredby_UnicellMetazoanLgsOrthofinder.pdf # An odp plot with species % on the x-axis and species * on the y-axis
│
├── SI_section_08_UnicellALGs # Files related to finding the ancestral linkage groups shared between unicellular organisms and metazoa.
│   │ # The flow of data, as well as which program was used to run each step, is described in detail in Supplementary Information 8.
│   │ # Briefly, each directory contains a snakemake `config.yaml` file set up to run the analyses on the output of a previous directory
│   │ # and on the genomes. 4-way reciprocal best blastp orthologs are identified in several species quartets, then merged and filtered later.
│   │ # The results of the quartet analyses were used to produce Figure 2 and Supplementary Table 2 of the main manuscript.
│   │ # The orthologs are expanded to other species in `step8_rbh_to_hmm`, and in `step9_hmm_to_mixing` these expanded orthologs
│   │ # were used to make Figure 3 of the main manuscript.
│   │ # The results are summarized in Supplementary Data 2 of the main manuscript.
│   │
│   ├── step1_synteny_4way # The 4-way reciprocal best blastp orthologs for the species quartets described in the main manuscript
│   │   ├── config.yaml # The snakemake configuration file for the 4-way reciprocal best blastp orthologs - Runs with odp_nway_rbh
│   │   └── odp_nway_rbh
│   │       ├── blastp_results
│   │       │   ├── reciprocal_best
│   │       │   │   └── *_and_%_recip.temp.blastp # The reciprocal best blastp output for the species pairs
│   │       │   │   ├── *_*_*_*_acceptable_prots.txt # The proteins that were used to make the 4-way reciprocal best blastp orthologs
│   │       │   │   ├── *_*_*_*_edges.txt # The edges of the 4-way reciprocal best blastp orthologs
│   │       │   ├── xtoy
│   │       │   │   └── *_against_%.blastp # The raw blastp output for the species pairs, Direction * -> %
│   │       │   └── ytox
│   │       │       └── TAD_against_SRO.blastp # The raw blastp output for the species pairs, Direction % -> *
│   │       ├── db
│   │       │   ├── xaxis
│   │       │   │   ├── *_prots.pep* # Protein sequences for the species for one of the search directions
│   │       │   │   └── dmnd
│   │       │   │       └── *_prots.dmnd # Diamond database for the species for one of the search directions
│   │       │   └── yaxis
│   │       │       ├── *_prots.pep* # Protein sequences for the species for one of the search directions
│   │       │       └── dmnd
│   │       │           └── *_prots.dmnd # Diamond database for the species for the other search direction
│   │       └── rbh
│   │           ├── CLA_COW_HCA_RESLi_reciprocal_best_hits.rbh # reciprocal best blastp orthologs for the species quartet CLA_COW_HCA_RES
│   │           ├── COW_EMU_HCA_RESLi_reciprocal_best_hits.rbh # reciprocal best blastp orthologs for the species quartet COW_EMU_HCA_RES
│   │           └── EMU_HCA_RESLi_SRO_reciprocal_best_hits.rbh # reciprocal best blastp orthologs for the species quartet EMU_HCA_RES_SRO
│   │ # An .rbh file for CFR-EMU-HCA-RES can be generated from the CFR_EMU_HCA_RESLi_reciprocal_best_hits.rbh.groupby file
│   │
│   ├── step2_rbh_to_groupby_CFR # With the 4-way rbh files from the previous analysis, create a .groupby file for CFR-EMU-HCA-RES
│   │   ├── config.yaml # snakemake config file to run the rbh-to-groupby analysis - runs with odp_rbh_to_groupby
│   │   └── odp_rbh_to_groupby
│   │       └── output
│   │           ├── CFR_EMU_HCA_RESLi_reciprocal_best_hits.rbh.groupby # The .groupby file for the species quartet CFR_EMU_HCA_RES
│   │           └── FDR
│   │               ├── CFR_EMU_HCA_RESLi_FDR.tsv # The FDR values for the species quartet CFR_EMU_HCA_RES
│   │               └── sim
│   │                   └── CFR_EMU_HCA_RESLi_sim_*.tsv # The simulation jobs to calculate the FDR values for this species quartet
│   │
│   ├── step2_rbh_to_groupby_COW # With the 4-way rbh files from the previous analysis, create a .groupby file for COW-EMU-HCA-RES
│   │   ├── config.yaml # snakemake config file to run the rbh-to-groupby analysis - runs with odp_rbh_to_groupby
│   │   └── odp_rbh_to_groupby
│   │       └── output
│   │           ├── COW_EMU_HCA_RESLi_reciprocal_best_hits.rbh.groupby # The .groupby file for the species quartet COW_EMU_HCA_RES
│   │           └── FDR
│   │               ├── COW_EMU_HCA_RESLi_FDR.tsv # The FDR values for the species quartet COW_EMU_HCA_RES
│   │               └── sim
│   │                   └── COW_EMU_HCA_RESLi_sim_*.tsv # The simulation jobs to calculate the FDR values for this species quartet
│   │
│   ├── step2_rbh_to_groupby_SRO # With the 4-way rbh files from the previous analysis, create a .groupby file for SRO-EMU-HCA-RES
│   │   ├── config.yaml # snakemake config file to run the rbh-to-groupby analysis - runs with odp_rbh_to_groupby
│   │   └── odp_rbh_to_groupby
│   │       └── output
│   │           ├── EMU_HCA_RESLi_SRO_reciprocal_best_hits.rbh.groupby # The .groupby file for the species quartet EMU_HCA_RES_SRO
│   │           └── FDR
│   │               ├── EMU_HCA_RESLi_SRO_FDR.tsv # The FDR values for the species quartet EMU_HCA_RES_SRO
│   │               └── sim
│   │                   └── EMU_HCA_RESLi_SRO_sim_*.tsv # The simulation jobs to calculate the FDR values for this species quartet
│   │
│   ├── step3_groupby_filter # Filter the .groupby files to keep chromosome combinations with an FDR < 0.05
│   │   ├── config.yaml # snakemake config file to run the groupby-filter analysis - runs with odp_groupby_filter
│   │   └── odp_groupby_filter
│   │       └── output
│   │           ├── CFR_EMU_HCA_RESLi_reciprocal_best_hits.rbh.filt.groupby # The filtered .groupby file for the species quartet CFR_EMU_HCA_RES
│   │           ├── COW_EMU_HCA_RESLi_reciprocal_best_hits.rbh.filt.groupby # The filtered .groupby file for the species quartet COW_EMU_HCA_RES
│   │           └── EMU_HCA_RESLi_SRO_reciprocal_best_hits.rbh.filt.groupby # The filtered .groupby file for the species quartet EMU_HCA_RES_SRO
│   │
│   ├── step4_unwrap_filtered # The filtered .groupby files are unwrapped to create a .rbh file for each chromosome combination
│   │   ├── config.yaml # snakemake config file to run the groupby-to-rbh analysis - runs with odp_groupby_to_rbh
│   │   └── odp_groupby_to_rbh
│   │       └── output
│   │           ├── CFR_EMU_HCA_RESLi.unwrapped.rbh # The filtered .rbh file for the species quartet CFR_EMU_HCA_RES
│   │           ├── COW_EMU_HCA_RESLi.unwrapped.rbh # The filtered .rbh file for the species quartet COW_EMU_HCA_RES
│   │           └── EMU_HCA_RESLi_SRO.unwrapped.rbh # The filtered .rbh file for the species quartet EMU_HCA_RES_SRO
│   │
│   ├── step5_merge_filtered_rbh # In this step the 3 filtered .rbh files are merged into a single file based on EMU-HCA-RES
│   │   ├── config.yaml # snakemake config file to run the merge-rbh analysis - runs with odp_rbh_merge
│   │   └── odp_rbh_merge
│   │       └── output
│   │           └── EMU_HCA_RESLi.mer # The merged .rbh file for the species quartet EMU_HCA_RES
│   │
│   ├── step6_rbh_to_groupby # In this step the merged .rbh file is converted to a .groupby file to annotate the ALGs
│   │   ├── config.yaml # snakemake config file to run the rbh-to-groupby analysis - runs with odp_rbh_to_groupby
│   │   └── odp_rbh_to_groupby_noalpha
│   │       └── output
│   │           └── EMU_HCA_RESLi_COW_SRO_CFR.groupby # The .groupby file for the species group (CFR-SRO-COW)-EMU-HCA-RES
│   │
│   ├── step7_annotated_groupby_to_rbh # In this step the annotated .groupby file is converted back to an .rbh file
│   │   ├── CFR_COW_EMU_HCA_RESLi_SRO.annotated.groupby # The annotated .groupby file for the species group (CFR-SRO-COW)-EMU-HCA-RES
│   │   ├── config.yaml # snakemake config file to run the groupby-to-rbh analysis - runs with odp_groupby_to_rbh
│   │   └── odp_groupby_to_rbh
│   │       └── output
│   │           └── CFR_COW_EMU_HCA_RESLi_SRO.unwrapped.rbh # The annotated .rbh file for the species group (CFR-SRO-COW)-EMU-HCA-RES
│   │
│   ├── step8_rbh_to_hmm # In this step the orthologs in the .rbh file are converted to .hmm files to be used in the HMMER search of other species
│   │   ├── config.yaml # snakemake config file to run the rbh-to-hmm analysis - runs with odp_rbh_to_HMM
│   │   └── odp_rbh_to_HMM
│   │       ├── fasta
│   │       │   ├── aligned
│   │       │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*.aligned.fasta # The aligned .fasta files for each ortholog of the species group (CFR-SRO-COW)-EMU-HCA-RES
│   │       │   └── unaligned
│   │       │       └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*.fasta # The unaligned .fasta files for each ortholog of the species group (CFR-SRO-COW)-EMU-HCA-RES
│   │       ├── hmm
│   │       │   ├── hmms
│   │       │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*.hmm # The .hmm files for each ortholog of the species group (CFR-SRO-COW)-EMU-HCA-RES
│   │       │   ├── searches
│   │       │   │   ├── BFL
│   │       │   │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*_against_BFL.tsv # Ortholog .hmm *'s search results against BFL - Branchiostoma floridae
│   │       │   │   ├── CEL
│   │       │   │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*_against_CEL.tsv # Ortholog .hmm *'s search results against CEL - C. elegans
│   │       │   │   ├── CLAa
│   │       │   │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*_against_CLAa.tsv # Ortholog .hmm *'s search results against CLAa - Cladorhizid sponge hapA
│   │       │   │   ├── CLAb
│   │       │   │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*_against_CLAb.tsv # Ortholog .hmm *'s search results against CLAb - Cladorhizid sponge hapB
│   │       │   │   ├── HSA
│   │       │   │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*_against_HSA.tsv # Ortholog .hmm *'s search results against HSA - Homo sapiens
│   │       │   │   ├── NVE
│   │       │   │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*_against_NVE.tsv # Ortholog .hmm *'s search results against NVE - Nematostella vectensis
│   │       │   │   ├── PMA
│   │       │   │   │   └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*_against_PMA.tsv # Ortholog .hmm *'s search results against PMA - Pecten maximus
│   │       │   │   └── TAD
│   │       │   │       └── mer_CFR_COW_EMU_HCA_RESLi_SRO_*_against_TAD.tsv # Ortholog .hmm *'s search results against TAD - Trichoplax adhaerens
│   │       │   ├── searches_agg
│   │       │   │   ├── *_hmm_results.sorted.tsv # The sorted hmmer search results against species *
│   │       │   │   └── *_hmm_results.tsv # The unsorted hmmer search results against species *
│   │       │   └── searches_agg_best
│   │       │       └── *_hmm_best.tsv # The best hmmer search result for the orthologs against species *
│   │       └── output
│   │           └── CFR_COW_EMU_HCA_RESLi_SRO_rbhhmm_plus_other_species.rbh # The .rbh file for the species group (CFR-SRO-COW)-EMU-HCA-RES plus all the species listed above
│   │
│   └── step9_hmm_to_mixing # With the expanded .rbh file from the previous step, we now plot the mixing of the different ALG groups
│       ├── config.yaml # snakemake config file to run the hmm-to-mixing analysis - runs with odp_rbh_plot_mixing
│       └── odp_rbh_plot_mixing
│           ├── input
│           │   └── *_chrom_sizes.tsv # The chromosome sizes for each species
│           ├── output
│           │   └── *_plots.pdf # The plots of ALG mixing on chromosomes for each species
│           ├── output_mixing
│           │   ├── *_mixing_simulation.txt # The mixing simulation results for each species, complete with the plotted distribution
│           │   └── *_mixing.tsv # A table of the mixing results for each species
│           └── output_mixing_merged
│               └── all_species_mixing.tsv # A table of the mixing results for all the species
│
├── SI_section_09_ALGextension # Contains one README.md with instructions on where to find the relevant files for SI section 9.
│   └── README.md # They are included in `SI_section_08_UnicellALGs` because the analyses are linked together
│                 # logically and in terms of input data.
│
├── SI_section_10_Orthofinder # Files pertaining to Supplementary Information section 10 - OrthoFinder analyses
│   ├── README.txt # Has details for two programs that must be downloaded from the linked Zenodo repository to run the analyses.
│   │
│   ├── run_OF.sh # The file `run_OF.sh` performs the ALG-finding analyses on the
│   │ # OrthoFinder results after the additional scripts are downloaded
│   │ # from Zenodo.
│   │
│   ├── configtemplate.txt # template config file that is used to generate .rbh files for all of the relevant species quartets from the OrthoFinder results.
│   ├── fasta
│   │   └── OrthoFinder # The directory `fasta` contains the OrthoFinder results generated using the protein files detailed above.
│   │       └── Results_Jul02 # The most important files that contain the information about which proteins in which species form orthogroups are located
│   │ # in this folder: `SI_section_10_Orthofinder/fasta/OrthoFinder/Results_Jul02/Orthogroups`.
│   │ # Please familiarize yourself with the [OrthoFinder software](https://github.com/davidemms/OrthoFinder)
│   │ # to better understand the many files in this directory.
│   │
│   ├── individual_OG_fig.pdf # Contains the collated results summarizing the support for the ctenophore-sister hypothesis given the species
│   │ # quartets from OrthoFinder. This was used in part to create Extended Data Figure 9 in the main manuscript.
│   │
│   └── rbh # The directory `rbh` contains the individual quartet analyses that were generated from running `run_OF.sh`. In each directory
│       │ # there are files that detail the orthogroups that support the hypothesis of a particular species being the outgroup of the
│       │ # species trio in question. These files will likely be of interest to readers of this Dryad repository. These files can be found in
│       │ # this subdirectory, one example provided here: `rbh/COW_HCA_EMU_RES/odp_genome_mutation_analysis/sim_randomization/COW-HCA-EMU-RES/initialization`.
│       │
│       │
│       ├── CLA_summary.sh # A shell script that summarizes the results of the simulations that include the cladorhizid sponge
│       ├── *_*_*_*_reciprocal_best_hits.rbh # The .rbh file for the species quartet *-*-*-*, generated from the OrthoFinder orthogroups. See the manuscript for details.
│       └── *_*_*_*
│           ├── config.yaml # snakemake config file to run the odp_genome_mutation_analysis
│           └── odp_genome_mutation_analysis
│               ├── figures
│               │   └── *-*-*-* # Figures for the species quartet *-*-*-*
│               │       ├── *_*_*_*_randomization_marginals_%-randomizations.pdf # 2-axis marginal distribution heatmaps of the simulation results
│               │       ├── *_*_*_*_randomization_stats_%-randomizations.txt # Text file with summary statistics of the analysis results
│               │       └── *_*_*_*_support_%-randomizations.pdf # Summary results of shuffling the * species' genome
│               └── sim_randomization
│                   └── *-*-*-*
│                       ├── initialization
│                       │   ├── *_*_*_*_df_of_fusionsUnmixed_%.txt # Linkage group fusion events in species % that are on a single chromosome, but are unmixed
│                       │   ├── *_*_*_*_df_of_groups_supporting_%_sister.txt # pandas df of linkage groups that support % as sister to the other two non-outgroup species
│                       │   ├── *_*_*_*_groups_supporting_%_sister.txt # list of orthologs that support species % as sister to the other two non-outgroup species
│                       │   ├── *_*_*_*_random_sim_init_plottable.txt # Parameters that were measured from the input genomes such as number of ALGs, genes, etc.
│                       │   ├── *_*_*_*_random_sim_init.txt # A list of measured parameters from the initial genomes, only used by software.
│                       │   └── *_*_*_*_rows_annotated.rbh.groupby # Contains annotations of which ALGs support which hypothesis, generated post-analysis.
│                       ├── plottable
│                       │   └── *-*-*-*_random_sim_%_&_plottable.txt # Contains simulation data summary stats for species %, set of trials &.
│                       ├── plottable_final
│                       │   └── *_*_*_*_random_sim_final_plottable_%.txt # Contains the concatenated simulation data from the ../plottable files above. For plotting.
│                       └── trials
│                           └── *-*-*-*_random_sim_%_&.txt # Contains the raw simulation data for species %, set of trials &.
│
├── SI_section_11_conservation_score # Contains the scripts and data used to generate the conservation score plots in the manuscript.
│   ├── README.txt # Lists a script that must be downloaded from Zenodo in order to recreate these analyses.
│   ├── OGs.rbh # An .rbh file of the OrthoFinder results from Supplementary Information 10.
│   ├── SRO_and_TAD_conservation_score_max.pdf # A histogram of the maximum conservation score for each orthogroup. Orthologs on the y-axis.
│   ├── SRO_and_TAD_conservation_score_plotted.pdf # The conservation score plotted in gene index coordinates for the species pair *_and_*. Cutoffs coded in script on Zenodo.
│   ├── SRO_and_TAD_conservation_score_plotting.rbh # An .rbh file of the OrthoFinder results from Supplementary Information 10, but with the conservation scores added.
│   ├── SRO_and_TAD_OG_conservation_edges.pd.tsv # A table of the conservation scores for each edge between orthologs in each orthogroup.
│   └── SRO_and_TAD_sum_conservation_score.pdf # A histogram of the sum of the conservation scores for each orthogroup. See the script on Zenodo for details.
├── SI_section_12_GO_enrichment
│   └── README.md # Points any directory users to Supplementary Data 3, published alongside the main manuscript.
├── SI_section_13_GeneMixingAnalysis
│   └── README.md # Points any directory users to Supplementary Data 4, published alongside the main manuscript.
├── SI_section_14_BayesianMethodology
│   └── README.md # Points any directory users to Supplementary Data 6, published alongside the main manuscript.
└── SI_section_15_HypothesisTesting
    ├── README.md # Points any directory users to Extended Data Figure 10 in the main manuscript to see the results plotted.
    └── COW_EMU_HCA_RES
        └── config.yaml # Example config file to run this genome shuffling analysis.
```

- - -

END OF README
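
Usage note for locating result files: the SI\_section\_06 description suggests `find . -name '*.pdf'` for finding the final figures. On systems without `find`, roughly the same search can be sketched in Python. The function name `find_result_files` and the example path are illustrative assumptions; they are not part of the dataset or of the odp software.

```python
from pathlib import Path


def find_result_files(root, pattern="*.pdf"):
    """Return a sorted list of files under `root` whose names match `pattern`.

    Hypothetical helper: a rough stand-in for `find <root> -name '<pattern>'`.
    """
    root = Path(root)
    # rglob walks the directory tree recursively; keep regular files only.
    return sorted(p for p in root.rglob(pattern) if p.is_file())


if __name__ == "__main__":
    # Assumed layout: run from the top level of the unpacked dataset.
    for path in find_result_files("SI_section_06_OdpTesting/step2_simulations"):
        print(path)
```

Passing a different pattern, for example `"*.filt.unwrapped.rbh"`, lists other result files in the same way.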