
Phylogenomic analyses of ochrophytes (stramenopiles) with an emphasis on neglected lineages

Cite this dataset

Cho, Anna; Lax, Gordon; Keeling, Patrick; Keeling, keelab (2024). Phylogenomic analyses of ochrophytes (stramenopiles) with an emphasis on neglected lineages [Dataset]. Dryad.


Ochrophyta is a photosynthetic lineage that crowns the phylogenetic tree of stramenopiles, one of the major eukaryotic supergroups. Due to their ecological impact as major primary producers, ochrophytes are relatively well-studied compared to the rest of the stramenopiles, yet their evolutionary relationships remain poorly understood. This is in part due to a number of missing lineages in large-scale multigene analyses, and an apparently rapid radiation leading to many short internodes between ochrophyte subgroups in the tree. These short internodes are also found across deep-branching lineages of stramenopiles with limited phylogenetic signal, leaving many relationships controversial overall. We have addressed this issue with other deep-branching stramenopiles recently, and now examine whether contentious relationships within the ochrophytes may be resolved by filling in missing lineages in an updated phylogenomic dataset of ochrophytes, along with exploring various gene filtering criteria to identify the most phylogenetically informative genes. We generated ten new transcriptomes from various culture collections and single-cell isolation from an environmental sample, added these to an existing phylogenomic dataset, and examined the effects of selecting genes with high phylogenetic signal or low phylogenetic noise. For some previously contentious relationships, we find that a variety of analyses and gene filtering criteria consistently unite previously unstable groupings with strong statistical support. For example, we recovered a robust grouping of Eustigmatophyceae with Raphidophyceae-Phaeophyceae-Xanthophyceae, while Olisthodiscophyceae formed a sister lineage to Pinguiophyceae. Selecting genes with high phylogenetic signal or data quality recovered more stable topologies.
Overall, we find that adding under-represented groups across different lineages is still crucial in resolving phylogenomic relationships, and that discrete gene properties affect lineages of stramenopiles differently, which may be further explored to better understand the molecular evolution of stramenopiles.

README: Phylogenomic analyses of ochrophytes (stramenopiles) with an emphasis on neglected lineages

Supplementary tree files, supermatrices, and calculated gene properties generated from different gene-filtering criteria are described below.

Description of different gene-filtering criteria

Genes were selected from a 241-gene set according to the gene-filtering criteria described below. The selected genes were concatenated into supermatrices, which were then used to reconstruct phylogenomic trees. Each supermatrix and treefile is therefore labelled with its gene-filtering criterion and size parameter (e.g., A100 indicates genes selected under criterion A with the top 100 values; S180 indicates genes selected under criterion S with the top 180 highest values).

(A) high values of treeness and occupancy
(B) high values of average_bootstrap(BS)_support, Robinson-Foulds similarity (robinson_sim), and gene length
(C) low values of average patristic distance (av_patristic), evolutionary rate, and total tree length
(D) filtering out high values of av_patristic, evolutionary rate, and total tree length
(E) high values of Principal component axis 1 (PC1)-associated noise (root_to_tip_variance, av_patristic, and saturation)
(F) high values of all noise (including relative composition frequency variability, RCFV)
(N) Neutral genes - not well explained by the gene properties
(S) high values of signal (treeness, average_BS_support, robinson_sim)
(Q) high values of data quality (occupancy and gene length)
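The selection step can be illustrated with a short Python sketch. It assumes that a gene passes a criterion when it falls within the top N ranks of every property belonging to that criterion (the supermatrix descriptions speak of "shared genes"), and that the properties are available as one record per gene. The column names, the `CRITERIA` table, and the toy values are hypothetical and only cover criteria whose direction is stated explicitly above; this is not the authors' script.

```python
# Assumed mapping: criterion letter -> (property columns, ranking direction).
# Column names are illustrative, not taken from the deposited CSV files.
CRITERIA = {
    "A": (["treeness", "occupancy"], "high"),
    "B": (["average_BS_support", "robinson_sim", "gene_length"], "high"),
    "C": (["av_patristic", "evolutionary_rate", "tree_length"], "low"),
    "S": (["treeness", "average_BS_support", "robinson_sim"], "high"),
    "Q": (["occupancy", "gene_length"], "high"),
}


def top_genes(rows, prop, n, direction):
    """Return the names of the n genes ranked best for a single property."""
    ranked = sorted(rows, key=lambda r: float(r[prop]),
                    reverse=(direction == "high"))
    return {r["gene"] for r in ranked[:n]}


def select_genes(rows, criterion, n):
    """Genes found in the top-n list of every property of the criterion."""
    props, direction = CRITERIA[criterion]
    per_property = [top_genes(rows, p, n, direction) for p in props]
    return sorted(set.intersection(*per_property))


# Toy records (made-up values) standing in for a properties table.
toy_rows = [
    {"gene": "g1", "treeness": 0.9, "occupancy": 0.50},
    {"gene": "g2", "treeness": 0.8, "occupancy": 0.90},
    {"gene": "g3", "treeness": 0.7, "occupancy": 0.80},
    {"gene": "g4", "treeness": 0.2, "occupancy": 0.95},
    {"gene": "g5", "treeness": 0.1, "occupancy": 0.10},
]

selected = select_genes(toy_rows, "A", 3)
print(selected)
```

Because the sets are intersected, a label such as A100 may contain fewer than 100 genes under this reading; the gene_indices files described below list the genes actually used.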

Description of the data and file structure

We provide four .zip files, each containing the relevant files obtained from the different gene-filtering criteria (i.e., A-F, N, S, and Q) described above:

  1. A folder labelled "supermatrices" contains 47 *.fas files covering each gene-filtering criterion at different sizes. The fas files are named after the filtering criterion followed by the number of top-ranked (highest or lowest) values retained for each gene property (e.g., "A80.fas", "C100.fas", and "ABC100.fas"). For example, "A60.fas" is a supermatrix consisting of shared genes with the top 60 highest values of treeness and occupancy, and "B80.fas" is a supermatrix consisting of shared genes with the top 80 highest values of average_bootstrap(BS)_support, Robinson-Foulds similarity (robinson_sim), and gene length. "ABC120.fas" is a supermatrix combining genes from "A120.fas", "B120.fas", and "C120.fas". Only 29 of these supermatrices (A-D and ABC120-160.fas; E and F140-180.fas; N.fas; Q120-180.fas; S140-180.fas) and the subsequent tree files (see #4) were used to interpret the effect of different gene filtering criteria on phylogenomic relationships (see Fig. 2 of the manuscript).
  2. A folder labelled "gene_indices" contains 49 *_indices.tsv files, one for each gene-filtering criterion and size. Each file lists the genes used to build the corresponding supermatrix described in #1 (see above) and the position of each gene within that supermatrix. "231supermatirx_indices.tsv" and "233supermatrix_indices.tsv" list the genes used to construct the supermatrices that were not filtered.
  3. A folder labelled "calculated_properties" contains 29 *_properties_sorted_dataset.csv files. Each file lists the values of 13 calculated gene properties (treeness, occupancy, average_BS_support, robinson_sim, RCFV, gene length, av_patristic, evolutionary rate, tree length, root_to_tip_variance, saturation, PC1 and PC2 values) for each gene used to construct supermatrices under the different gene filtering criteria. A detailed description of the 13 gene properties, along with relevant references, has been compiled by Mongiardino Koch N. (2021). A brief description of each property follows:
    1. treeness: Proportion of tree length on internal branches (Lanyon 1988)
    2. occupancy: percentage of amino acid sites present (out of the entire alignment length) for each taxon or transcriptome
    3. average_BS_support: average non-parametric bootstrap values for each node with 100 replicates
    4. robinson_sim: Robinson-Foulds Similarity (Robinson and Foulds 1981), a measure of congruence between a gene tree and the species tree
    5. RCFV: Relative composition frequency variability (Zhong et al., 2011), a proxy for compositional heterogeneity among tree terminals (i.e., a measure of each amino acid usage for each taxon).
    6. gene length: length of the gene alignment (i.e., number of amino acid sites)
    7. av_patristic: average pair-wise patristic distance (Struck 2014)
    8. evolutionary rate: calculated by the total branch length of a tree divided by the number of taxa or terminals
    9. tree length: sum of all branch lengths in the tree
    10. root_to_tip_variance: variance of root-to-tip distances, a proxy for inferred mutations from their most common ancestor
    11. saturation: one minus the slope of a linear regression relating the pair-wise patristic distances of a gene tree to the p-distances in the corresponding alignment (Nosenko et al., 2013)
    12. PC1 and PC2 values: contribution of each gene property to Principal component axes 1 and 2
    13. proportion of missing data: proportion of missing data (i.e., amino acid sites) for each taxon
  4. A folder labelled "treefiles" includes 86 *.treefile files. These are phylogenomic trees reconstructed from supermatrices consisting of genes selected using the different gene filtering criteria described above (also see #1). Additionally, we include treefiles reconstructed from randomly selected genes (randGene.treefile) or after randomly removing amino acid sites (randSite.treefile). For trees reconstructed from randomly selected genes, "20rep1.randGene.treefile" indicates that 20% of the 231 genes were randomly selected; this was repeated 14 times (i.e., rep1 to rep14 in the file names). For trees reconstructed after randomly removing a percentage of the 72,932 amino acid sites, the file names state the fraction removed. For example, "randSite.0.1.treefile" indicates that 10% of amino acid sites were randomly removed before reconstructing the tree. "231-supermatrix_C60PMSF.treefile" and "233-supermatrix_C60PMSF.treefile" are the main trees reconstructed without filtering any genes. "231-supermatrix_CAT-PMSF_chain1.treefile" and "231-supermatrix_CAT-PMSF_chain2.treefile" were reconstructed using a different tree inference method.
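Several of the tree-based properties listed under #3 reduce to simple arithmetic on branch lengths. The Python sketch below illustrates treeness, tree length, evolutionary rate, and saturation on toy numbers. It assumes a gene tree has already been reduced to a list of (branch length, is_internal) pairs, and that the saturation regression is an ordinary least-squares fit of p-distance on patristic distance with an intercept; both are assumptions made for illustration, and the deposited values were computed with the R script cited in this README, not with this code.

```python
def tree_length(branches):
    """Sum of all branch lengths."""
    return sum(length for length, _ in branches)


def treeness(branches):
    """Proportion of the tree length that lies on internal branches."""
    internal = sum(length for length, is_internal in branches if is_internal)
    return internal / tree_length(branches)


def evolutionary_rate(branches):
    """Tree length divided by the number of terminals."""
    n_terminals = sum(1 for _, is_internal in branches if not is_internal)
    return tree_length(branches) / n_terminals


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def saturation(patristic, p_distance):
    """One minus the regression slope; the regression direction is assumed."""
    return 1.0 - ols_slope(patristic, p_distance)


# Toy four-taxon tree: four terminal branches and one internal branch.
toy_branches = [(0.1, False), (0.2, False), (0.3, False), (0.2, False),
                (0.2, True)]
# Toy pairwise distances (made-up values).
toy_patristic = [0.2, 0.4, 0.6]
toy_p_distance = [0.1, 0.2, 0.3]

print(treeness(toy_branches), evolutionary_rate(toy_branches))
print(saturation(toy_patristic, toy_p_distance))
```

Under this definition, a slope near 1 (p-distances keep pace with patristic distances) gives a saturation near 0, while a flattening relationship pushes the value towards 1.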


Tice et al., (2021):

Mongiardino Koch, (2021):

Lanyon, S.M. (1988):

Robinson, D.F. and Foulds, L.R. (1981):

Zhong, M., Hansen, B., Nesnidal, M., Golombek, A., Halanych, K.M., Struck, T.H. (2011):

Struck, T.H. (2014):

Nosenko, T., et al., (2013):

Sharing/Access Information

The orthologs were searched against the 241 genes archived in PhyloFisher v1.1.2 by Tice et al. (2021)

The 13 gene properties were calculated using a modified R-script available in, written by Mongiardino Koch N. 2021 (

**For additional queries, please contact the corresponding author at: or


Nine cultures of under-represented ochrophytes (including Olisthodiscophyceae, Schizocladiophyceae, Picophagea, and Phaeothamniophyceae) were obtained from various culture collections, in addition to Vicicitus globosus, which was manually isolated from environmental samples. Except for V. globosus, we extracted RNA using TRIzol or CTAB. RNA extracts (and single-cell isolates of V. globosus) were then subjected to cDNA synthesis using the poly-A selection-based Smart-Seq2 protocol. The sequencing libraries were prepared using the Illumina Flex Library Preparation Kit, followed by Illumina NextSeq (150 bp paired-end) mid-output transcriptomic sequencing.

Filtered and processed transcriptomes were then used to look for orthologs archived in PhyloFisher v1.1.2 ( which were then used to concatenate the main supermatrix ('231-supermatrix'). 
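To make the concatenation step and the accompanying gene index files concrete, here is a minimal Python sketch of how per-gene alignments can be joined into a supermatrix while recording each gene's position. It is not PhyloFisher's implementation: the input layout (a dict of gene name to taxon-to-sequence mappings), the gap-filling of absent taxa with "-", and the 1-based inclusive coordinates are all assumptions chosen for illustration.

```python
def concatenate(gene_alignments):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: dict mapping gene name -> {taxon: aligned sequence}.
    Returns (supermatrix, indices), where supermatrix maps taxon -> sequence
    and indices is a list of (gene, start, end) in 1-based coordinates.
    """
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    parts = {t: [] for t in taxa}
    indices = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {gene} differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are padded with gap characters.
            parts[taxon].append(aln.get(taxon, "-" * length))
        indices.append((gene, start, start + length - 1))
        start += length
    supermatrix = {t: "".join(p) for t, p in parts.items()}
    return supermatrix, indices


# Toy input with hypothetical gene and taxon names.
toy_genes = {
    "geneA": {"taxon1": "MK-L", "taxon2": "MKAL"},
    "geneB": {"taxon1": "QQ"},
}
matrix, positions = concatenate(toy_genes)
for taxon, seq in matrix.items():
    print(f">{taxon}\n{seq}")
print(positions)
```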

Using various gene properties calculated from the '231-supermatrix' using R-scripts posted on, we subsampled genes based on different criteria outlined in the manuscript. The subsequent phylogenomic trees were inferred under the profile mixture model LG+C60+F+G4 with PMSF using IQ-TREE v2.1.2.
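As a rough guide to what such a run looks like, the Python sketch below assembles an IQ-TREE 2 command line for a PMSF analysis under LG+C60+F+G4. In IQ-TREE, PMSF requires a guide tree supplied with `-ft`; how the guide trees for this dataset were inferred is not stated here, so the guide tree file name is hypothetical, as are the executable name `iqtree2` and the output prefix. The function only builds the argument list and does not run anything.

```python
import shlex


def pmsf_command(alignment, guide_tree, prefix, model="LG+C60+F+G4"):
    """Build an IQ-TREE 2 argument list for a PMSF run (not executed here)."""
    return [
        "iqtree2",             # assumed executable name
        "-s", alignment,       # input supermatrix
        "-m", model,           # profile mixture model named in the text
        "-ft", guide_tree,     # guide tree required for the PMSF approximation
        "--prefix", prefix,    # prefix for output files such as *.treefile
    ]


# Hypothetical file names, used only to show the resulting command.
cmd = pmsf_command("A100.fas", "A100_guide.treefile", "A100_C60PMSF")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run` once the file names and any bootstrap or threading options have been set to match the actual analysis.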


Natural Sciences and Engineering Research Council, Award: 2014-03994

Natural Sciences and Engineering Research Council, Award: 2019-535515, CGSD

University of British Columbia, Botany 4YF