Data from: One thousand plant transcriptomes and the phylogenomics of green plants

Mirarab, Siavash; Molloy, Erin; Leebens-Mack, Jim; One Thousand Plant Transcriptomes

Published Nov 06, 2025 on Dryad. https://doi.org/10.5061/dryad.rn8pk0pr1

Data files

Nov 06, 2025 version files (8.10 GB total)

data-removed-seqs.txt: 56.53 KB
FAA.zip: 2.57 GB
FNA.zip: 5.50 GB
keep-core-genome-labels.txt: 362 B
keep-species-labels.txt: 6.10 KB
metadata.zip: 32.74 MB
README.md: 5.26 KB

Abstract

Green plants (Viridiplantae) include around 450,000–500,000 species [1,2] of great diversity and have important roles in terrestrial and aquatic ecosystems. Here, as part of the One Thousand Plant Transcriptomes Initiative, we sequenced the vegetative transcriptomes of 1,124 species that span the diversity of plants in a broad sense (Archaeplastida), including green plants (Viridiplantae), glaucophytes (Glaucophyta), and red algae (Rhodophyta). Our analysis provides a robust phylogenomic framework for examining the evolution of green plants. Most inferred species relationships are well supported across multiple species tree and supermatrix analyses, but discordance among plastid and nuclear gene trees at a few important nodes highlights the complexity of plant genome evolution, including polyploidy, periods of rapid speciation, and extinction. Incomplete sorting of ancestral variation, polyploidization, and massive expansions of gene families punctuate the evolutionary history of green plants. Notably, we find that large expansions of gene families preceded the origins of green plants, land plants, and vascular plants, whereas whole-genome duplications are inferred to have occurred repeatedly throughout the evolution of flowering plants and ferns. The increasing availability of high-quality plant genome sequences and advances in functional genomics are enabling research on genome evolution across the green tree of life.

Dataset DOI: 10.5061/dryad.rn8pk0pr1

Description of the data and file structure

The files provided here include all 14,187 multi-copy gene families circumscribed in the onekp analyses. These families were built using hidden Markov models (HMMs) created from annotated genes in selected genomes. See the paper for a full description.

Note that the main phylogenetic analyses reported in the onekp paper are based only on the single-copy gene data available at https://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/oneKP_capstone_2019 (DOI: 10.25739/8m7t-4e85).
These multi-copy gene families were previously hosted at http://jlmwiki.plantbio.uga.edu/onekp/v2/orthogroups, which is no longer operational, so we make them available here instead.
Note that these are from the capstone version of the onekp project, which differs from the pilot version, published as http://doi.org/10.1073/pnas.1323926111
- Multi-copy gene trees from this earlier analysis are available in the A-Pro Dryad repo.

Files and variables

File: metadata.zip

genes/[gene id]/data-copies-per-species-[gene id].csv: Number of copies for each species, plus the HMMID
genes/[gene id]/data-faalen-info-[gene id].csv: Statistics on the number of core sequences (sequences from the genomes used to build HMMs) and on the sequence length distribution in those cores. These files were generated only for gene families with at least 300 sequences, and only after sequences had been removed in several filtering steps. Thus, they do not exist for small families.
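Because the faalen-info file is absent for small families, it can be useful to check which gene ids have one before relying on it. The following is a minimal Python sketch, not part of the dataset's own tooling. It assumes the archive members follow the two path patterns described above (possibly under an extra top-level folder); the function names and the gene ids used for illustration are hypothetical.

```python
import re
import zipfile
from typing import Iterable, Set, Tuple

# Path patterns taken from the file descriptions above. The exact layout
# inside metadata.zip (e.g. a leading folder) is an assumption, so the
# patterns are searched for anywhere in the member name.
COPIES_RE = re.compile(r"genes/([^/]+)/data-copies-per-species-\1\.csv$")
FAALEN_RE = re.compile(r"genes/([^/]+)/data-faalen-info-\1\.csv$")


def split_by_faalen(names: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (gene ids with a faalen-info file, gene ids without one)."""
    copies: Set[str] = set()
    faalen: Set[str] = set()
    for name in names:
        match = COPIES_RE.search(name)
        if match:
            copies.add(match.group(1))
        match = FAALEN_RE.search(name)
        if match:
            faalen.add(match.group(1))
    return faalen, copies - faalen


def split_archive(zip_path: str) -> Tuple[Set[str], Set[str]]:
    """Apply split_by_faalen to the member names of a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return split_by_faalen(zf.namelist())
```

Calling `split_archive("metadata.zip")` on a downloaded copy would return the two sets of gene ids; the second set is expected to contain the small families.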

File: keep-species-labels.txt

A list of all species included across the dataset.

File: keep-core-genome-labels.txt

List of 31 core genomes used in building HMMs. Any label in keep-species-labels.txt but not keep-core-genome-labels.txt belongs to a transcriptome (mapped using HMMs).
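The rule above amounts to a set difference between the two label files. A minimal Python sketch follows, assuming both files hold whitespace-separated labels (for example one per line); that layout is an assumption and is not stated in the README, and the function names are hypothetical.

```python
from typing import Iterable, Set


def read_labels(lines: Iterable[str]) -> Set[str]:
    """Collect labels from lines, assuming whitespace-separated labels."""
    return {label for line in lines for label in line.split()}


def transcriptome_labels(species_lines: Iterable[str],
                         core_lines: Iterable[str]) -> Set[str]:
    """Labels in the species list that are not core genomes."""
    return read_labels(species_lines) - read_labels(core_lines)
```

With the real files, the two arguments would be open file handles for keep-species-labels.txt and keep-core-genome-labels.txt.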

File: data-removed-seqs.txt

Format:

<orthoid> : <sequence name>

Sequences that could not be back-translated were removed from both the FNA and FAA files. This file lists the orthoid and the name of each sequence removed for lack of successful back-translation.
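The line format above can be read into a mapping from orthoid to removed sequence names. This is a minimal Python sketch under the assumption that each non-empty line holds exactly one `<orthoid> : <sequence name>` pair and that the orthoid itself contains no colon; the function name and the sample identifiers in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_removed(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each orthoid to the list of sequence names removed from it."""
    removed: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only, so a colon inside the
        # sequence name is preserved.
        orthoid, sep, seq_name = line.partition(":")
        if not sep:
            raise ValueError(f"unexpected line format: {line!r}")
        removed[orthoid.strip()].append(seq_name.strip())
    return dict(removed)
```

For example, `parse_removed(open("data-removed-seqs.txt"))` would return one entry per orthoid that lost at least one sequence.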

File: FAA.zip

All the amino acid sequences, given in FASTA format, for each gene, with files given names like genes/[gene id]/[gene id].input.FAA

File: FNA.zip

All the nucleotide sequences, given in FASTA format, for each gene, with files given names like genes/[gene id]/[gene id].input.FNA
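Individual gene families can be read straight from either archive without unpacking everything. The sketch below is a minimal Python illustration, assuming the member paths inside FAA.zip and FNA.zip match the naming pattern given above exactly and that the files are UTF-8 (or ASCII) text; the helper names are hypothetical and the parser is a deliberately simple one, not the tooling used in the paper.

```python
import io
import zipfile
from typing import Iterable, List, Tuple


def member_name(gene_id: str, kind: str) -> str:
    """Build the archive path for a gene; kind is 'FAA' or 'FNA'."""
    return f"genes/{gene_id}/{gene_id}.input.{kind}"


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, in file order."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def read_gene(zip_path: str, gene_id: str, kind: str = "FAA") -> List[Tuple[str, str]]:
    """Read one gene family's sequences from FAA.zip or FNA.zip."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member_name(gene_id, kind)) as handle:
            return parse_fasta(io.TextIOWrapper(handle, encoding="utf-8"))
```

Records are returned as a list rather than a dictionary so that repeated headers, if any occur, are not silently merged.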

Code/software

These are all plain text files (including FASTA) and can be viewed with any text editor.

Access information

Other publicly accessible locations of the data:

https://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/oneKP_capstone_2019

Data was derived from the following sources:

http://jlmwiki.plantbio.uga.edu/onekp/v2/orthogroups (no longer operational)

Funding

The 1KP initiative was funded by the Alberta Ministry of Advanced Education and Alberta Innovates AITF/iCORE Strategic Chair (RES0010334) to G.K.-S.W., Musea Ventures, The National Key Research and Development Program of China (2016YFE0122000), The Ministry of Science and Technology of the People’s Republic of China (2015BAD04B01/2015BAD04B03), the State Key Laboratory of Agricultural Genomics (2011DQ782025) and the Guangdong Provincial Key Laboratory of core collection of crop genetic resources research and application (2011A091000047). Sequencing activities at BGI were also supported by the Shenzhen Municipal Government of China (CXZZ20140421112021913/JCYJ20150529150409546/JCYJ20150529150505656). Computation support was provided by the China National GeneBank (CNGB), the Texas Advanced Computing Center (TACC), WestGrid and Compute Canada; considerable support, including personnel, computational resources and data hosting, was also provided by the iPlant Collaborative (CyVerse) funded by the National Science Foundation (DBI-1265383), National Science Foundation grants IOS 0922742 (to C.W.d., P.S.S., D.E.S. and J.H.L.-M.), IOS-1339156 (to M.S.B.), DEB 0830009 (to J.H.L.-M., C.W.d., S.W.G. and D.W.S.), EF-0629817 (to S.W.G. and D.W.S.), EF-1550838 (to M.S.B.), DEB 0733029 (to T.W. and J.H.L.-M.), and DBI 1062335 and 1461364 (to T.W.), a National Institutes of Health Grant 1R01DA025197 (to T.M.K., C.W.d. and J.H.L.-M.), Deutsche Forschungsgemeinschaft grants Qu 141/5-1, Qu 141/6-1, GR 3526/7-1, GR 3526/8-1 (to M.Q. and I.G.) and a Natural Sciences and Engineering Research Council of Canada Discovery grant (to S.W.G.). We thank all national, state, provincial and regional resource management authorities, including those of province Nord and province Sud of New Caledonia, for permitting collections of material for this research.