The evolution of C4 photosynthesis in Flaveria (Asteraceae): Insights from the Flaveria linearis complex


Stata, Matt et al. (2022), The evolution of C4 photosynthesis in Flaveria (Asteraceae): Insights from the Flaveria linearis complex, Dryad, Dataset,


Flaveria is a leading model for C4 plant evolution due to the presence of a dozen C3-C4 intermediate species, many of which are associated with a phylogenetic complex centered around F. linearis. To investigate C4 evolution in Flaveria, we updated the Flaveria phylogeny and evaluated gas exchange, starch δ13C, and activity of C4 cycle enzymes in 19 Flaveria species and 28 populations within the F. linearis complex. A principal component analysis identified six functional clusters: i) C3, ii) sub-C2, iii) full C2, iv) enriched C2, v) sub-C4, and vi) fully C4 species. The sub-C2 species lacked a functional C4 cycle, while a gradient was present in the C2 clusters from little to modest C4 cycle activity as indicated by δ13C and enzyme activities. Three Yucatan populations of F. linearis had photosynthetic CO2 compensation points equivalent to C4 plants but showed little evidence for an enhanced C4 cycle, indicating they have an optimized C2 pathway that recaptures all photorespired CO2 in the bundle sheath (BS) tissue. All C2 species had enhanced aspartate aminotransferase activity relative to C3 species and most had enhanced alanine aminotransferase activity. These aminotransferases form aspartate and alanine from glutamate and in doing so help return photorespiratory nitrogen (N) from BS to mesophyll cells, preventing glutamate feedback onto photorespiratory N assimilation. Their use requires upregulation of parts of the C4 metabolic cycle to generate carbon skeletons to sustain N return to the mesophyll, and thus could facilitate the evolution of the full C4 photosynthetic pathway.


Trinity (Grabherr et al., 2011) was used to generate de novo transcriptome assemblies for all RNA-seq data downloaded from the NCBI short read archive, as well as for two species sequenced at the Center for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto (see Part 1 of online Supplemental Table S2 for phylogeny species list and data origin). The longest open reading frame (ORF) per locus was identified in the Trinity assemblies for all species using a Python script. These ORFs were translated to amino acids using the program transeq from the EMBOSS package (Rice et al., 2000), and the translations were used for ortholog clustering with OrthoFinder (Emms and Kelly, 2019), along with 20 reference genomes from across the angiosperms (see Part 2 of online Supplemental Table S2). Orthogroups from 20 of the 22 focal taxa in the Flaveriineae and Jaumea were selected, and of these, only orthogroups that were apparently single copy for at least 75% of the taxa present were selected for further analysis. These orthogroups were aligned at the protein level with mafft (Katoh and Standley, 2013), and codon alignments were then generated using Pal2Nal (Suyama et al., 2006). A Python script was used to distinguish between fragmented assemblies and paralogs for species with more than one sequence in a given low-copy orthogroup. If the sequences from a single species in an orthogroup did not overlap by more than 15 bp, they were treated as fragments and joined together. If they did overlap, that sample was rejected from the orthogroup, and if more than 15% of the taxa in an orthogroup were rejected in this way, the entire orthogroup was removed from the dataset. Remaining orthogroups were trimmed with trimAl (Capella-Gutiérrez et al., 2009) using a gap tolerance of 0.5, and consensus sequences were generated using the cons program from the EMBOSS package.
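The fragment-versus-paralog rule described above can be sketched as follows. This is a minimal illustration, not the script used to build the dataset. It assumes each orthogroup is available as aligned, equal-length sequences grouped by species, that overlap is counted as alignment columns in which both sequences carry a residue, and that the 15% threshold is taken relative to the taxa present in the orthogroup. The function names (overlap, merge_fragments, resolve_orthogroup) are hypothetical.

```python
from itertools import combinations

MAX_OVERLAP = 15      # bp threshold stated in the text
MAX_REJECTED = 0.15   # fraction of taxa threshold stated in the text
GAP = "-"


def overlap(a, b):
    """Count alignment columns where both sequences have a residue."""
    return sum(1 for x, y in zip(a, b) if x != GAP and y != GAP)


def merge_fragments(seqs):
    """Join aligned fragments, taking the first non-gap residue per column."""
    merged = []
    for column in zip(*seqs):
        residues = [c for c in column if c != GAP]
        merged.append(residues[0] if residues else GAP)
    return "".join(merged)


def resolve_orthogroup(aligned):
    """Reduce an orthogroup to one sequence per species.

    aligned maps species name -> list of aligned sequences (assumed layout).
    Returns species -> sequence, or None if the orthogroup is dropped.
    """
    kept, rejected = {}, 0
    for species, seqs in aligned.items():
        if len(seqs) == 1:
            kept[species] = seqs[0]
        elif all(overlap(a, b) <= MAX_OVERLAP
                 for a, b in combinations(seqs, 2)):
            kept[species] = merge_fragments(seqs)   # treated as fragments
        else:
            rejected += 1                           # treated as paralogs
    if rejected / len(aligned) > MAX_REJECTED:
        return None
    return kept
```

Merging column by column keeps the joined sequence in the same alignment coordinates, so no re-alignment is needed in this sketch; a real pipeline might re-align after joining.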

For all genomic DNA samples, reads were mapped onto the set of consensus sequences generated using Hisat2 (Kim et al., 2019) with the scoring options --mp 10,8 and --score-min L-0.1 to 0.8, and only considering concordantly mapped pairs. Variant sites were culled and new consensus sequences generated using samtools and bcftools (Danecek et al., 2021). Because bcftools defaults to the reference sequence for regions with insufficient read coverage, a Python script was used to replace regions with fewer than 5 mapped reads with N's in the consensus sequences for each sample. Consensus sequences based on the 14 gDNA samples were added to their respective orthogroup alignments. A concatenated supermatrix and partition information file were generated with a Python script, and ML phylogenetic analysis was run with RAxML-NG with 100 bootstraps (Kozlov et al., 2019), using the GTRGAMMAI model with separate model parameters allowed for all partitions. Our pipeline for repairing fragmented transcript assemblies and identifying single-copy orthogroups yielded 2,639 loci for phylogenomic analysis, totaling over 2.5 million nucleotide sites. Maximum likelihood phylogenetic reconstruction produced a tree with 100% bootstrap support at all nodes, even within groups of infraspecific samples (Fig. 2). A second, coalescent species tree was also inferred from the data using the multispecies coalescent model with the software package Astral (Zhang et al., 2018). The same 2,639 orthologous loci used for the supermatrix-based concatenated tree were used here, with individual gene trees inferred by RAxML-NG (Kozlov et al., 2019).
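Two of the scripted steps above, masking low-coverage positions and building the supermatrix with its partition file, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code. Per-base read depth is assumed to be available as a list of integers with the same length as the consensus (for example, parsed from a depth report, and assuming no length-changing indels were applied). The partition lines follow the "MODEL, name = start-end" layout, and the model string "GTR+I+G" is an assumed stand-in for the GTRGAMMAI model named in the text. Function names are hypothetical.

```python
MIN_DEPTH = 5  # minimum read depth stated in the text


def mask_low_coverage(consensus, depth, min_depth=MIN_DEPTH):
    """Replace bases covered by fewer than min_depth reads with N."""
    if len(consensus) != len(depth):
        raise ValueError("consensus and depth must have the same length")
    return "".join(base if d >= min_depth else "N"
                   for base, d in zip(consensus, depth))


def build_supermatrix(alignments, model="GTR+I+G", missing="-"):
    """Concatenate per-locus alignments and record partition ranges.

    alignments is a list of (locus_name, {taxon: aligned_sequence}) pairs.
    Taxa absent from a locus are padded with the missing character.
    Returns ({taxon: concatenated_sequence}, [partition_line, ...]).
    """
    taxa = sorted({t for _, aln in alignments for t in aln})
    parts = {t: [] for t in taxa}
    partitions, start = [], 1
    for name, aln in alignments:
        length = len(next(iter(aln.values())))
        for t in taxa:
            parts[t].append(aln.get(t, missing * length))
        end = start + length - 1
        # 1-based inclusive coordinates, one line per locus
        partitions.append(f"{model}, {name} = {start}-{end}")
        start = end + 1
    return {t: "".join(p) for t, p in parts.items()}, partitions
```

For example, mask_low_coverage("ACGTAC", [9, 5, 4, 0, 7, 5]) keeps the bases at depth 5 or more and returns "ACNNAC".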

Usage Notes

Sequence files are in FASTA format and trees are Newick. Any software capable of opening these will work.
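For users who prefer to inspect the files programmatically, a minimal sketch using only the Python standard library is given below. It assumes plain FASTA (header lines starting with ">") and simple Newick strings with unquoted tip labels; quoted labels and Newick comments are not handled. The function names are hypothetical, and dedicated libraries will be more robust for real analyses.

```python
import re


def read_fasta(lines):
    """Parse FASTA lines into {record_id: sequence}.

    The record id is the first whitespace-delimited token of the header.
    """
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def newick_tip_labels(newick):
    """Return tip labels from a simple Newick string (unquoted labels only).

    A tip label is taken to be any label directly following "(" or ",".
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)
```

Both functions accept in-memory text, so a file can be passed as read_fasta(open(path)) or newick_tip_labels(open(path).read()), where path is the location of a downloaded file.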