Data from: Horizontal transfer of an adaptive chimeric photoreceptor from bryophytes to ferns
Li, Fay-Wei, Duke University
Published Mar 24, 2015 on Dryad.
https://doi.org/10.5061/dryad.fn2rg
Cite this dataset
Li, Fay-Wei (2015). Data from: Horizontal transfer of an adaptive chimeric photoreceptor from bryophytes to ferns [Dataset]. Dryad. https://doi.org/10.5061/dryad.fn2rg
Abstract
Ferns are well known for their shade-dwelling habits. Their ability to thrive under low-light conditions has been linked to the evolution of a novel chimeric photoreceptor—neochrome—that fuses red-sensing phytochrome and blue-sensing phototropin modules into a single gene, thereby optimizing phototropic responses. Despite being implicated in facilitating the diversification of modern ferns, the origin of neochrome has remained a mystery. We present evidence for neochrome in hornworts (a bryophyte lineage) and demonstrate that ferns acquired neochrome from hornworts via horizontal gene transfer (HGT). Fern neochromes are nested within hornwort neochromes in our large-scale phylogenetic reconstructions of phototropin and phytochrome gene families. Divergence date estimates further support the HGT hypothesis, with fern and hornwort neochromes diverging 179 Mya, long after the split between the two plant lineages (at least 400 Mya). By analyzing the draft genome of the hornwort Anthoceros punctatus, we also discovered a previously unidentified phototropin gene that likely represents the ancestral lineage of the neochrome phototropin module. Thus, a neochrome originating in hornworts was transferred horizontally to ferns, where it may have played a significant role in the diversification of modern ferns.
Usage notes
IGPD_alignment
Alignment of land plant imidazoleglycerol-phosphate dehydratase (IGPD). Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
IGPD_ML_tree_raxml
The maximum likelihood tree inferred from "IGPD_alignment.fas", using RAxML with 100 random starting trees. We partitioned the data by codon position, with each partition given a GTR+Γ+I model as suggested by PartitionFinder under the Akaike Information Criterion. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
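Partitioning by codon position, as described above, is usually passed to RAxML as a small partition file. The sketch below writes the three lines of such a file in the common RAxML style. The partition names and the alignment length are illustrative assumptions, not values taken from this dataset, and the substitution model itself is chosen on the RAxML command line rather than in this file.

```python
# Minimal sketch: build RAxML-style partition definitions for codon positions 1-3.
# The "codon" prefix and the example length are hypothetical, not from the dataset.

def codon_partition_lines(alignment_length, prefix="codon"):
    """Return one partition line per codon position, e.g. 'DNA, codon1 = 1-900\\3'."""
    return [
        f"DNA, {prefix}{pos} = {pos}-{alignment_length}\\3"
        for pos in (1, 2, 3)
    ]


if __name__ == "__main__":
    # Print the partition file for a hypothetical 900-column alignment.
    print("\n".join(codon_partition_lines(900)))
```

Each line selects every third column starting at column 1, 2 or 3, which assumes the alignment starts in frame.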
IGPD_MLBS_tree_raxml
The maximum likelihood bootstrapping trees from "IGPD_alignment.fas", using RAxML (1000 replicates). We partitioned the data by codon position, with each partition given a GTR+Γ+I model as suggested by PartitionFinder under the Akaike Information Criterion. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHOT_alignment
Alignment of plant phototropin and neochrome, containing 163 sequences from 106 species. We only included the conserved domains (i.e., LOV1, LOV2 and STK); the domain boundaries were identified by querying each scaffold against the NCBI Conserved Domain Database. Each domain was separately aligned (based on the amino acid sequences) using Muscle, and then concatenated. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHOT_ML_tree_garli
The maximum likelihood tree inferred from "PHOT_alignment.fas", using Garli with genthreshfortopoterm set to 1,000,000 and 8 independent runs. We partitioned the data by codon position, with each partition given a GTR+Γ+I model as suggested by PartitionFinder under the Akaike Information Criterion. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHOT_ML_tree_codonPhyML
The maximum likelihood tree inferred from "PHOT_alignment.fas", using CodonPhyML. We used the GY model with four categories of non-synonymous/synonymous substitution rate ratios drawn from the discrete gamma distribution, and codon frequencies were estimated from the data under the F3X4 model. The tree topology search was done using the NNI approach, and branch support was estimated using the SH-like aLRT method. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHOT_ML_tree_nhPhyml
The maximum likelihood tree inferred from "PHOT_alignment.fas", using nhPhyML. The analysis was carried out with ten discrete categories of GC equilibrium frequencies, and the required starting tree was the best tree from the Garli analysis ("PHOT_ML_tree_garli.tre"). Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHOT_MLBS_tree_raxml
The maximum likelihood bootstrapping trees from "PHOT_alignment.fas", using RAxML (1000 replicates). We partitioned the data by codon position, with each partition given a GTR+Γ+I model as suggested by PartitionFinder under the Akaike Information Criterion. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHOT_MLBS_tree_nhPhyml
The maximum likelihood bootstrapping trees from "PHOT_alignment.fas", using nhPhyML (1000 replicates). The analysis was carried out with ten discrete categories of GC equilibrium frequencies, and for each replicate, RAxML was used to input the starting tree. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHOT_BI_con_tree_MrBayes
The 50% majority consensus tree from the MrBayes run (25% of the total generations were discarded as burn-in), based on "PHOT_alignment.fas". The analysis was carried out with two independent MCMC runs, four chains each, and trees sampled every 1000 generations. Substitution parameters were unlinked and the rate prior was set to vary among partitions. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHOT_BI_chronogram_BEAST
The chronogram of plant phototropin and neochrome, inferred from "PHOT_alignment.fas", using BEAST. A total of 15 tmrca priors were employed as the calibration points (see SI Appendix), and a birth-death speciation prior was used as the tree prior. We used the uncorrelated relaxed-clock model with rates drawn from a lognormal distribution. A starting tree was first estimated by r8s and provided to BEAST to initiate the run. Two independent MCMC runs were carried out and the output was inspected in Tracer to ensure convergence and mixing (effective sample sizes all > 200). The trees from the two runs were combined in LogCombiner with a 25% burn-in and summarized in TreeAnnotator. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
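The burn-in and combining step described above can be pictured as follows. This is only a conceptual sketch of dropping the first 25% of samples from each run and pooling the remainder; it is not LogCombiner's code, and the sample lists are hypothetical placeholders for posterior trees.

```python
# Conceptual sketch of "25% burn-in, then combine runs".
# The function names and the toy samples are hypothetical.

def discard_burnin(samples, fraction=0.25):
    """Drop the leading fraction of an ordered list of MCMC samples."""
    return samples[int(len(samples) * fraction):]


def combine_runs(runs, fraction=0.25):
    """Apply the same burn-in to every run and pool what is left."""
    combined = []
    for run in runs:
        combined.extend(discard_burnin(run, fraction))
    return combined


if __name__ == "__main__":
    run1 = [f"run1_tree{i}" for i in range(8)]
    run2 = [f"run2_tree{i}" for i in range(8)]
    print(combine_runs([run1, run2]))
```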
PHY_alignment
Alignment of plant phytochrome and neochrome, containing 139 sequences from 76 species. We only included the conserved domains (i.e., PAS, GAF, PHY, PAS repeats, HisKA and HATPase); the domain boundaries were identified by querying each scaffold against the NCBI Conserved Domain Database. Each domain was separately aligned (based on the amino acid sequences) using Muscle, and then concatenated. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHY_ML_tree_garli
The maximum likelihood tree inferred from "PHY_alignment.fas" (translated into amino acids), using Garli with genthreshfortopoterm set to 1,000,000 and 8 independent runs. Using ProtTest (65), JTT + F was found to be the best empirical substitution model under the Akaike Information Criterion. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHY_ML_tree_codonPhyML
The maximum likelihood tree inferred from "PHY_alignment.fas", using CodonPhyML. We used the GY model with four categories of non-synonymous/synonymous substitution rate ratios drawn from the discrete gamma distribution, and codon frequencies were estimated from the data under the F3X4 model. The tree topology search was done using the NNI approach, and branch support was estimated using the SH-like aLRT method. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHY_MLBS_tree_raxml
The maximum likelihood bootstrapping trees from "PHY_alignment.fas" (translated into amino acids), using RAxML (1000 replicates) under the JTT + F model. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
PHY_BI_con_tree_MrBayes
The 50% majority consensus tree from the MrBayes run (25% of the total generations were discarded as burn-in), based on "PHY_alignment.fas" (translated into amino acids). The analysis was carried out with two independent MCMC runs, four chains each, trees sampled every 1000 generations, and the JTT + F model. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
NEO_alignment
Alignment of fern and hornwort neochrome.
NEO_ML_tree_garil
The maximum likelihood tree inferred from "NEO_alignment.fas", using Garli with genthreshfortopoterm set to 1,000,000 and 8 independent runs. We partitioned the data by codon position, and GTR+Γ+I, GTR+Γ+I, GTR+I models were applied to each codon position respectively. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
NEO_ML_tree_pos12_garli
The maximum likelihood tree inferred from "NEO_alignment.fas" (third codon position excluded), using Garli with genthreshfortopoterm set to 1,000,000 and 8 independent runs. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
NEO_ML_tree_pos3_garli
The maximum likelihood tree inferred from "NEO_alignment.fas" (first and second codon positions excluded), using Garli with genthreshfortopoterm set to 1,000,000 and 8 independent runs. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
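The two position-specific analyses above (pos12 and pos3) each use a subset of alignment columns. A minimal sketch of that column selection is shown here, assuming the alignment is in frame from its first column; the function name and the toy sequence are illustrative only and do not come from the dataset.

```python
# Minimal sketch: keep only selected codon positions from an in-frame aligned sequence.
# Assumes column 1 of the alignment is codon position 1 (an assumption, not a
# documented property of NEO_alignment.fas).

def codon_positions(aligned_seq, keep):
    """Return the columns whose codon position (1, 2 or 3) is in `keep`."""
    return "".join(
        base for i, base in enumerate(aligned_seq) if (i % 3) + 1 in keep
    )


if __name__ == "__main__":
    toy = "ATGGCTAAA"  # hypothetical sequence: codons ATG GCT AAA
    print(codon_positions(toy, {1, 2}))  # first and second positions only
    print(codon_positions(toy, {3}))     # third positions only
```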
NEO_ML_tree_codonPhyML
The maximum likelihood tree inferred from "NEO_alignment.fas", using CodonPhyML. We used the GY model with four categories of non-synonymous/synonymous substitution rate ratios drawn from the discrete gamma distribution, and codon frequencies were estimated from the data under the F3X4 model. The tree topology search was done using the NNI approach, and branch support was estimated using the SH-like aLRT method. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
NEO_MLBS_tree_pos12_raxml
The maximum likelihood bootstrapping trees from "NEO_alignment.fas", using RAxML (1000 replicates; third codon position excluded). Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
NEO_MLBS_tree_pos3_raxml
The maximum likelihood bootstrapping trees from "NEO_alignment.fas", using RAxML (1000 replicates; first and second codon positions excluded). Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
NEO_BI_con_tree_MrBayes
The 50% majority consensus tree from the MrBayes run (25% of the total generations were discarded as burn-in), based on "NEO_alignment.fas". The analysis was carried out with two independent MCMC runs, four chains each, and trees sampled every 1000 generations. We partitioned the data by codon position, and GTR+Γ+I, GTR+Γ+I, GTR+I models were applied to each codon position respectively. Substitution parameters were unlinked and the rate prior was set to vary among partitions. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.
BlueDevil
A Python script to extract gene homologs from 1KP transcriptomes. Sequences for the gene of interest are queried by tBLASTn, and the significant hits to transcriptome scaffolds are extracted. For each scaffold, the best open reading frame is identified, and the sequence is translated into amino acids and then queried by BLASTp against the NCBI non-redundant protein database (nr). Scaffolds are discarded if they do not match homologs in the nr database at an e-value threshold of <0.001. The filtered scaffolds from the SOAPdenovo and SOAPdenovo-Trans assemblies are then merged using CAP3.
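The open-reading-frame step described above might look roughly like the sketch below. It treats "best" as the longest stop-free stretch across all six reading frames, which is an assumption; BlueDevil.py may apply a different criterion. The function names are illustrative and are not taken from the script.

```python
# Minimal sketch: longest stop-free translation over six reading frames.
# "Best ORF = longest" is an assumption, not BlueDevil.py's documented rule.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X', a partial last codon is dropped."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def best_orf(scaffold):
    """Return (frame, protein); frame is +1..+3 or -1..-3, ties keep the first frame found."""
    scaffold = scaffold.upper()
    best = (0, "")
    for strand, seq in ((1, scaffold), (-1, reverse_complement(scaffold))):
        for offset in range(3):
            for segment in translate(seq[offset:]).split("*"):
                if len(segment) > len(best[1]):
                    best = (strand * (offset + 1), segment)
    return best


if __name__ == "__main__":
    print(best_orf("ATGAAATTTTAG"))
```

The translated protein returned here is what would then be passed on to the BLASTp filtering step.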
bluedevil_settings
The configuration file for BlueDevil.py
DomainDivider_phot
A Python script to build the phototropin alignment based on conserved domains (i.e. LOV1, LOV2 and STK). The search results from the NCBI Conserved Domain Database are parsed to identify domain boundaries and extract domain sequences. Each domain is separately aligned (based on the amino acid sequences) using Muscle, and then concatenated.
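The extract-then-concatenate logic described above can be sketched as follows. The boundary dictionary, taxon names and sequences are hypothetical, the per-domain alignment with Muscle is left out, and padding taxa that lack a domain with gaps is an assumption rather than documented behavior of the DomainDivider scripts.

```python
# Minimal sketch: slice sequences into domains, then join per-domain alignments.
# All names and data here are hypothetical; the Muscle alignment step is omitted.

def extract_domains(sequence, boundaries):
    """Cut one sequence into named domains; boundaries are 1-based inclusive (start, end)."""
    return {
        name: sequence[start - 1:end] for name, (start, end) in boundaries.items()
    }


def concatenate_alignments(domain_alignments, domain_order):
    """Join per-domain alignments ({domain: {taxon: aligned_seq}}) side by side.

    Taxa missing from a domain alignment are padded with gaps (an assumption).
    """
    taxa = sorted({taxon for aln in domain_alignments.values() for taxon in aln})
    combined = {taxon: "" for taxon in taxa}
    for domain in domain_order:
        aln = domain_alignments[domain]
        width = len(next(iter(aln.values())))
        for taxon in taxa:
            combined[taxon] += aln.get(taxon, "-" * width)
    return combined


if __name__ == "__main__":
    print(extract_domains("MAAAALLLLKKKK",
                          {"LOV1": (2, 5), "LOV2": (6, 9), "STK": (10, 13)}))
    aligned = {
        "LOV1": {"sp1": "AA-A", "sp2": "AAAA"},
        "LOV2": {"sp1": "LL"},
        "STK": {"sp1": "KK", "sp2": "K-"},
    }
    print(concatenate_alignments(aligned, ["LOV1", "LOV2", "STK"]))
```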
DomainDivider_phy
A Python script to build the phytochrome alignment based on conserved domains (i.e. PAS, GAF, PHY, PAS repeats, HisKA and HATPase). The search results from the NCBI Conserved Domain Database are parsed to identify domain boundaries and extract domain sequences. Each domain is separately aligned (based on the amino acid sequences) using Muscle, and then concatenated.
DomainDivider_neo
A Python script to extract conserved domains in neochrome (i.e. PAS, GAF, PHY, LOV1, LOV2, and STK). The search results from the NCBI Conserved Domain Database are parsed to identify domain boundaries and extract domain sequences.
Readme
NEO_MLBS_tree_raxml
The maximum likelihood bootstrapping trees from "NEO_alignment.fas", using RAxML (1000 replicates). We partitioned the data by codon position, and GTR+Γ+I, GTR+Γ+I, GTR+I models were applied to each codon position respectively. Alphanumeric codes following species names are the four-letter 1KP transcriptome identifiers, Genbank accessions or both.