Code and data for: The evolution of ontogenetic decision-making in the wood of a clade of tropical plants
Cite this dataset
Olson, Mark E.; Benítez, Mariana; Lárraga-Ramírez, María Elena; Petrone-Mendoza, Emilio (2023). Code and data for: The evolution of ontogenetic decision-making in the wood of a clade of tropical plants [Dataset]. Dryad. https://doi.org/10.5061/dryad.wstqjq2rx
Abstract
Code and data for reproducing the analyses in the manuscript "The evolution of ontogenetic “decision-making” in the wood of a clade of tropical plants". The data include raw and curated data from cell lineages derived from wood progenitor cells. Python code is provided to generate virtual wood cell lineages using L-systems, to determine the number of words at different k-mer lengths, and to estimate Shannon entropy and Lempel-Ziv values from the coded cell lineages. R code is provided to generate the plots and to fit linear models of the maximum number of words based on the total number of coded cells and the mean cell lineage length.
README: The evolution of ontogenetic decision-making in the wood of a clade of tropical plants.
Description of the data and file structure
This repository contains the data and code used for reproducing the results and figures of the manuscript titled, "The evolution of ontogenetic decision-making in the wood of a clade of tropical plants," by Emilio Petrone-Mendoza, Mariana Benítez, María Elena Lárraga, and Mark E. Olson (UNAM, Mexico).
The repository is divided into:
- Data: data that we generated from the wood cell lineages of species in the Pedilanthus clade of the genus Euphorbia.
  - Raw_data/
  - Cell_files_data
  - words_count_all
  - words_count_morethanone/
- Figures: figures of the manuscript, generated with the scripts and data.
- meta: additional tables and information about the samples.
- scripts: executables for the analysis of cell lineage data, written in Python and R.
Description of the Data folder.
- Data:
  - Raw_data/:
    - 892_edited.txt
    - 896_edited.txt
    - 939_edited.txt
- Raw_data/:
Each file in the Raw_data folder comes from a different sample. The alphanumeric code preceding _edited.txt identifies the sample. The sample_edited.txt files are tab-delimited and contain three columns: the first gives the species of the sample, the second the cell lineage (cell file) number, and the third the sequence of coded cells.
| Species | Cell file number | Cell lineage sequence |
|---|---|---|
| E. calcarata | C1 | FFFFVPFFFPF... |
| E. calcarata | C2 | FFFFFFPFPFP... |
In addition to the cell-to-letter code for wood cells, some cell lineage sequences contain three additional kinds of metadata, with the following syntax and biological meaning:
- Wood cells delimited by parentheses and a ^ character (e.g. ^P). This code represents cells that can be intrusive growth fibers or other cells interrupting the continuous series of cells derived from the coded lineage (see Appendix A for additional information).
- The word Converge- delimited by parentheses, (Converge-). Two lineages that converge into one by anticlinal divisions (2 cell lineages converge into one cell lineage) were marked with (Converge-). Only one of the two derived cell lineages was coded in the same cell file, and it was then processed as a separate cell lineage.
- Hyphens (-) at the beginning of some cell lineages. Some wood cell lineages were not visible from the last-differentiated wood cells at the vascular cambium and instead began internally. When such lineages were identified near other coded cell lineages, we coded their cells starting at the position relative to the other cells.
Data preprocessing
Using awk commands, we removed the first two columns of each file and redirected the output to a new file within the Cell_files_data folder. We used sed commands to remove the metadata coding because we did not analyze these data in the present work.
Here are some details of the commands used for processing the text files:
awk '{print $4}' Data/Raw_data/845_edited.txt > Data/Raw_data/Cell_files_data/845_edited_cells.txt
sed -i 's/(Converge-)//g' Data/Raw_data/P_macrocarpus/EPM13_edited_cells.txt
# To code the cells after a convergence event as a new cell file, use:
sed -Ei 's/\(Converge-\)/&\n/g' 845_edited_cells_NotConverge.txt
We moved the edited files to Data/Cell_files_data. Files with convergence events coded as separate cell lineages were saved in Data/Cell_files_data/ConvergeAsOtherLineage.
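For readers who prefer Python to awk and sed, the preprocessing described above can be sketched as follows. This is a minimal illustration, not code from the repository: the function names are hypothetical, the example row is invented, and the sketch assumes that each row is tab-delimited with the coded sequence in the last column. It handles only the (Converge-) marker, as the sed commands above do.

```python
CONVERGE = "(Converge-)"  # convergence marker described in this README


def extract_sequence(row):
    """Return the coded-cell sequence (last tab-delimited column) of one row."""
    return row.rstrip("\n").split("\t")[-1]


def drop_converge(sequence):
    """Remove every convergence marker, like sed 's/(Converge-)//g'."""
    return sequence.replace(CONVERGE, "")


def split_on_converge(sequence):
    """Treat the cells after each marker as a new lineage (assumed behaviour)."""
    return [part for part in sequence.split(CONVERGE) if part]


# Invented example row: species, cell file number, coded sequence.
row = "E. calcarata\tC1\tFFFFVP(Converge-)FFFPF\n"
sequence = extract_sequence(row)
print(drop_converge(sequence))      # FFFFVPFFFPF
print(split_on_converge(sequence))  # ['FFFFVP', 'FFFPF']
```

Unlike the second sed command, which keeps the marker and inserts a line break after it, `split_on_converge` discards the marker while splitting.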
Workflow
Shared functions are defined in the wordanalysis.py Python script. The script has functions to count the number and type of cells, measure lineage lengths, count words, estimate Euclidean distance, and compute Lempel-Ziv compression and Shannon entropy values. The l-system.py script generates the virtual wood cell lineages. Before starting the cell lineage analysis, we ran l-system.py so that the virtual cell lineages were included with the rest of the Pedilanthus lineages.
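To illustrate how an L-system can produce a virtual cell lineage, here is a minimal sketch of a deterministic, context-free L-system. It is not the code in l-system.py: the function name and the rewriting rules are hypothetical and were chosen only to show the mechanism, in which every symbol is rewritten in parallel at each iteration.

```python
def lsystem(axiom, rules, iterations):
    """Rewrite every symbol in parallel, `iterations` times.

    Symbols without a rule are copied unchanged.
    """
    current = axiom
    for _ in range(iterations):
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


# Hypothetical rules; the rules used in the repository may differ.
RULES = {"F": "FP", "P": "F"}

for step in range(5):
    print(step, lsystem("F", RULES, step))
# 0 F
# 1 FP
# 2 FPF
# 3 FPFFP
# 4 FPFFPFPF
```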
The workflow can be divided into two parts. The first part consists of Python scripts that return data frames with values from the cell lineage analysis, such as word counts, homogeneity index values, or cell lineage lengths. The second part consists of R scripts that produce figures and perform statistical tests.
Cell lineage analysis
The order in which we analyzed data is the following:
First part. Python scripts:
To determine the length of cell lineages and the cell type frequencies, we ran count_longitudes.py, which generates a file named cell_lengths_notConverge.csv containing the length of each cell lineage from each individual. This file is placed in Data/. The script also returns a file named cell_lengths_withoutR.csv with the lengths of the lineages excluding ray cells.
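The two quantities computed in this step can be sketched as follows. This is an illustration under stated assumptions, not the code in count_longitudes.py: the function names are hypothetical, the example lineage is invented, and the use of the letter R for ray cells is inferred from the file name cell_lengths_withoutR.csv rather than stated in this README.

```python
from collections import Counter


def lineage_length(sequence, exclude=""):
    """Number of coded cells, ignoring any letters listed in `exclude`."""
    return sum(1 for cell in sequence if cell not in exclude)


def cell_type_frequencies(sequence):
    """Relative frequency of each cell-type letter in one lineage."""
    if not sequence:
        return {}
    counts = Counter(sequence)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


lineage = "FFRFVPFR"  # invented example
print(lineage_length(lineage))               # 8
print(lineage_length(lineage, exclude="R"))  # 6 (assumes R codes ray cells)
print(cell_type_frequencies(lineage))
```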
To determine the total number of words at different k-mer lengths, we ran the count_transitions.py script, which counts the number of cell types, the number of words, and the number of words appearing more than once (for k-mer lengths from 2 to 34). The word counts at each k-mer length are written to files named wordcounts*n*.csv, where n is the k-mer length. These files are located in Data/word_counts_all and in Data/word_counts_morethanone. Additionally, the script generates two files summarizing the total number of words observed at each k-mer length, from 2 to 34, for each individual: one for all the words and the other for the words appearing more than once. The files are named wordcounts_all.csv.
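Word counting at a given k-mer length can be sketched with a sliding window over each lineage. This is a minimal illustration, not the code in count_transitions.py: the function names are hypothetical, and the use of overlapping windows that advance one cell at a time is an assumption, since this README does not say how words are delimited. The example lineages are the truncated prefixes shown in the table above.

```python
from collections import Counter


def count_words(sequences, k):
    """Count every overlapping word (k-mer) of length k across the lineages."""
    words = Counter()
    for sequence in sequences:
        for start in range(len(sequence) - k + 1):
            words[sequence[start:start + k]] += 1
    return words


def summarize(sequences, k_min=2, k_max=34):
    """For each k: (k, distinct words, distinct words seen more than once)."""
    rows = []
    for k in range(k_min, k_max + 1):
        words = count_words(sequences, k)
        repeated = sum(1 for n in words.values() if n > 1)
        rows.append((k, len(words), repeated))
    return rows


lineages = ["FFFFVPFFFPF", "FFFFFFPFPFP"]  # truncated examples
for k, n_all, n_repeated in summarize(lineages, k_max=4):
    print(k, n_all, n_repeated)
```

Lineages shorter than k contribute no words, so the counts fall to zero once k exceeds the longest lineage.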
To determine the Lempel-Ziv compression values and the Shannon entropy values of the cell lineages from our individuals, we ran the wordcomplexity.py script, which generates one file named shannonentropy.csv and another named lemplzivbyfile.csv.
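Both measures can be sketched briefly. This is an illustration under stated assumptions, not the code in wordcomplexity.py: the function names are hypothetical, the entropy is computed over single cell-type letters, and the Lempel-Ziv value is approximated by the number of phrases in an LZ78-style parsing. The repository may use a different Lempel-Ziv variant or compute entropy over words.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy in bits per symbol: H = sum(p * log2(1 / p))."""
    total = len(sequence)
    counts = Counter(sequence)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def lz78_phrase_count(sequence):
    """Number of phrases in an LZ78-style parsing of the sequence.

    A phrase ends as soon as it has not been seen before. Repetitive
    sequences yield fewer phrases than irregular ones of equal length.
    """
    phrases = set()
    current = ""
    for symbol in sequence:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            current = ""
    # A trailing phrase that was already seen still counts once.
    return len(phrases) + (1 if current else 0)


print(shannon_entropy("FFPP"))    # 1.0
print(lz78_phrase_count("FFFF"))  # 3, phrases: F, FF, F
print(lz78_phrase_count("FPFP"))  # 3, phrases: F, P, FP
```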
To determine the homogeneity index values for each cell lineage, we ran the homogeneity_index.py script, which returns a data frame with the values for all cell lineages and individuals. The file is named homogentiy_index.csv and is located in the Data folder.
Second part. R scripts:
The second part performs statistical tests and generates figures. The following list describes what each script does:
- information_theoryMetrics.R: we make plots of the Lempel-Ziv compression and Shannon entropy values.
- homogenity_index.r: we make plots of the homogeneity indexes.
- distance_plots.R: we make plots related to word counts, cell lengths, and dissimilarity analysis. We compute the Bray-Curtis dissimilarity metric using the vegan package and run an NMDS (non-metric multidimensional scaling).
- geographic_info.R: in this script we extract climate information based on the geographic coordinates of the sampled individuals.
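The Bray-Curtis dissimilarity used in distance_plots.R is computed there with the vegan package in R. For readers who want to see what the metric measures, the following Python sketch computes it for two vectors of non-negative counts. The function name and the example counts are hypothetical, and treating each individual as a dictionary of word counts is an assumption about the input.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i).

    Returns 0.0 for identical count vectors and 1.0 when no word is shared.
    """
    words = set(counts_a) | set(counts_b)
    difference = sum(abs(counts_a.get(w, 0) - counts_b.get(w, 0)) for w in words)
    total = sum(counts_a.values()) + sum(counts_b.values())
    return difference / total if total else 0.0


# Invented word counts for two individuals.
individual_a = {"FF": 6, "FP": 2, "PF": 2}
individual_b = {"FF": 2, "FP": 4, "PP": 4}
print(bray_curtis(individual_a, individual_b))  # 0.6
```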
Usage notes
R, Python
Funding
Consejo Nacional de Humanidades, Ciencias y Tecnologías, Award: A1-S-26934