Data from: Experimental validation of genome-environment associations in Arabidopsis
Data files
Oct 09, 2025 version files (170.47 KB)
- Archive.zip: 160.48 KB
- README.md: 9.99 KB
Abstract
This dataset supports an experimental validation study of candidate loci identified by common genome-environment association (GEA) approaches in Arabidopsis thaliana. Using T-DNA knockout lines, we tested 42 genes identified from three moisture-related GEA studies. A drought screen was conducted to assess the effects of genotype, treatment, and genotype-by-treatment interaction (GxTrt) on important performance and fitness traits under drought. A follow-up milder drought experiment and an overnight freezing experiment were also conducted on lines showing significant GxTrt for fitness and flowering time in the main drought screen. In addition to the raw data measured during the experiments, this dataset contains the scripts we used to perform the data analyses.
https://doi.org/10.5061/dryad.qrfj6q5s4
Description of the data and file structure
Trait data were measured by hand from controlled laboratory experiments. Water use efficiency was estimated from dried leaf materials. Ecophysiological data were measured on live plant leaves using a LI-600. All genetic data used in the study are from the published 1001 Genomes dataset (https://1001genomes.org/data/GMI-MPI/releases/v3.1/).
Files and scripts
The Archive.zip file contains an R script and three folders, each providing the raw data and scripts used for analyzing the main drought screen, follow-up experiments, and allelic variation. The file types include:
- .csv: Trait data records
- .R/.r: R scripts, run under R version 4.3.3
- .sh: Shell scripts, run under Linux
Input files required by the scripts, if not provided, can be downloaded or generated by following the instructions provided in the annotations. If scripts in the same folder are organized in numerical order, ensure you follow this sequence, as outputs from earlier scripts may be required as inputs for subsequent scripts.
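As an illustration of the ordering rule above, the following minimal Python sketch sorts numbered scripts by their numeric prefix. It is not part of the archive (whose scripts are written in R and shell); the helper names `numeric_prefix` and `ordered_scripts` are hypothetical, and the example file names are taken from the 3_Allelic_variation listing in this README.

```python
import re
from pathlib import Path


def numeric_prefix(name):
    """Return the leading integer of a file name such as '2_x.r', or None."""
    match = re.match(r"(\d+)_", name)
    return int(match.group(1)) if match else None


def ordered_scripts(names):
    """Keep only numbered R/shell scripts and sort them by numeric prefix."""
    numbered = [
        n for n in names
        if numeric_prefix(n) is not None
        and Path(n).suffix.lower() in {".r", ".sh"}
    ]
    return sorted(numbered, key=numeric_prefix)


# File names as listed for the 3_Allelic_variation folder, deliberately shuffled.
example = [
    "7_LSD1_GEMMA.sh",
    "2_TajimaD_WRKY38.r",
    "1_SNP_data_processing.sh",
    "6_LSD1_GEMMA_input_prep.r",
]

for script in ordered_scripts(example):
    print(script)
```

Sorting on the integer prefix rather than on the raw string keeps the order correct even if a folder ever contains ten or more numbered scripts.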
0_ChooseCandidateGenes.R
R script used to generate gene candidates for drought screen from five genome-environment association (GEA) lists across three GEA studies (for details, see script annotations).
Folder 1: 1_Main_screen
LaskyJ Lasky_tDNA2017 CN 1219_Samples.csv
Isotope data (13C, 15N, C content, N content) of Columbia-0, wrky38, and lsd1 based on leaf samples from the main drought screen.
- Plant.ID: Unique ID for each plant in the experiment
- Knockout.line: Plant Genotype
- Tray#: Tray number
- PositionTray: Position ID in each tray
- Trt: Treatments. WW represents well-watered. D represents drought
- 13CVPDB: 13C isotope concentration (‰)
- Total.C(miug): Carbon content in sample (mg)
- 15NAir: 15N isotope concentration (‰)
- TotalN(miug): Nitrogen content in sample (mg)
tDNA2017_DATA_8_7_19_used.csv
Trait data measured from the main drought screen. Missing values are left blank. Knockout lines 'Col for P S revision', 'Ler', 'LCN 5-15', 'LCN 5-18', 'LCN 5-19', 'CVI', 'SALK_126996', 'SALK_132270', 'SALK_027055', and 'SALK_142177C' are not part of this study, so their trait values are left blank and they are filtered out in the script.
'dummy cone-tainer 1', 'dummy cone-tainer 2', 'dummy cone-tainer 3', 'dummy cone-tainer 4', and 'dummy cone-tainer 5' held soil moisture probes, so no traits were measured from these pots. Otherwise, a blank cell in GermDate, in FlowerDate or any other fitness measurement, or in Infl_Length_cm means, respectively, that the seed did not germinate, that the plant did not flower and reproduce, or that the plant had no identifiable main inflorescence (i.e., it had multiple inflorescences and the main one could not be determined).
- Accession: Arabidopsis gene identifier of the knockout gene
- Knockout line: SALK line ID of the knockout line
- Plant.ID: Unique ID for each plant in the experiment
- Tray: Tray number
- PositionTray: Position ID in each tray
- Trt: Treatments. WW represents well-watered. D represents drought
- GermDate: Date of germination
- FlowerDate: Date of flowering
- Infl_Length_cm: Main inflorescence length (cm)
- Total_Num_Pods: Silique number
- Pod1-Pod6: Silique length (mm)
- Infl_wt_g: Inflorescence weight (g)
- Rosette_wt_g: Rosette weight (g)
- Leaf_Temp: Leaf temperature (°C)
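The filtering described for tDNA2017_DATA_8_7_19_used.csv can be sketched as follows. This is a minimal Python illustration and not the authors' R code: it assumes the knockout-line column header is exactly "Knockout line" (the real header may differ, for example "Knockout.line"), the helper names are hypothetical, and the demo rows, including the line name "example_line", are invented placeholders rather than real data.

```python
import csv
import io

# Lines listed in this README as not belonging to the study.
EXCLUDED_LINES = {
    "Col for P S revision", "Ler", "LCN 5-15", "LCN 5-18", "LCN 5-19", "CVI",
    "SALK_126996", "SALK_132270", "SALK_027055", "SALK_142177C",
}


def keep_row(row, line_column="Knockout line"):
    """Drop excluded lines and the 'dummy cone-tainer' probe pots."""
    line = (row.get(line_column) or "").strip()
    if line in EXCLUDED_LINES:
        return False
    if line.startswith("dummy cone-tainer"):
        return False
    return True


def blank_to_none(row):
    """Represent blank cells (missing values) explicitly as None."""
    return {key: ((value or "").strip() or None) for key, value in row.items()}


# Invented demo table; the column subset and values are placeholders.
demo = io.StringIO(
    "Knockout line,Trt,Total_Num_Pods\n"
    "example_line,WW,12\n"
    "Ler,D,\n"
    "dummy cone-tainer 1,D,\n"
    "example_line,D,\n"
)

rows = [blank_to_none(r) for r in csv.DictReader(demo) if keep_row(r)]
print(rows)
```

Converting blanks to `None` instead of zero keeps "not measured" distinct from a measured value of zero, which matters for traits such as silique number.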
1_Main_screen_analyses.R
R script used to fit linear mixed models (LMMs) comparing Columbia-0 (Col) with each of the 44 T-DNA knockout lines from the main drought screen. Requires the .csv files in the same folder as input.
Folder 2: 2_Follow-up
Folder: LI-COR_tracking
This folder contains output files generated by the LI-COR LI-600 porometer/fluorometer over 10 consecutive days, as well as the R script licor_summary.R, which is used to analyze changes in stomatal conductance (gsw) and Fv/Fm using LMMs. The output files are named following the generic format "YL_TIMEPOINTYuxin_LuoDATExxxxx.csv", where TIMEPOINT indicates whether the measurement was taken before watering (light_pre_wt), after watering (light_aft_wt), or after all lights had been off for 30 minutes (Dark). Two measurements (light_pre_wt and Dark) were taken on non-watering days, and three (light_pre_wt, light_aft_wt, and Dark) on watering days. In each raw data table, licor_summary.R uses only column J (gsw: stomatal conductance, mol/m2/s) and column Y (Fv/Fm).
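The file-naming and column conventions for the LI-COR output can be sketched in Python as below. This is only an illustration under stated assumptions, not a reimplementation of licor_summary.R: the helper names are hypothetical, the example file name is invented, and the sketch assumes a data row has already been split into cells (any metadata or header rows in the real export files are not handled).

```python
TIMEPOINTS = ("light_pre_wt", "light_aft_wt", "Dark")


def timepoint_of(filename):
    """Return the timepoint label embedded in a file name, or None."""
    lowered = filename.lower()
    for label in TIMEPOINTS:
        if label.lower() in lowered:
            return label
    return None


def column_index(letter):
    """Convert a spreadsheet column letter (A, ..., Z, AA, ...) to a 0-based index."""
    index = 0
    for char in letter.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def gsw_and_fvfm(row):
    """Pick column J (gsw) and column Y (Fv/Fm) from one row of cells."""
    return row[column_index("J")], row[column_index("Y")]


# Invented file name and a dummy 26-cell row whose cells hold their own index.
print(timepoint_of("YL_light_aft_wt_example.csv"))
print(gsw_and_fvfm([str(i) for i in range(26)]))
```

Addressing the two columns by spreadsheet letter mirrors how they are described here; reading by header name would be more robust if the exported headers are known.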
Final_measurement_table_summary.csv
Trait data measured from the follow-up drought experiment. Missing values are left blank. Plants that did not germinate or survive have no values in any of the trait columns. The rest of the plants were either used for a destructive harvest or a final harvest. Destructively harvested plants only have values in columns Rosette_saturated_mg, Rosette_fresh_mg, Rosette_dry_mg, Leaf_dry_mg, and Leaf_area_mm2. Plants that survived until the final harvest only have values in the remaining columns.
- Column A: Plant ID. The plant ID consists of genotype (Col, wrky38, or lsd1) + treatment (F for freezing) + individual ID (integer).
- Pod_number: Silique number
- Branch_N: Number of side branches on the main inflorescence.
- Pod_1-Pod_12: Silique length (cm)
- Height_cm: Main inflorescence length (cm)
- GerminationDate: Date of germination
- FloweringDate: Date of flowering
- Rosette_saturated_mg: Rosette weight (mg) measured after soaking the rosette in distilled water overnight.
- Rosette_fresh_mg: Rosette weight (mg) measured right after harvest.
- Rosette_dry_mg: Rosette weight (mg) measured after the rosette was completely dried in the oven.
- Leaf_dry_mg: Leaf weight (mg) measured after leaves were completely dried in the oven.
- Leaf_area_mm2: Leaf area (mm2) estimated from scanned pictures using ImageJ.
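The rule for telling harvest types apart in Final_measurement_table_summary.csv can be sketched as follows. This is a hedged Python illustration, not the authors' code: the function names are hypothetical, the ID column header "PlantID" is an assumption (this README only calls it "Column A"), and the demo rows are invented.

```python
# Columns that, per this README, are filled only for destructively harvested plants.
DESTRUCTIVE_COLUMNS = [
    "Rosette_saturated_mg", "Rosette_fresh_mg", "Rosette_dry_mg",
    "Leaf_dry_mg", "Leaf_area_mm2",
]


def has_value(row, column):
    """True if the cell exists and is not blank."""
    return bool((row.get(column) or "").strip())


def harvest_type(row, id_column="PlantID"):
    """Classify a row as 'no data', 'destructive', or 'final'."""
    trait_columns = [c for c in row if c != id_column]
    if not any(has_value(row, c) for c in trait_columns):
        return "no data"  # did not germinate or did not survive
    if any(has_value(row, c) for c in DESTRUCTIVE_COLUMNS):
        return "destructive"
    return "final"


# Invented rows with a reduced set of columns.
demo_rows = [
    {"PlantID": "a", "Pod_number": "", "Rosette_dry_mg": ""},
    {"PlantID": "b", "Pod_number": "", "Rosette_dry_mg": "35.2"},
    {"PlantID": "c", "Pod_number": "18", "Rosette_dry_mg": ""},
]
print([harvest_type(r) for r in demo_rows])
```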
LuoY 230621 CN Tray1-YL P137374.csv
Isotope data (13C, 15N, C content, N content) of Col, wrky38, and lsd1 based on leaf samples from the follow-up drought experiment.
- SampleID: Sample ID, consisting of genotype (Col, wrky38, or lsd1) + treatment (WW for well-watered, D for drought) + individual ID (integer). For example: Col_WW_2.
- 13CVPDB: 13C isotope concentration (‰)
- TotalC: Carbon content in sample (mg)
- d15NAir: 15N isotope concentration (‰)
- TotalN: Nitrogen content in sample (mg)
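The SampleID convention can be unpacked with a short parser. The Python sketch below assumes the three parts are separated by underscores exactly as in the example Col_WW_2; the names `Sample` and `parse_sample_id` are hypothetical and not part of the archive.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    genotype: str    # Col, wrky38, or lsd1
    treatment: str   # WW (well-watered) or D (drought)
    individual: int  # individual ID


def parse_sample_id(sample_id):
    """Split an ID such as 'Col_WW_2' into genotype, treatment, and individual."""
    parts = sample_id.strip().split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected SampleID format: {sample_id!r}")
    genotype, treatment, individual = parts
    if treatment not in {"WW", "D"}:
        raise ValueError(f"unknown treatment code: {treatment!r}")
    return Sample(genotype, treatment, int(individual))


print(parse_sample_id("Col_WW_2"))
```

Raising an error on unexpected formats, rather than guessing, makes mislabeled samples visible before they enter an analysis.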
Follow-up_drought.R
R script used to fit linear mixed models (LMMs) comparing Col with wrky38 or lsd1 in the follow-up intermediate drought (ID) experiment. Requires the .csv files in the same folder as input.
Freezing_measurements_summary.csv
Trait data measured from the follow-up freezing experiment. Missing values are left blank. Plants with only a Germination_time value were harvested before flowering and reproducing, and thus have no values in the remaining columns except rosette diameter (Diameter). Plants with a flowering time but no other fitness traits (silique number and length) died during freezing stress.
- Column A: Plant ID. The plant ID consists of genotype (Col, wrky38, or lsd1) + treatment (F for freezing) + individual ID (integer).
- Tray: Tray number
- Diameter: Rosette diameter (cm)
- Pod_number: Silique number
- Height: Inflorescence length (cm)
- Pod_1-Pod_12: Silique length (cm)
- Abvgrd_biomass: Aboveground biomass (mg)
- Germination_time: Date of germination
- Flowering_time: Date of flowering
Follow-up_freezing.R
R script used to conduct Dunnett's test across Col, wrky38, and lsd1 from the overnight freezing experiment. Requires the .csv files in the same folder as input.
Folder 3: 3_Allelic_variation
All scripts in this folder use open data from the 1001 Genomes website, and all analyses were conducted for 1001 Genomes accessions.
1_SNP_data_processing.sh
Shell script to generate a distance matrix for 1001 Genomes accessions using whole-genome synonymous SNPs. Requires snpEff, VCFtools, and plink on the Linux OS.
2_TajimaD_WRKY38.r
R script to plot Tajima's D values across chromosome 5 and test whether the Tajima's D value of the bin containing WRKY38 deviates significantly from the chromosome-level distribution. Tajima's D values were calculated using VCFtools on Linux. The command line used to calculate Tajima's D is provided in the annotations of this R script.
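One simple way to ask whether a single bin stands out from the chromosome-wide values is an empirical two-sided tail proportion. The Python sketch below illustrates that idea only; this README does not state which test 2_TajimaD_WRKY38.r actually applies, so the approach, the function name `empirical_tail`, and the numbers are all assumptions made for illustration.

```python
def empirical_tail(values, focal):
    """Two-sided empirical tail proportion of `focal` among `values`.

    Computes the share of bins at or below the focal value and the share
    at or above it, then doubles the smaller share (capped at 1.0).
    """
    n = len(values)
    lower = sum(v <= focal for v in values) / n
    upper = sum(v >= focal for v in values) / n
    return min(1.0, 2 * min(lower, upper))


# Invented per-bin Tajima's D values; the last one plays the focal bin.
bin_values = [-1.5, -0.5, 0.0, 0.5, 1.0, 2.5]
print(empirical_tail(bin_values, 2.5))
```

With real data the chromosome would contribute many more bins, so the attainable tail proportions would be much finer than in this six-value toy example.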
3_NJ_trees_and_maps_generating.r
R script to plot genome-wide neighbor-joining trees, the WRKY38 gene tree, and a map for 1001 Genomes accessions.
4_snp_matrix_and_correlation.r
R script to visualize SNP status (reference/alternative) within 10 kb of the WRKY38 gene and to analyze the correlation between each SNP and the most common functional variant of WRKY38 (Frameshift7495793).
5_WRKY38_allelic_variation.r
R script that analyzes associations between WRKY38 functional variants and bioclimate variables using linear mixed models (LMMs) with kinship included as a random effect, and that uses t-tests to compare bioclimate variables between accessions with and without certain WRKY38 functional variants.
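The two-group comparison of a bioclimate variable can be illustrated with a Welch t-statistic. This Python sketch is an assumption-laden illustration, not the authors' analysis: this README does not say which t-test variant the R script uses, the function name `welch_t` is hypothetical, and the two groups of values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    # Squared standard errors of the two group means.
    se2_a = variance(group_a) / len(group_a)
    se2_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Invented values of one bioclimate variable for accessions with and
# without a given functional variant.
with_variant = [1.0, 2.0, 3.0, 4.0]
without_variant = [2.0, 3.0, 4.0, 5.0]
t_stat, dof = welch_t(with_variant, without_variant)
print(t_stat, dof)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value; in R, `t.test` performs the Welch version by default.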
6_LSD1_GEMMA_input_prep.r
R script to generate the input files that [GEMMA](https://github.com/genetics-statistics/GEMMA) requires. SNPs within 10 kb of the LSD1 gene are tested for associations with bioclimate variables.
7_LSD1_GEMMA.sh
Shell script to conduct GEA using GEMMA.
8_LSD1_expression.r
R script to analyze expression differentiation between LSD1 alleles and their relationship to bioclimate variables.
