README.txt -- archive for Boechera stricta flowering GWAS experiments
___________________________

R and Rmarkdown scripts show input of the original data, rename SAD & LTM genotypes,
perform ANOVAs, log-transform flowering date, compute heritabilities, and export BLUPs
for subsequent analysis.
For any Excel files, note that the R code allows open access to all Excel files.

Scripts by Thomas Mitchell-Olds, Baosheng Wang, Julius Mojica, and Wenjie Yan.

Input flowering data for ANOVA & create BLUP output files:

Experiments B1, B3, B4, B5, B6:
Analyses shown in:
    Baosheng_Expt_1.html - Baosheng_Expt_6.html
Trait data from:
    BSW_flowered_Set1.txt - BSW_flowered_Set6.txt  (Excluding B2)
Columns of input files:
    RefPop = accession ID
    Set = Set
    Rep = Replicate
    Pos = Position
    FlrDat = Age at first flower
The output BLUP files are in:
    Expt_BSW_Set1_LogFlr_BLUPS.txt - Expt_BSW_Set6_LogFlr_BLUPS.txt
Containing columns:
    ID = accession ID
    RP_mean1 = mean of log(FlrDate) for this accession
    RP_blup1 = mean of logFlrDate BLUP for this accession
    (The RP_meanXX and RP_blupXX names are adjusted in each expt)
___________________________

Experiment E1:
Analysis shown in:
    Emily_Plant_Data.html
Data input file:
    GWAS_induced_data_20170619.TMO.03.xlsx
Data columns used:
    RefPop = accession ID
    Treatment_2 = Treatments that happened after flowering; not used here
    Set = A blocking variable based on planting date.
        (Yan et al)
    Replicate = Replicate
    Block = Block nested within Replicate
    Alive2 = alive after cold treatment
    Out_Cold_Date = Date taken out of cold
    Flower_date = days after cold treatment
Output file is:
    Expt1_Emily_LogFlr_BLUPS.txt
Columns are:
    ID = accession ID
    RP_meanE1 = mean of log flowering date for this accession
    RP_blupE1 = BLUP of log flowering date for this accession
___________________________

Analyses for experiments W1, W2, W3, W4:
Analyses shown in:
    Wenjie_Expt1d.html, Wenjie_Expt2c.html, Wenjie_Expt3.html, Wenjie_Expt4.html
Data input files:
    wenjie_first-2020-04-23.xlsx, columns used:
        ExptNum = experiment
        FlrAge = age at first flower
        PltNum = plant number
        GenoNum = accession ID
        RepNum = replicate number
        RackNum = rack number
        Alive2 = alive after cold (yes/no = 0/1)
    wenjie_second-2020-04-23.xlsx, columns as above
    wenjie_third-2020-05-18b.xlsx, columns used as above, except RefPop = accession ID
    wenjie_fourth-2020-05-18b.xlsx, columns used as above, except RefPop = accession ID
Output files are:
    Wenjie_first_LogFlr_BLUPS.txt, Wenjie_second_LogFlr_BLUPS.txt,
    Wenjie_third_LogFlr_BLUPS.txt, Wenjie_fourth_LogFlr_BLUPS.txt
With columns:
    ID = accession ID
    RP_meanW1 = mean of log flowering date in W1
    RP_blupW1 = BLUP of log flowering date in W1
    (The RP_meanXX and RP_blupXX names are adjusted in each expt)

This completes input and processing of 10 flowering data files.
___________________________

Merge_BLUPs.html merges BLUPs from 10 environments, then computes correlations among
environments, does PCA, and prints the biplot.
Input files are the output files above.
Merged output is saved to LogFlr_BLUPs_10_Expts.txt, with the following column names:
    ID RP_blupE1 RP_blupW1 RP_blupW2 RP_blupW3 RP_blupW4 RP_blup1 RP_blup3 RP_blup4 RP_blup5 RP_blup6
These columns contain BLUPs, with one column for each experiment and one row for each accession.
___________________________

GxE_E1_W3.html plots the norm of reaction.
Input data from LogFlr_BLUPs_10_Expts.txt (above) and from ID_PopGroup_worldclim_PCA.txt
    Columns for accession ID, elevation in meters, log flowering date, longitude, latitude,
    Bioclim scores 1 .. 19, climate PC scores 1 .. 5, and PopGroup
Output to Reaction_Norm.E1.W3.png
___________________________

Maps_3b.html creates two maps, one for PopGroups, and one for scaled flowering date.
Input file:
    Bstricta_RefPop_20_02_21.FlrGWAS.xlsx
With columns:
    ID = accession ID
    Latitude = Latitude of collection site
    Longitude = Longitude of collection site
    Elev_m. = Elevation of collection site in meters
    PopGroup = Population group
Maps_3b.html also saves "RefPop_lat_lon.txt" with columns:
    ID = accession ID
    LogFlrDat = log flowering age
    Elev_m = Elevation of collection site in meters
    lat = Latitude of collection site
    lon = Longitude of collection site
___________________________

Get_100_SNPs2.R examines patterns of pleiotropy for flowering time across environments
for genome-wide larger-effect SNPs. It then performs a permutation test: does the average
correlation among QTLs differ from the average correlation among SNPs?
Input files:
    B1.BLUPs.txt_emmax.ps.mod
    B3.BLUPs.txt_emmax.ps.mod
    B4.BLUPs.txt_emmax.ps.mod
    B5.BLUPs.txt_emmax.ps.mod
    B6.BLUPs.txt_emmax.ps.mod
    E1.BLUPs.txt_emmax.ps.mod
    W1.BLUPs.txt_emmax.ps.mod
    W2.BLUPs.txt_emmax.ps.mod
    W3.BLUPs.txt_emmax.ps.mod
    W4.BLUPs.txt_emmax.ps.mod
These emmax.ps.mod files don't have column names. Column names are assigned as follows:
    ChrNt = concatenated string with chromosome and nt position
    EffectSize = phenotypic effect size of each SNP
    Pval = P-value
    Chr = chromosome number
    SNPpos = nucleotide position
___________________________

Bioclim_QTLs_E1.html examines QTLs & Bioclim data. Plots geographic locations of QTL alleles.
Input:
    Flowering_SNPs_transpose.xlsx
        Columns for accession ID and genotype at each SNP.
        AA, AB, BB genotypes are 0, 1, 2, respectively
    ID_PopGroup_worldclim_PCA.txt
        Columns for accession ID, elevation in meters, log flowering date, longitude, latitude,
        Bioclim scores 1 .. 19, climate PC scores 1 .. 5, and PopGroup
    Wenjie_third_LogFlr_BLUPS.txt
        Containing columns:
        ID = accession ID
        RP_meanW3 = mean of log(FlrDate) for this accession
        RP_blupW3 = mean of logFlrDate BLUP for this accession
___________________________

Bioclim_Pop_Group.html
Input and compute climate data.
Uses RefPop_lat_lon.txt, above.
Worldclim data has been computed and stored in RefPop_bio.txt, or we can download online data.
Several intermediate files are computed, explained, and saved.
___________________________

GxE_peaks.R
Get significant climate QTLs and compare to flowering candidates and observed flowering QTLs.
Input files:
    E1W3.txt_emmax.ps.mod
    B1W2.txt_emmax.ps.mod
These emmax.ps.mod files don't have column names. Column names are assigned as follows:
    ChrNt = concatenated string with chromosome and nt position
    EffectSize = phenotypic effect size of each SNP
    Pval = P-value
    Chr = chromosome number
    SNPpos = nucleotide position
Bs_FLT_ortho.txt
    Information on known flowering genes. Columns used:
    Bs_gene = name of stricta gene
    At_gene = name of Arabidopsis gene
Bstricta_278_v1.2.gene_exons.chr.gff3.xlsx
    Locations of B. stricta genes. Columns used:
    Chr_Num = Chromosome number
    Chr_start = Start position for gene
    Chr_end = End position for gene
    Name = gene name in B. stricta
___________________________

Climate_peaks.R
Compare significant climate QTLs to flowering candidates and observed flowering QTLs.
Input files used (similar to above):
    PC2.txt_emmax.ps.mod
    PC3.txt_emmax.ps.mod
    PC4.txt_emmax.ps.mod
    PC5.txt_emmax.ps.mod
___________________________

GWAS_flowering_plasticity.noZ.R
Script to find pairs of environments for GxE analysis.
Find environments with the lowest correlations, excluding W3, and also include E1W3.
Calculate plasticity as Xi - Yi.
Does the correlation between first and second season expts differ from the rest?
Then, analyze differences in log flowering time in six environment pairs.
Input file:
    LogFlr_BLUPs_10_Expts.txt (see above)
Output data for GWAS to files with environment pairs
    (B4W4.txt, W2E1.txt, B1W1.txt, B1W2.txt, W1W2.txt).
    Columns contain accession ID and the difference in trait values between two
    environments, for each accession.
___________________________

emmax_manhattan.R
From emmax output file, find most significant SNP on chosen chromosome.
(Or you can mask to use part of a chromosome.)
First run this program. Then update "info" as needed in annotations.R

OR

emmax_manhattan_GxE.R
From emmax output file, find most significant SNP on chosen chromosome.
(Or you can mask to use part of a chromosome.)
First run this program. Then update "info" as needed in annotations.R
___________________________

Annotations.R
Get gene annotations near LOD peak, given known flowering candidates.
Uses output from emmax_manhattan.R or emmax_manhattan_GxE.R
User needs to choose the chromosome to be checked for the most significant SNP.
Uses input E1W3.txt_emmax.ps.mod (or similar from other experiment).
Update different input files as needed. See changeable inputs on lines 12 - 17.
Outputs text to screen:
    Experiment, most significant SNP in chromosome X, and position of max.SNP
___________________________

Overlap_climate_SNPs_FlrDat_SNPs.R
Using emmax output file, extract climate SNPs from PC2 & PC4. Find overlap with FlrDat SNPs.
How many climate PC SNPs are near flowering time QTLs?
Input files:
    Ref_PC2.txt_emmax.ps.mod or similar.
    SNP_Locations.txt with location info for QTLs
        QTL.name = name
        Chr = which chromosome
        Envirn = identified in which experiment
        SNPpos = nucleotide location for QTL
        kind = "Flr.QTL" or "Climate"
___________________________

emmax_QQ_B5.R
Produces QQ-plot for P-values from emmax GWAS
Input files:
    B5.txt_emmax.ps.mod or similar.
Outputs QQ-plot for a given experiment
___________________________

Three scripts are needed for an emmax run:
1) case.gwas_bs.sh will be executed by the SLURM script (below)
2) param.config, used within case.gwas_bs.sh
   (see the following line in case.gwas_bs.sh)
       source /dscrhome/jpm76/input.files/param.config
3) Script (below), including locations of three files
___________________________

#This shows a script to run gwas on data file B1.BLUPs
cd /hpc/group/tmolab/wy62/archive
cd flower.files/
###run each experiment
sh case.gwas_bs.sh B1/B1.BLUPs.txt
___________________________

# This shows a script for gwas on a flowering PCA file
cd /hpc/group/tmolab/wy62/archive/Ref_PCA
###run example PCA file
sh case.gwas_bs.sh PC1/Ref_PC1.txt
___________________________

The following R-script is called to keep genotypes with both SNP and trait data:
    make_genokeep_phenokeep.R
___________________________

Scripts and outlines for population genetics analyses:

For EHH, LD, Pi and H12, we include all code in the file
"processes of population genetic analyses.txt".
For Fst and dxy analyses, there are more than 38 scripts and input files used for analyses.
For these, see the file "processes of population genetic analyses.txt".
All population genetics scripts and input files are in a zipped file:
"code and input used for Fst and dxy analyses.zip".

File Bs517_library_Genbank_SRP.submit20210114.xlsx contains accession numbers for each accession.
Columns are:
    SampleID = RefPop accession number
    Latitude = latitude in degrees
    Longitude = longitude in degrees
    PopGroup = Population Group
    SRAStudy = short read archive accession number
        (see https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/ for details)
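___________________________

Illustrative sketch (not part of the archived scripts): the "###run each experiment" step
in the emmax run script above shows only the B1 command. The loop below is one possible way
to list the equivalent command for all ten flowering experiments. The EXPERIMENTS list, the
gwas_commands helper, the RUN switch, and the assumption that every experiment follows the
B1/B1.BLUPs.txt directory layout are hypothetical; only the B1 path is confirmed in this README.

```shell
#!/bin/sh
# Hypothetical helper, not one of the archived scripts.
# ASSUMPTION: each experiment's BLUP file sits at <expt>/<expt>.BLUPs.txt,
# mirroring the B1/B1.BLUPs.txt example; only the B1 path is documented.
EXPERIMENTS="B1 B3 B4 B5 B6 E1 W1 W2 W3 W4"

# Print one case.gwas_bs.sh command per experiment.
gwas_commands() {
    for expt in $EXPERIMENTS; do
        echo "sh case.gwas_bs.sh ${expt}/${expt}.BLUPs.txt"
    done
}

# Dry run by default: only list the commands.
# Set RUN=1 to execute them from the directory that holds case.gwas_bs.sh.
if [ "${RUN:-0}" = "1" ]; then
    for expt in $EXPERIMENTS; do
        sh case.gwas_bs.sh "${expt}/${expt}.BLUPs.txt"
    done
else
    gwas_commands
fi
```

Run as is, the sketch only prints the ten commands, so the paths can be checked against the
actual directory layout before anything is submitted.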