Last updated 31 Jan 2018. Structure of this repository ============================== 1. Code Batch scripts used to process the data are found in the script/ directory. The batch scripts were copied as-is from the Minnesota Supercomputing Institute servers. As such, they will not run on other systems without modification, but they do show the programs and settings used to process the data. Helper python and R scripts are found in the bin/ directory. Code for making figures is in the Figures/ directory. All figures were subject to minor manual editing for publication. To re-run the figure code, you will have to copy some files to the directory with the figure-making script -- see comments in each script. 2. Input data and results These are found in the data/ directory. There are 8 sub folders GEMMA- Association analysis raw output, summaries of top variant lists, and code necessary to compile them. GLM- Results from statistical analysis of allele frequency changes. Raw output, summaries of top variants, and R code to generate them HARP- Input data for HARP (Inputs). Inferences of strain frequency in each environment (Strain_estimates) and strain fitness calculated from that output (Fitness). metadata- List of 101 strain names, List of pool names, and additional information of the provenance of the USDA strains used. neighboor_net-info for constructing neighboornets reference- Information on the strain used as a reference 1106 single_innoc_data- single inoculation phenotype data for correlations variants- SNP, MNP, and PAV variant information The file names have been changed from the original names for clarity (e.g. "TY72_gemma_gwa_output.tsv" is better than "output.tsv"). The Workflow section notes the file names associated with each script. 3. Notes on names of experimental communities While we have tried to make the file names in this repo match the terms used in the ms, some of the data files have the original sequencing "code" names for communities. In general: - 101Bulk* = initial communities - HM101 = A17 - HM340 = R108 - FS = Field soil - 101Bulk_TY* = Liquid medium (TY) communities There were communities that went through the data processing pipeline but were not included in this ms. We have removed the results for those communities from the output files when it was easy to do so, but there are still some results for those communities, and their code names are visible in some of the script. There is a file in data/metadata/pool_names.txt that describes each pool included in the ms. Workflow ========== 1. Raw data. There were two rounds of sequencing. Most of the reads that contributed to this manuscript came from the first round (250 bp reads from HiSeq 2500), but some came from the second round (125 bp reads from HiSeq 2500). All the raw reads have been deposited in the SRA at the NCBI (BioProject PRJNA401437, SRR6029825 - SRR60299912). Internally, we have been referring to the first sequencing round as "160908" and the second as "161228". The raw sequencing reads for individual strains have also been deposited in the NCBI SRA: (BioProject PRJNA401434, SRR6055493-SRR6055666). The reference genome and annotation files are under data/reference. data/metadata/101Strains.txt is a text file with all the strains included in the experiment (original MSI location: data/strain/lists/101Strains.txt). 2. Read trimming. Reads were trimmed with TrimGalore - a separate script for each set of reads, but the same settings, except for the minimum trimmed read length: trim_reads_160908_reads_trimgalore_2016-09-16.sh and trim_reads_161228_reads_trimgalore_2017-01-18.sh Trimmed read files are not included in this repository. 3. Alignment to E. meliloti USDA1106 (the reference genome). Alignments were done with BWA using separate, but identical, scripts for the two rounds of sequencing: align_160908_reads_USDA1106_bwa_2017-01-25.sh and align_161228_reads_USDA1106_bwa_2017-01-25.sh 4. Variant calls and filtering. PAVs were called based on presence/absence of genes in denovo assemblies of the 101 strains. Orthologs for all the genomes (including strains not included in this study) were obtained using CD-HIT, and then the presence or absence of each cluster was converted into a VCF file: ortho_cd-hit_2017-06-30.sh SNPs, MNPs, and small indels were called by Joseph Guhlin for a different project (again, on a larger set of strains) using FreeBayes in parallel mode with the default settings on alignments of each individual strain to the USDA1106 genome (with BWA). Before any other processing was done, variants with a quality score < 20 were removed. Finally, we combined and filtered the raw variant calls to get variants useful for this study. There were two sets of filtered variant calls, each filtered for a different purpose. One set of filtered variants was meant for use with HARP to estimate strain frequencies and only included SNPs. This filtered dataset was made with: filter_variants_2016-10-11.sh The most useful files generated by this script are the ones used to run HARP. There was one for each replicon ("Uni0" = Chromosome, "Uni1" = pSymB, and "Uni2" = pSymA), and they are provided: Uni0.snp.variants.hm.txt.gz Uni1.snp.variants.hm.txt.gz Uni2.snp.variants.hm.txt.gz The second set of variants were filtered for use in association testing and scans for allele frequency change. The script used to do this was called: filter_variants_2017-07-04.sh And we have provided the filtered VCF file used for most of the analyses (called genome.vcf.gz in the script). Note that the MAF > 5% filter was applied in the individual downstream analyses. The file: association.genome.vcf.gz (named genome.vcf.gz in the script) A few other files with genotype information in a different format are also included because they were used for some of the figure- and table-generating scripts: genome.variants.tsv.gz: TSV version of VCF file genome.primitives.unnorm.variants.tsv.gz: Multi-site variants (MNPs) broken into individual sites; multi-allelic sites (more than 2 alleles) listed as a single site. genome.realcoords.tsv.gz: Translates between primitives (one entry for every segregating site) and original FreeBayes representation (some adjacent variants listed as a single variant -- MNPs). The first two columns give the actual genomic position of each segregating site; the 4th and 5th columns give the position listed in the FreeBayes representation. Note that the FreeBayes representation is used in association.genome.vcf.gz, genome.variants.tsv.gz, and the association output. SNP effects (missense, intergenic, etc) were annotated with SNPeff on an older version of the filtered variants (but the older version included all the relevant variants, so there was no need to update): annotate_variants_snpeff_2017-04-05.sh The output file from this is: snpeff_annotations.tsv.gz 5. Estimating strain frequencies with HARP. We used HARP (Kessler et al. 2013, MBE) to estimate strain frequencies from SNPs and the alignments of pooled data. We ran a different script for each sequencing dataset, but the settings were the same. Scripts: strain_frequency_160908_reads_harp_2017-03-20.sh strain_frequency_161228_reads_harp_2017-03-20.sh We provide the point estimates of frequencies and the standard errors: Frequencies: 160908.freq.tsv, 161228.freq.tsv Std Errors: 160908.freq.se.tsv, 161228.freq.se.tsv We modified these files so that they only included results from the communities included in the manuscript. As a check of HARP, we coded our own strain frequency estimation method. See: strain_frequency_160908_161228_reads_regression_2017-10-19.sh The Final fitness estimates calculated from these data can be found in: data/HARP/Fitness/FoldChanges_Mean_Final.tsv data/HARP/Fitness/FoldChanges_RDA_Final.tsv 6. Base counts and allele frequency change. We used PoPoolation2 to count the number of bases supporting each allele across the genome. The first stage used samtools mpileup and PoPoolation, with a separate script for each dataset: base_count_160908_reads_popoolation_2017-01-26.sh and base_count_161228_reads_popoolation_2017-01-26.sh Then the results were combined and filtered down to the same group of SNPs that were used in the association testing analysis (MAF > 5%). Note that although the date on this script is earlier than the date on the final iteration of variant filtering and GWA scripts, the set of SNPs did not change during that time period. The combining and filtering script is: base_count_160908_161228_reads_popoolation_2017-05-02.sh and the output file (called "output.vars.sync" in the script) is provided as: allele_counts.sync.gz and sync_file_pools.txt has the order of samples in the file. We also made tables of allele frequencies and read-depths for every pool. Script: allele_freq_160908_161228_reads_from_counts_2017-11-07.sh And then, we used logistic regression to test for changes in reference allele frequency, with this script: genome_scan_160908_161228_reads_glm_2017-05-03.sh The final results are in these tab-delimited files: A17_allele_freq_glm_output.tsv.gz R108_allele_freq_glm_output.tsv.gz FS8W_allele_freq_glm_output.tsv.gz TY72_allele_freq_glm_output.tsv.gz And with annotations: A17_allele_freq_glm_output.annotated.tsv.gz R108_allele_freq_glm_output.annotated.tsv.gz FS8W_allele_freq_glm_output.annotated.tsv.gz TY72_allele_freq_glm_output.annotated.tsv.gz The column names that are not self-explanatory: freqs0 Frequency of the reference allele in each initial pool freqs1 Frequency of the reference allele in each final pool converged 1 if the GLM converged, 0 if it did not Summarized results Top 2000 variants in each environment: Top2000SNPs_A17.tsv Top2000SNPs_FS8W.tsv Top2000SNPs_R108.tsv Top2000SNPs_TY72.tsv Used for compiling Table S1 TopGroupsSNPs_A17.tsv TopGroupsSNPs_FS8W.tsv TopGroupsSNPs_R108.tsv TopGroupsSNPs_TY72.tsv Top 25 representative variants used for S3 and S4 figures: Top25LDReps.tsv And the R code used to create them: GLM_candidates.R 7. Association testing (GWA). We ran standard mixed-model association tests with GEMMA. The script we used is: genome_scan_160908_161228_reads_gwas_2017-07-17_SKmat.sh The raw results from GEMMA are in these tab-delimited files: A17_gemma_gwa_output.tsv.gz R108_gemma_gwa_output.tsv.gz FS8W_gemma_gwa_output.tsv.gz TY72_gemma_gwa_output.tsv.gz The p-values reported in the ms are from the "p_lrt" column. The annotated results are in these files: A17_gemma_gwa_annotated_output.tsv.gz R108_gemma_gwa_annotated_output.tsv.gz FS8W_gemma_gwa_annotated_output.tsv.gz TY72_gemma_gwa_annotated_output.tsv.gz The compiled and summarized results used to make Table S2: TopGWAS_A17.tsv TopGWAS_FS8W.tsv TopGWAS_R108.tsv TopGWAS_TY72.tsv And the code used to make them GEMMA_candidates.R 8. Neighbor networks. SplitsTree and some custom python code. phylo_splitstree_2017-10-13.sh The distance matrix from Phylip is included as: phylip.f84.dist (originally f84.dist in phylo_splitstree_2017-10-13.sh). And the SplitsTree nexus-format output is also included (created using the SplitsTree GUI): neighbor_net.nex 9. Genome-wide LD calculation. ld_genome_mean_r2_plink_2017-11-07.sh, 10. Allele frequency change p-value and read depth figure. Code (plot.r) in Figures/read_depth_glm_figure/. 11. NJ tree and heatmap figure. Code (plot.r) in Figures/nj_tree_and_heatmap_figure/. 12. QQ plots The code to generate qq plots is in Figures/qq_plot_figure/. 13. MAF plots The code to generate plots of minor allele frequency is in Figures/maf_figure/. 14. Code to create all the rest of the figures can be found in the file “Figures/SelectReseq_Figures.R”. It has been designed to work with the file structure in the data deposition.