Please read this file thoroughly before attempting to run any of the scripts contained within this repository ============================================================================================================== **Note: this file is written in markdown format and can be converted to pdf or html format as desired** Overview -------- The contents of this Dryad Data repository contain the scripts and data files necessary to perform STC genotyping as described in: >Stutz, W.E. & Bolnick, D.I., 2014. Stepwise Threshold Clustering: A New Method for Genotyping MHC Loci Using Next-Generation Sequencing Technology. PLoS ONE, 9(7), p.e100587. Please cite this reference if you use STC. The files in this repository serve two purposes. First, they contain the raw data and scripts used to produce the genotyping results described in the manuscript cited above. Second, these files can be used to allow potential users of the STC genotyping method to reproduce the results in order to understand how better how STC genotyping works. We would strongly urge users to familiarize themselves with the scripts and how they work before starting to analyze any MHC data. In theory, users should be able to apply their own 454 generated data using the scripts contained herein, though we recommend at the very least attempting to reproduce the results from the example sample described in the accompanying paper (see "Running STC" below). For users interested in reproducing our results, or using STC to genotype their own data Please the section entitled "How To Use STC" and the "Questions" sections for further instructions. For users interested simply in downloading and looking at the final data set, please see the "Metadata for Data Files" section for further information regarding these data sets. Last Updated ------------ November 21, 2014 Contact Info ------------ Dr. William E. Stutz University of Colorado, Boulder Ecology & Evolutionary Biology N122 Ramaley 334 UCB Boulder, CO 80309-0334 Email: william.stutz@colorado.edu Acknowledgments ---------------- Many, many thanks to Alvaro Sebastian for drastically improving the performance and usability of the readFNA.pl script now included in the repository. Alvaro Sebastian Evolutionary Biology Group Faculty of Biology Adam Mickiewicz University Files Included in Repository ---------------------------- ### data tables sample_barcodes.csv barcode_sequences.csv raw_reads.csv final_data.csv ### fasta format data files reads.fna master.fasta sequences.fasta ### scripts runall.sh readFNA.pl make_sequences.R stc.R stc_functions.R chimeras_functions.R Required Programs ----------------- R (written using R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows", may work with later versions) perl (for demultiplexing raw data) clustalw2 (for alignment, can be changed within the stc.R script to use other alignment software) How To Use STC -------------- ### Types of Users This section is aimed at two types of users interested in genotyping their own samples. For both types, we assume that users have genotyped MHC loci for some number of samples using 454 sequencing technology, and the MHC amplicons and barcodes can be sequenced in their entirety within a single sequencing well (i.e. no 1000 base pair long amplicons) The FIRST type of user is one who has run their samples and wants to proceed from the raw data. These types of users should proceed to the section entitled "Setting up -- Raw Data", where we provide instructions for demultiplexing and setting up data using our included perl and R scripts. The SECOND type of user has run their samples and demultiplexed them using their own preferred method. By demultiplex we simply mean that each read has been assigned to an individual sample on the basis of sequenced barcodes (essentially performing Phase 1 on their own). This is completely compatible with STC genotyping. These users should proceed to "Setting up -- Demultiplexed Data". Users wishing to apply non-pyrosequenced data to the STC method are encouraged to do so, with the caveat that the current versions of the scripts have not been optimized for such data, and may take much more user "tweaking" than we can guide in this README. Lastly, each main step described below has a corresponding command contained within the "runall.sh"script, which lists all the commands necessary to run STC from start to finish. Some of these commands can be run easily from the terminal window. However, we suggest that the stc.R script be run in an interactive session. ### Setting up -- "Raw Data" Start here if you have sequenced your samples using barcode sequences to identify samples in downstream analyses. Please note that our demultiplexing method (perl script readFNA.pl) was written as an "in house" program to transform raw sequence data into a useable format. Users may find that other available programs are easier to use, or more satisfactory, than our provided scripts. Alternatively, they may find that they can write a better program that will do much the same thing. If that is the case we encourage the use of other programs, and we would direct users to start at the section entitled "STC Phase 1 -- Demultiplexed Data" for further instructions on how to use STC. #### Step 1 -- modify input files if genotyping your own data If you are just interested in demultiplexing and running STC on the provided data set (reads.fna), you don't need to do anything. If you want to demultiplex your own data, you'll need to make a couple of changes to your data files. + First, rename the fasta format file containing all the reads in your sequencing run to 'reads.fna'. If you don't have this file, it can be obtained by processing the .sff file produced during 454 sequencing using the Roche software that comes with the sequencing machine. Talk to your sequencing center if you were not provided this file. Alternatively, there may be some third-party software programs that can process .sff files. Ultimately, what you want is the output from every well in the run, minus any platform specific adapters. + Second, you need an input data table containing the names of every sample along with their barcodes. This file should look like the provided "sample_barcodes.csv" file, and should be re- named as such. Please see the "Metadata for Data Files" section for more information about how to format this file. + Lastly, you will need to provide your barcodes in a data table, as in the provided 'barcode_ sequences.csv' file. This data file contains two columns: the numeric designation of each barcode, corresponding to the assignments in "sample_barcodes.csv", and the nucleotide sequence (5' to 3') of each barcode. Rename this file "barcode_sequences.csv". Add all three modified files mentioned above to the working directory. #### Step 2 -- modify readFNA.pl if genotyping your own data Only do this if you are genotyping your data. Otherwise DO NOT CHANGE the readsFNA.pl. At the top of the readFNA.pl is a list of parameters that can/need to be changed to work with your unique data set. These include: + my $forprimer: input your forward primer here (e.g. 'TGTCTTTAACTCCACGGAGC';) + my $revprimer: input your reverse primer here (e.g. 'CTCTGACTCACCGGACTTAG';) + my $barcodelen: input the length of your barcodes (e.g. 15;) + my $minreadlength: this is the minimum read length including barcodes and primers required to use the read + my $minlengthseq: the minimum length of intra-primer sequence necessary to use a given read Note that certain lines of code have been blocked out. These lines are used to account for minor errors ( especially homo-polymer errors) within the primer sequences. By accounting for these, we can add a few extra reads to each sample (typically less < 5%). Please feel free to adjust the primer sequences in these lines of code and include them in your analysis, but it isn't completely necessary. Please re-download the readFNA.pl if you think you made a mistake. The idea here is that readFNA.pl searches for small sequences of nucleotides that signify where barcodes and amplicons start so it can pull them out. As such, these sequences need to be changed to reflect the expected sequences going in. Please see section on "Modifying readFNA.pl" for the exact changes to be made. #### Step 3 -- working directory Make sure all the files included in the dryad repository are in your working directory. The entire STC process from start to finish is written to work out of a single working directory. However, an output file folder will be created in the directory to store new files and differentiate them from downloaded files. This makes it a bit easier to differentiate files that can be modified (output files) and ones that shouldn't (downloaded files, except in some cases). #### Step 4 -- open terminal Open up a terminal window and navigate to the working directory. #### Step 5 -- set up new directories In this step, an "output" folder is created, along with a "results" folder in the working directory. We also move some files we want to use to the output folder (so the program knows where to find them). To do this, copy and paste lines 12-28 of the runall.sh script. #### Step 6 -- run the readFNA.pl to demultiplex and process the reads.fna file At this point, we are ready to demultiplex. The readFNA.pl script takes FOUR input files, which have been created using the above code, or were included in the dryad repository: 'reads.fna', 'sequences', 'barcodes', and 'master.fasta'. The latter file is an empty fasta file that readNFA.pl will look for. This file is unimportant at this stage. It's purpose is to provide a list of known sequences -- both artifactual and true allele sequences -- along with corresponding numeric sequence IDs. If this file contains known sequences, than the same numeric sequence IDs will be used during genotyping. This is especially useful for comparing the results across sequencing runs, because the same numeric IDs refer to the same amplicon sequences in each run. To run readFNA.pl, simply copy the code from line 34 in runall.sh into the terminal. You should see numbers gradually counting up -- this is the script processing successive lines of reads.fna. The script is done when every sequence is processed. #### Step 7 -- understanding the output of readFNA.pl readFNA.pl will produce two new files in the output directory (see lines 95-98 in readFNA.pl) These are: raw_reads.csv and new_master.fasta. New_master.fasta is an updated version of the master.fasta file containing both the previously known amplicon sequences and the new ones found in the current run. In the case that master.fasta was empty it will contain only the sequences found in the current run. The raw_reads.csv file is a data table where each row corresponds to one read from the sequencing run, along with information about the read, including which barcodes it contained (if identifiable) what sample the read came from (if possible), and the ID # of the sequence corresponding to the new_master.fasta file. Please see the "Metadata for data files" section below for more information about the data columns contained in the raw_reads.csv file. Note also that the repository already contains the raw_reads.csv file produced when readFNA.pl is run using the included reads.fna file. #### Step 8 -- create sequences.fasta file using the make_sequences.R script This scripts uses the raw_reads.csv file and the new_master.fasta file to create a fasta file containing the sequences and IDs of all unique sequences identified in the current run. Note that the repository already contains a version of this file, the contents of which correspond to the reads in reads.fna. At this point you are read to move on to "Running STC". ### Setting up -- "Demultiplexed data" Users should start here if they have already demultiplexed their data and do not need to use the accompanying readFNA.pl script to do so. However, there are a couple of steps that will need to be taken to ensure that the data is compatible with STC. First, you will need a fasta file containing every unique amplicon sequence (sequence between the primers) found in the run. Each sequence will need a unique numeric ID. Name this file 'sequences.fasta'. See the provided sequences.fasta file for an example. Second, you will need a sample_barcodes.csv file. The included version of this file contains four columns: fbc, rbc, sampleID, and genericID. SampleID's are the actual names of the samples as you identify them (i.e. can be alphanumeric). genericID's are simply numeric IDs assigned to each sample (i.e 1 - # of samples). For example, my firsts sample may have a sampleID of Samp10-001. I will give it the corresponding genericID 1 or 1. Third, you will need to create your own version of the raw_reads.csv file. There is a version of this file included in the repository. Each row in this file represents one sequenced read. For your purposes, your new version will need to include, at minimum, four columns: fbc, rbc,'genericID' and 'seqID'. 'genericID' is a numerical (no letters or symbols) ID indicating which sample the read has been assigned. For example, in the raw_reads file, all reads originating from sample Samp10-001 have a 1 in the genericID column. 'seqID' gives the numerical ID of the amplicon sequence for that read which corresponds to the list of sequences provided in the sequences.fasta file. fbc and rbc indicate the numeric IDs for the forward and reverse barcode sequences used for each sample. If you didn't use barcodes, its fine. Simply give the samples dummy numeric barcodes. Fourth, create a folder named 'output' within the working directory and copy these three files into it. At this point, you can proceed to "Running STC". ### Running STC The rest of the STC genotyping process is run via the stc.R script included in the repository. If the "Setting Up" steps have been run correctly, stc.R should work correctly. To start, open up in an interactive R session in the working directory, and open the stc.R script. The first section of the script allows the user to assign values for all the user-defined parameters. These are described fully in the accompanying manuscript. Leave these values alone if one is genotyping the provided data set. The create_sequence_tables (TRUE/FALSE) will create individual results tables for each sample if marked TRUE. The rest of the script should run without much further input from the user. Please read the descriptions and comments throughout the script for further details about what is occurring. Note that most of the operations are performed by functions contained within the stc_functions.R script. Starting at line 106, the script will start cycling through samples using a for loop, and performing STC on each sample individually. It is advised that new users try individual samples one and at a time by bypassing the 'for' loop starting at line 114. Instead, a single value for the 'sample' variable can be input directly. A good place to start is reproduce the results for the single example sample from the accompanying manuscript. To do this, run run line 112 which will designate the 256th sample in the samples object as the target sample. This corresponds to sampleID 490, which is the sample used in the manuscript to illustrate the STC process. **note**: a previous version of stc.R contained a line asking for the file all_TAs_parapatry.csv. This line was included in error and referred to a file specific to the data analysis performed in the paper. This newer version no longer includes this line and should cross-check putative alleles against good alleles found within the run being analyzed. Additional Questions -------------------- ### (1)"What type of data (i.e. platform) can I use with STC?"# STC was originally created to analyze reads produced using 454 sequencing. As such, it assumes that the number of reads per sample generated is on the order of 100's to 1000's. Data produced on platforms capable of producing millions of reads (10's of thousands or more reads per sample, i.e. Illumina) may or may not work well with STC as currently implemented. Please see the manuscript listed above for suggestions on applying STC for Illumina produced data. Additionally, STC was also created to analyze samples with a maximum of 12 different allelic sequences per sample. IF your samples are expected to have substantially more alleles (> 25), STC should still work. However, the time taken may be quite long and the input parameters may need to be adjusted quite a bit to account for the distribution of reads among different alleles and the number of reads needed to accurately genotype a single sample (will be much > 80 reads). ### "How many samples can I run with STC"# STC can be run for any number of samples. In fact, it can be run for single samples, which is a good way to optimize the adjustable parameters if the allele sequences for those samples are known a priori (i.e. from cloning or previous genotyping). There is one major "for" loop in the script with iteratively runs through phases 2 and 3 of STC for a given list of samples. ### "Do I need any other programs or software to run STC?"# Yes, STC currently uses clustal to align reads. Please make sure clustalw2 is installed and callable using the clustal function included in the 'ape' R package. See 'ape' documentation for more information. ### "What are good values for the input parameters listed in the STC script?"# The values supplied in the script were optimized for the data set analyzed in the accompanying manuscript. Most of these will represent decent starting values for most data sets and can be tweaked as needed. However, there a few which will be vary quite a bit depending on the study: (1) amp_length: will need to be set individually for every study. It corresponds to the length of amplified fragment (not including primers) expected to be produced during PCR. (2) min_reads: currently optimized for samples with a maximum of 12 alleles. If more alleles are expected, this number will need to increase to ensure enough coverage. Please see manuscript for suggestions on how to set this parameter. (3) size_thresh: sets the minimum cluster size needed for a cluster to be considered a good cluster. This number will also depend on the maximum number of alleles expected in any given sample. Please see manuscript for suggestions on how to set this parameter. ### "What if I have duplicate samples in the same run?"# That's fine. Just give duplicate samples different genericIDs in the reads.csv file if starting with demultiplexed data, and different genericIDs and barcodes in the sample_barcodes.csv file if starting with raw input data. STC will treat these as different samples for genotyping. ### "How are potentially chimeric alleles identified?"# We have implemented a function within R (within the stc.R script) to check if potential true alleles are chimeras (see Phase 4 of the manuscript). It performs a simple scan of alleles within samples to check for possible recombination breakpoints indicating that two alleles could have been the parent alleles of a third allele. If you have your own preferred method for identifying chimeric alleles, by all means, do so. ### "Where are all the functions used in this script?" All of the functions are included in the accompanying stc_functions.R script. These functions are the nitty-gritty of STC. We have elected to separate them out from the main script so that the individual steps and overall structure of STC can be easily discerned without users getting bogged down in endless lines of code. IF any particular step is not working, by all means consult the stc_functions.R file ### "What do I get at the end?" If the stc.R script runs successfully through all phases of STC you will end up with a lot of files, of which the following are important: all_TAs_phase4.csv: a table of all the true alleles found for each sample alleles_summary: summary of all true alleles identified during genotyping across all samples samples_summary: summary of all samples genotyped (# of reads, alleles, etc...) TA.fasta: fasta file of the sequences of all true alleles TAs.aln: alignment of all true allele sequences TA_dist.csv: distance matrix of all true allele sequences sample specific files (where denotes sample ID): xxx_dist.csv: distance matrix of all sequences found for that sample xxx.fasta: fasta file of all sequences found for that sample xxx.aln: alignment of sequences for that sample xxx_dist.csv: distance matrix of all sequences found for that sample xxx_pairs.csv: list of combined pairs of sequences xxx_sequence_summary.csv: summary of cluster results for by sequence xxx_clusters.csv: table of sequences organized into final clusters xxx_smalls.csv: table of sequences in small clusters after phase 3 xxx_TAs.csv: table of true alleles xxx_true.fasta: fasta file of true alleles ###"Is there anything else I should know?" We would urge anyone who wishes to try STC on their data set to CAREFULLY read over the manuscript and this README file before starting. We would also recommend starting with just one sample and working through phases 2 and 3 of STC step by step (i.e. don't use the for 'for' loop which loops through multiple samples). Metadata for data files ----------------------- #### sample_barcodes [samples genotyped along with barcodes used] genericID: numeric identification given to each sample (i.e 1 - # of samples) sampleID: actual identification for each sample (can be alpha-numeric, e.g. samp330-01) fbc: the forward barcode ID number rbc: the reverse barcode ID number #### barcode_sequences.csv [nucleotide sequences of each barcode used] barcode: the barcode number sequence: the 5'-3' nucleotide sequence of the barcode #### raw_reads.csv [output from readFNA.pl includes rows corresponding to each sequenced read] seq number: numeric identifier for each sequence produced during sequencing read type: forward or reverse read read length: length of read forward barcode: forward barcode sequence forward primer: forward primer sequence: amplicon sequence reverse primer: reverse primer sequence reverse barcode: reverse barcode sequence fbc: forward barcode ID rbc : reverse barcode ID genericID: numeric ID of the sample to which the read was assigned seqID: numeric ID of amplicon sequence (corresponds to sequences in sequences.fasta) #### final_data.csv: [final data set after running stc.R, rows correspond to sequenced samples] sampleID: sample identification duplicate: A or B, B indicates duplicate read_number: number of reads sequenced read_number_used: number of reads used (i.e. less than read number if sub-sampled) good_alleles: number of good clusters after phase 3 small_clusters: number of small clusters after phase 3 amb_clusters: number of ambiguous clusters after phase 3 dropped_alleles: number small clusters that were converted to good clusters at phase 4 all_alleles: total number of good alleles after phase 4 chimeras: good alleles identified as likely chimeras incorrect_length: good alleles with open reading frames true_alleles: final number of true alleles for each sample (good alleles - (chimeras + incorrect length) X1, X2,...: correspond to allele 1,2, etc, and indicate whether a given sample has each allele