Workflow for pronghorn GBS libraries

NOTE: most steps are performed on the University of Wyoming Advanced Research Computing Center's computer cluster (Mt Moran), which uses SLURM job scheduling.

1) Download data from the sequencing facility and move it to the cluster

2) Run fastqc

3) Use bowtie2 to filter out PhiX (positive control added by the sequencing facility), E. coli, Wolbachia, and Illumina adapters

The first time, you need to build bowtie2 index files from the fasta files for all contaminants:

bowtie2-build ecoli-k-12.fasta ecoli-k-12
bowtie2-build illumina_oligos.fasta illumina_oligos
bowtie2-build illumina_oligos-mse.fasta illumina_oligos-mse
bowtie2-build illumina_oligos_w_hindIII.fasta illumina_oligos_w_hindIII
bowtie2-build phix174.fasta phix174
bowtie2-build Wolbachia_NC_006833.fasta Wolbachia_NC_006833

After building these once, use them in all future bowtie2 runs with this script:

#!/bin/bash
#SBATCH --job-name contam
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --account=wildgen
#SBATCH --time=2-00:00:00

#load modules
module load swset/2018.05
module load gcc/7.3.0
module load bowtie2

bowtie2 -q -x /project/wildgen/mlacava/filter/ecoli-k-12 -U /project/wildgen/mlacava/data/PH3.fastq --un PH3_1.fastq --local -k 1 --score-min "L,0,0.40"
bowtie2 -q -x /project/wildgen/mlacava/filter/illumina_oligos -U /project/wildgen/mlacava/filter/PH3_1.fastq --un PH3_2.fastq --local -k 1 --score-min "L,0,0.40"
bowtie2 -q -x /project/wildgen/mlacava/filter/illumina_oligos-mse -U /project/wildgen/mlacava/filter/PH3_2.fastq --un PH3_3.fastq --local -k 1 --score-min "L,0,0.40"
bowtie2 -q -x /project/wildgen/mlacava/filter/illumina_oligos_w_hindIII -U /project/wildgen/mlacava/filter/PH3_3.fastq --un PH3_4.fastq --local -k 1 --score-min "L,0,0.40"
bowtie2 -q -x /project/wildgen/mlacava/filter/phix174 -U /project/wildgen/mlacava/filter/PH3_4.fastq --un PH3_5.fastq --local -k 1 --score-min "L,0,0.40"
bowtie2 -q -x /project/wildgen/mlacava/filter/Wolbachia_NC_006833 -U /project/wildgen/mlacava/filter/PH3_5.fastq --un /project/wildgen/mlacava/data/PH3_filtered.fastq --local -k 1 --score-min "L,0,0.40"

#delete intermediate files
rm -f PH3_1.fastq
rm -f PH3_2.fastq
rm -f PH3_3.fastq
rm -f PH3_4.fastq
rm -f PH3_5.fastq

View the results of filtering:

grep -A4 'unpaired' slurmXXXXXX.out

4) Demultiplex the fastq file using parse_barcodes768.pl

Before running this perl script for the first time on Mt Moran, you have to install a perl module called Text::LevenshteinXS (the instructions below are from https://stackoverflow.com/questions/2980297/how-can-i-use-cpan-as-a-non-root-user). From your home directory on Mt Moran:

wget -O- http://cpanmin.us | perl - -l ~/perl5 App::cpanminus local::lib
eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`
echo 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`' >> ~/.bash_profile
echo 'export MANPATH=$HOME/perl5/man:$MANPATH' >> ~/.bash_profile
cpanm Text::LevenshteinXS

First, create a barcode file to demultiplex with: a csv with 3 columns with headers name, barcode, Idcode. name is index_#nt_#, barcode is an 8-10 nucleotide sequence, and Idcode is your unique identifier for each sample formatted as [0-9]+_[0-9]+_[a-z]+_[0-9] (example: 01_15_PH_410).
NOTE: remember to give replicates different id codes (e.g., PH and PHrep).
NOTE: if you use capital letters, you have to edit the split fastq script later (I did this, but it is probably easier to just use lower case letters).
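For reference, a minimal barcodes csv in this format might look like the following (the barcodes and Idcodes here are hypothetical placeholders; check that the headers and Idcode pattern match what parse_barcodes768.pl expects, and note the replicate gets a different id code):

name,barcode,Idcode
index_8nt_1,ACTGGCTA,01_15_ph_001
index_9nt_2,ACTGGCTAT,01_15_ph_002
index_10nt_3,ACTGGCTATC,01_15_phrep_002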
Once you have your barcodes csv, run this script:

#!/bin/bash
#SBATCH --job-name PH3_parse
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --account=wildgen
#SBATCH --time=1-00:00:00

perl ./parse_barcodes768.pl /project/wildgen/mlacava/parse/barcodes_PH3 /project/wildgen/mlacava/data/PH3_filtered.fastq K00179

5) Split into individual fastq files after demultiplexing using splitFastq.pl

Script:

#!/bin/bash
#SBATCH --job-name PH3_split
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --account=wildgen
#SBATCH --time=1-00:00:00

perl splitFastq.pl barcodes_PH3 parsed_PH3_filtered.fastq

NOTE: if you used capital letters or a mix of capital and lower case letters in the barcodes file for parsing, you need to edit line 27 of splitFastq.pl (change [a-z] to [A-Za-z] to allow a mix of both).

6) Run dDocent_prep.sh to produce uniq.F.fasta, which pares the data down to unique fragments to use in de novo assembly with cd-hit

Run within the folder containing the fastq.gz files:

dDocent_prep.sh

NOTE: make sure the line that zips the fastq files is uncommented if they are not zipped yet, and commented out if you have already zipped them (this part takes a long time, and it will re-zip everything if you leave that line in when you don't need it).
NOTE: change the x= parameter on line 30 to the minimum number of identical reads required for a read to be kept; a higher number makes cd-hit much faster (default x=1, I used x=5).

7) Run cd-hit

#!/bin/bash
#SBATCH --job-name cdhit.90
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --account=wildgen
#SBATCH --time=05:00:00

module load gcc/7.3.0

cd /project/wildgen/mlacava/assemble/ddocent/all
/project/wildgen/bin/cd-hit-est -i uniq.F.fasta -o ph718_cdhit.90.fa -M 0 -T 16 -c 0.90

Parameters: -c = percent match (can try 0.90, 0.94, 0.98, but probably stick with 0.90)

8) Create a bwa index in the command line

module load swset/2018.05
module load gcc/7.3.0
module load bwa/0.7.17
bwa index [consensus].fa

9) Run the bwa perl wrapper to map individual fastq files to the reference (the consensus fasta from cd-hit)

- edit lines 48-50 (paths)
- edit line 78 (replace the consensus fasta file name)

perl wrap_bwa_all.pl

9a) Count how many reads aligned for each individual

count_assembled.pl [folder with bam files]/

Check the results in the assembly_perind.txt file and use them to decide which individuals to discard at this point. Make a list of individuals to discard called to_discard and a list of individuals to keep called to_keep.

10) Discard some individuals

You want to delete individuals with few aligned reads (tens, hundreds, or a few thousand). Remove them now because the next steps start to look at how many individuals have data at a locus, and retaining low-quality individuals skews those proportions.

In assemble/bwa/all, make a folder called bam_bai and another called bam_bai_discards.

To move files to the keep folder:
while read -r PART; do mv *"$PART"* bam_bai ; done < to_keep

To move files to the discard folder:
while read -r PART; do mv *"$PART"* bam_bai_discards ; done < to_discard
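A quick sanity check after the moves (a sketch, assuming each individual has one sorted bam file; adjust the globs to your file naming):

wc -l to_keep to_discard                  # how many individuals should end up in each folder
ls bam_bai/*.bam | wc -l                  # bam files actually moved to the keep folder
ls bam_bai_discards/*.bam | wc -l         # bam files actually moved to the discard folder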
11) Run mpileup to call variants

#!/bin/bash
#SBATCH --job-name mpileup
#SBATCH --nodes=1
#SBATCH --account=wildgen
#SBATCH --time=3-00:00:00

#load modules
module load swset/2018.05
module load gcc/7.3.0
module load samtools

samtools mpileup -P ILLUMINA --BCF --max-depth 100 --adjust-MQ 50 --min-BQ 20 --min-MQ 20 --skip-indels --output-tags DP,AD --fasta-ref /project/wildgen/mlacava/assemble/ddocent/ph3_dDocent_mismatch0.98.fa /project/wildgen/mlacava/assemble/bwa/ph3/bam_bai/aln*sorted.bam --output mpileup_ph3filtered_14aug2017.bcf &> log_mpileup_ph3filtered_14aug2017.txt

12) Run bcftools for ph3 after mpileup for variant calling

module load swset/2018.05
module load gcc/7.3.0
module load bcftools/1.8

bcftools call -m --variants-only --format-fields GQ --skip-variants indels -O v -o ph3_rawvariants.vcf /project/wildgen/mlacava/assemble/bwa/ph3/mpileup_ph3_28july2017.bcf
bcftools filter --set-GTs . --include 'QUAL > 19 && FMT/GQ > 9' -O v -o ph3_rawvariants-filter.vcf ph3_rawvariants.vcf
bcftools view --min-alleles 2 --max-alleles 2 --types snps --apply-filter "PASS" --output-type v -o ph3_rawvariants-view.vcf ph3_rawvariants-filter.vcf

13) Run vcftools in the command line

module load swset/2018.05
module load gcc/7.3.0
module load vcftools/0.1.15
module load mawk/1.3.4

vcftools --vcf /project/wildgen/mlacava/variants/ph3_rawvariants-view.vcf --out ph3_variants_7aug2017 --remove-filtered-all --maf 0.05 --max-missing 0.5 --recode --thin 136 --minDP 1

Parameters:
maf = minor allele frequency
max-missing = proportion of individuals that must have data at a SNP to keep it
thin = distance required between SNPs (set to the contig length so only 1 SNP per contig is kept; we are basically taking the first SNP per read)

ALTERNATIVE: run the default vcftools command above, then look at what percent of data each individual is missing, remove individuals with <50% of data from the pre-vcftools file, and run the code above again on that file. This results in many more loci, because a locus must have data for 50% of individuals to be kept; if you get rid of the poor-quality individuals first, you retain more adequate loci.

13a) Look at missingness for all individuals in the vcf produced

module load swset/2018.05
module load gcc/7.3.0
module load vcftools/0.1.15
module load mawk/1.3.4

vcftools --vcf phall_maf0.5_miss0.5_24aug2017.recode.vcf --missing-indv

Then, based on the out.imiss file, create a list of individuals that have more than 50% missing data:

mawk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv

Using lowDP.indv, remove these individuals from the original vcf (the output of bcftools with no vcftools filtering):

vcftools --vcf phall_rawvariants-view.vcf --remove lowDP.indv --recode --recode-INFO-all --out phall_ind0.5_8sept2017

Then filter to MAF 0.05 and remove loci with 50% missing data:

vcftools --vcf phall_ind0.5_8sept2017.recode.vcf --out phall_309ind_miss0.5_maf0.05_8sept2017 --remove-filtered-all --maf 0.05 --max-missing 0.5 --recode --thin 136 --minDP 1

Check percent missingness for individuals now:

vcftools --vcf phall_309ind_miss0.5_maf0.05_8sept2017.recode.vcf --missing-indv

14) Run vcf2mpgl.pl to compile genotype likelihoods

The first time, you need to edit the vcf2mpgl.pl script at line 101 to match your sample naming convention (I had to replace [A-Z]+_[A-Z]+_\d+ with [0-9]+_[0-9]+_[A-Za-z]+_[0-9]+). If the naming convention does not match, it will report XXXX loci, 0 individuals.

vcf2mpgl.pl ph3_variants_7aug2017.recode.vcf

15) Run gl2genest.pl to convert likelihoods to a genotype matrix (locus x individual)

Returns NA if all 3 genotypes are equally likely (takes a few minutes). Give the function you want to use as the final argument; options are mean or mode.

gl2genest.pl ph3filtered_variants_15aug2017.recode.mpgl mean

This produces a txt file whose name starts with pntest.
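Before exporting, you can quickly confirm the matrix dimensions match what you expect (a sketch, assuming the pntest file is a whitespace-delimited matrix with one row per locus and one column per individual, as described above; substitute your actual file name):

wc -l pntest_mean_*.txt                   # number of rows = loci
awk '{print NF; exit}' pntest_mean_*.txt  # number of columns in the first row = individuals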
Export the pntest file to your local machine to use in R:

scp mlacava@mtmoran.uwyo.edu:/project/wildgen/mlacava/variants/all/pntest_mean_phall_309ind_miss0.5_maf0.05_8sept2017.recode.txt /Users/melanielacava/Data/AllLibs/

16) Export the list of individuals that made it through filtering so that you can match IDs back to data in R

After running all the filtering you want, use the final vcf (the one you put into vcf2mpgl.pl and then gl2genest.pl) to get the list of retained individuals and write it to a text file:

module load swset/2018.05
module load gcc/7.3.0
module load bcftools/1.8
bcftools query -l phall_309ind_miss0.5_maf0.05_8sept2017.recode.vcf > sampleIDs309.txt

Export to your local machine:

scp mlacava@mtmoran.uwyo.edu:/project/wildgen/mlacava/variants/all/inds309_8sept2017.txt /Users/melanielacava/Data/AllLibs/

17) Convert using PGDSpider as needed

Download PGDSpider from the website (http://www.cmpg.unibe.ch/software/PGDSpider/#Download_and_Installation_Instructions). From within the /Users/melanielacava/Applications/PGDSpider_2.1.1.2 folder, start the GUI:

java -Xmx1024m -Xms512m -jar PGDSpider2.jar

For STRUCTURE output with two rows per individual: click create/edit SPID file, go to STRUCTURE - Writer Questions, and on the 3rd line down change the data type from msat to SNP.

For STRUCTURE output with two columns per individual, starting from vcf format:

Convert vcf to plink:

module load intel/16.3
module load perl
module load core/2017
module load vcftools/0.1.15
vcftools --vcf phall_309ind_miss0.5_maf0.05_8sept2017.recode.vcf --plink --out phall_maf0.5_miss0.5_ind0.5_15sept2017

Export the .map and .ped files to your local machine:

scp mlacava@mtmoran.uwyo.edu:/project/wildgen/mlacava/variants/all/phall_maf0.5_miss0.5_ind0.5_15sept2017* ./

Use plink to convert plink format to structure format:

plink --file phall_maf0.5_miss0.5_ind0.5_15sept2017 --recode-structure --noweb --out phall_maf0.5_miss0.5_ind0.5_15sept2017

ABOUT THE OUTPUT: .recode.strct_in (Structure format) is produced by '--recode structure', for use by Structure. This format cannot be loaded by PLINK. It is a text file with two header lines: the first header line lists all V variant IDs, while each entry in the second line is the difference between the current variant's base-pair coordinate and the previous variant's bp coordinate (or -1 when the current variant starts a new chromosome). This is followed by one line per sample with the following 2V+2 fields:
1. Within-family ID
2. Positive integer, unique for each FID
3-(2V+2). Genotype calls, with the A1 allele coded as '1', A2 as '2', and missing as '0'

Change the structure file extension from .strct_in to .stru.

18) Calculate depth of coverage metrics

Mean individual depth (across all loci):

module load swset/2018.05
module load gcc/7.3.0
module load vcftools/0.1.14
module load mawk/1.3.4
vcftools --vcf [input file] --out [output prefix] --depth

Mean site depth (averaged across all individuals):

module load swset/2018.05
module load gcc/7.3.0
module load vcftools/0.1.14
module load mawk/1.3.4
vcftools --vcf [input file] --out [output prefix] --site-mean-depth

Mean genotype depth (depth for each genotype):

module load swset/2018.05
module load gcc/7.3.0
module load vcftools/0.1.14
module load mawk/1.3.4
vcftools --vcf [input file] --out [output prefix] --geno-depth

*not sure what this one really tells you, so skipping for now
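If you want a single summary number from the --depth run, this sketch averages the per-individual mean depths (assuming vcftools writes [output prefix].idepth with columns INDV, N_SITES, MEAN_DEPTH; check your output header first):

awk 'NR > 1 {sum += $3; n++} END {print "mean of per-individual mean depths:", sum/n}' [output prefix].idepth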