############### Work log for analyses of RAD data for the manuscript Disentangling the complex evolutionary history of the Western Palearctic blue tits (Cyanistes spp.) ############### – phylogenomic analyses suggest radiation by multiple colonisation events and subsequent isolation ############### by Martin Stervander (martin.stervander@biol.lu.se) ############### This log makes use of plenty of neat little scripts made by people all over the place. Thank you for publishing your scripts! I have inserted references to where all those snippets have been downloaded. One script, by Shannon HEdtke, was modified to fit the purposes in this workflow, and that modified script is provided. ########### Processing of raw data, Stacks, filtering > biallelic dataset ### Raw data cleaned with process_radtags and filtered from PCR clones using clone_filter, both from Stacks 1.17. ### Resulting reads mapped against the zebra finch genome taeGut3.2.4 with Bowtie 2.2.2, setting maximum and minimum mismatch penalties (-mp) to 2 and 1, respectively. Further, we only accepted paired reads mapping in the expected relative direction within 150–1000 basepairs (bp) from each other (--no-discordant, --no-mixed). ### From the resulting SAM files we removed read 2 and used read 1 as input to Stacks: cat file.sam | awk '{ if($1~/^@/ || $1~/1$/) print $0}' > file.r1.sam ### Running Stacks on server, sql recording disabled: ref_map.pl -S -m 4 -T 45 -B paridae_all_refTgu_radtags -b 1 -D "All Paridae (low coverage samples included); mapped to Tgu w bowtie (mp21fragm150-1000nomixnodisc), paired reads removed; ind=pop, 23 ind; init req coverage p=29; phyl" -a 2014-08-29 \ -X "populations:-p 23" -X "populations:--vcf" -X "populations:--plink" -X "populations:--fasta" -X "populations:--phylip" -X "populations:--phylip_var" \ -O /path/popmaps/popmap_ind_all \ -o /path/stacks/refTgu/ \ -s /path/sam/CLP3.r1.sam \ -s /path/sam/CEH2.r1.sam \ -s /path/sam/CLG2.r1.sam \ -s /path/sam/CTE2.r1.sam \ -s /path/sam/CGC2.r1.sam \ -s /path/sam/CFU4.r1.sam \ -s /path/sam/CLA1.r1.sam \ -s /path/sam/MCE2.r1.sam \ -s /path/sam/ALG3.r1.sam \ -s /path/sam/LIB1.r1.sam \ -s /path/sam/JDI2.r1.sam \ -s /path/sam/GRE1.r1.sam \ -s /path/sam/ECA6.r1.sam \ -s /path/sam/SRE4.r1.sam \ -s /path/sam/Ccya.r1.sam \ -s /path/sam/Pmaj.r1.sam \ -s /path/sam/Ppal.r1.sam \ -s /path/sam/Phud.r1.sam \ -s /path/sam/P_a_abietum_PA7602.r1.sam \ -s /path/sam/P_a_ater_1444.r1.sam \ -s /path/sam/P_a_atlas_Perate3.r1.sam \ -s /path/sam/P_a_sardus_5567.r1.sam \ -s /path/sam/P_rufonuchalis_90130.r1.sam sudo mysql --defaults-file=/usr/local/share/stacks/sql/mysql.cnf -e "create database paridae_all_refTgu_radtags" sudo mysql --defaults-file=/usr/local/share/stacks/sql/mysql.cnf -D paridae_all_refTgu_radtags < /usr/local/share/stacks/sql/stacks.sql load_radtags.pl -D paridae_all_refTgu_radtags -p /path/stacks/paridae_all_refTgu/ -b 1 -B -a 2014-08-29 -e "All Paridae (low coverage samples included); mapped to Tgu w bowtie (mp21fragm150-1000nomixnodisc), paired reads removed; ind=pop, 23 ind; init req coverage p=23; phyl" -c index_radtags.pl -D paridae_all_refTgu_radtags -c -t ### From this database, further filtering must be made. This can be done through rerunning the Stacks component Populations with (1) all multiallelic markers blacklisted, (2) number of snps limited to 1-10, (3) minimum number or samples required (here set to all 23), and (4) maximum 12 alleles. ### This can be achieved by combinig blacklists created with custom script based on alleles > 2, and a whitelist created with Stacks export_sql applying filters 2-4 above. Run from stacks directory. mkdir blacklists whitelists # Criterium 1, blacklist (with removal of first nonsense line): grep "Allele_2" batch_1.fa | awk '{ FS = "_" } ; { print $2 }' | sort -n | uniq | sed '1d' > ./blacklists/batch_1.multiallelic.blacklist # * NB! One CLocus may for some reason not be included although it should be. # Criteria 2-4, whitelist: export_sql.pl -D paridae_all_refTgu_radtags -b 1 -a haplo -L 5 -f ./whitelists/batch1.snps1-10.samples23.whitelist.base -o tsv \ -F snps_l=1 -F snps_u=10 -F pare_l=23 -F pare_u=23 -F alle_u=12 ## Extract unique locus lists # Criteria 2-4, whitelist unique locus extraction (with removal of first nonsense line): awk '{print $1}' ./whitelists/batch1.snps1-10.samples23.whitelist.base | sort -n | uniq | sed '1d' > ./whitelists/batch1.snps1-10.samples23.alleles1-12.whitelist # Check n of records: wc -l batch1.snps1-10.samples23.alleles1-12.whitelist ## Now compare the whitelist and the blacklist, and keep a final whitelist with only whitelist loci that were not listed in the blacklist: comm -23 ./whitelists/batch1.snps1-10.samples23.alleles1-12.whitelist ./blacklists/batch_1.multiallelic.blacklist > ./whitelists/batch1.cleaned-snps1-10.samp23.all1-12.whitelist ## Make new subdirectories for population runs according to above: mkdir batch1_original cleaned-snps1-10.samp23.all1-12 mv batch_1* batch1_original/ cp batch1_original/*catal* . ### Rerun populations, with no sql output populations -b 1 -P /path/stacks/paridae_all_refTgu/ \ -M /path/popmaps/popmap_ind_all \ -B /path/stacks/paridae_all_refTgu/blacklists/batch_1.multiallelic.blacklist \ -W /path/stacks/paridae_all_refTgu/whitelists/batch1.cleaned-snps1-10.samp23.all1-12.whitelist \ -t 8 -e 'sbfI' --phylip --phylip_var --fasta --vcf --plink --genepop mv batch_1.* cleaned-snps1-10.samp23.all1-12/ mv cleaned-snps1-10.samp23.all1-12/*catal* . #### Copy and rename the fasta output files to new analyses directory mkdir /path/analyses/ cp cleaned-snps1-10.samp23.all1-12/batch_1.fa /path/analyses/raw.fa cd /path/analyses ## All steps below can be performed in one go, until the first occurrence of '%', indicating manual action. ### Trim off the SbfI overhang sequence (TGCAGG; 6 nt) which constitute the start of all sequences. # Use FASTX-toolkit, available at http://hannonlab.cshl.edu/fastx_toolkit/download.html fastx_trimmer -f 7 -i raw.fa -o trim.fa ### Rename fasta sequence headers so that no spaces and no funky characters remain (also replace "_random" with "rand") # >CLocus_10022_Sample_10_Locus_12541_Allele_0 [chr15, 7808948, +] sed -e 's/ /_/g' -e 's/\[//g' -e 's/,//g' -e 's/+]/F/g' -e 's/-]/R/g' -e 's/_random/rand/g' < trim.fa > mod.fa ### From base fasta, create allele0 and allele1 files # Fasta files grep --no-group-separator "Allele_0" -A 1 mod.fa > all0.fa grep --no-group-separator "Allele_1" -A 1 mod.fa > all1_raw.fa ### Create allele specific sequence name list files (grep) # Sorted seq name lists grep "Allele_0" mod.fa | sort > all0_seqnames.list grep "Allele_1" mod.fa | sort > all1_seqnames.list ### Rename all1 to all0 for union comparison below sed 's/Allele_1/Allele_0/g' < all1_seqnames.list > all1_seqnames_ren_all0.list ### Compare seqnames and select records only in allele0 (comm) to file comm -23 all0_seqnames.list all1_seqnames_ren_all0.list > all0_homozyg_seqnames.list ### trim away ">" from list of seq names sed 's/>//g' < all0_homozyg_seqnames.list > all0_homozyg_seqnames_noid.list ### Lookup seqnames and retrieve seqnames AND sequences to file ## Use selectSeqs.pl, available from http://raven.iab.alaska.edu/~ntakebay/teaching/programming/perl-scripts/perl-scripts.html, saved to ./scripts ./scripts/selectSeqs.pl -f all0_homozyg_seqnames_noid.list mod.fa > all0_homozyg_raw.fa ### Rename new sequences allele0 > allele1 sed 's/Allele_0/Allele_1/' all0_homozyg_raw.fa > all0_homozyg_ren_all1.fa ### Merge with original allele1 cat all0_homozyg_ren_all1.fa all1_raw.fa > all1.fa ### Merge allele0 and allele1 cat all0.fa all1.fa > biall_unsorted.fa ### Sort on locus (catalog locus, CLocus) ## Use fastaSortByName.pl, available from http://raven.iab.alaska.edu/~ntakebay/teaching/programming/perl-scripts/perl-scripts.html ./scripts/fastaSortByName.pl biall_unsorted.fa > biall_interleaved.fa # Note that there is an error message "Replacement list is longer than search list at /usr/local/share/perl/5.14.2/Bio/Range.pm line 238." produced, but this is confirmed harmless, and the script runs through. ### Converting from interleaved to sequential fasta ## Use fa2oneline.pl, available from https://github.com/joshuaburkhart/bio/blob/master/fa2oneline.pl according to: # perl fa2oneline.pl sample.fa > out.fa : perl ./scripts/fa2oneline.pl biall_interleaved.fa > biall_sequential.fa ### Split file by locus, and renamed the sequence headers to sampleid/allele only. Note that this requires plenty of RAM and you will have to do a workaround otherwise! mkdir split cd split awk -F, '{ FS = "_" } ; {getline x; print ">" $4 "_" $8 RS x > $2"_"$9"_"$10"_"$11".fa"}' ../biall_sequential.fa rm -I ___.fa ### Double check outcome, since some loci with too few samples do slip trhough somehow. wc -l *.fa | sort -g > recordlist ##%% Check and delete erroneous loci. less recordlist #% Remove locus specific files for any loci with too few samples sequenced. wc -l *.fa | sort -g > recordlist_cleaned wc -l recordlist_cleaned # n of loci +1 mkdir fasta nexus mv *.fa fasta ### Convert fasta files to nexus files # Use script convertfasta2nex.pl available from https://sites.google.com/site/shannonhedtke/Scripts cd fasta for fafile in *.fa; do ../../scripts/convertfasta2nex.pl $fafile > ../nexus/$fafile.nex done ### Concatenate nexus files # Use a modified version of the script concatenatenexus.pl, written by Shannon Hedtke (c) and available from https://sites.google.com/site/shannonhedtke/Scripts # I modified the original script, which splits sequence names on "_" and removes all duplicates of the left hand side - something that is good for cleaning up among # genbank versions of the same sequence, but bad for removing allelles... The modified script, concatenatenexus_edMSR.pl, is supplied separately in this DataDryad entry. cd ../nexus # perl ../../scripts/concatenatenexus_edMSR.pl perl ../../scripts/concatenatenexus_edMSR.pl min23concat ## Convert the concatenated dataset to fasta as well, and move all concatenated files to a separate directory # Download ElConcatenero from https://github.com/ODiogoSilva/ElConcatenero (note that Python3 must be installed) mkdir ../concat mv min23concat* ../concat/ cd ../concat ../../scripts/ElConcatenero-master/ElConcatenero.py -c -if nexus -of fasta -in min23concat.nex ## Scrutinize alignment files by checking the beginning of them: cut -c1-50 | head -n10 ########### Rudimentary notes on preparation of specific datasets ######## Create input for SNAPP ## Convert to vcf - use https://www.biostars.org/p/94573/ # cat input.fa | java -jar /path/dist/biostar94573.jar > output.fa # cat input.msa | java -jar /home/ms/Programs/jvarkit/dist/biostar94573.jar ### Run vcf file in VCF tools - http://vcftools.sourceforge.net/man_latest.html - and set min distance between snp's (--thin) to >2*RAD tag length; minor allele count (--mac) set to >2 [depending on dataset!]: vcftools --vcf all_min23concat.vcf --out all_min23concat_spaced_mac_filtered --recode --thin 174 --mac 3 --remove-indv --remove-indv ... ## Convert to vcf tab format cat all_min23concat_spaced_mac_filtered.recode.vcf | vcf-to-tab > all_min23concat_spaced_mac_filtered.tab ### Convert back vcf to fasta using https://code.google.com/p/vcf-tab-to-fasta/. # (Note that this cannot handle multi-basepair inserions.) perl /path/scripts/vcf_tab_to_fasta_alignment.pl -i all_min23concat_spaced_mac_filtered.tab > all_min23concat_spaced_mac_filtered.fa ### Convert to nexus perl /path/scripts/convertfasta2nex.pl all_min23concat_spaced_mac_filtered_seq.fa > all_min23concat_spaced_mac_filtered_seq.nex ######## Prepare vcf/plink for MDS analyses ## Change "?" to "N" sed 's/?/N/g' < all_min23concat.fas > all_min23concat_N.fas cat all_min23concat_N.fas | java -jar ~/Programs/jvarkit/dist/biostar94573.jar > all_min23concat_N.vcf ######## Creating subsets of RADseq loci for dating ### Concatenate three autosomal subsets of 160 RAD tags and one of Z markers, to four alignments. ## Make random locus selection (e.g. in R) and write selected loci to locus lists. ## Create directory with symlinks to selected loci cd /path/analyses mkdir split/subsets split/subsets/autosomal_1 split/subsets/autosomal_2 split/subsets/autosomal_3 split/subsets/z cd split/subsets/autosomal_1 while read full_loc; do ln -s ../../fasta/$full_loc ./$full_loc done < ../../../autosomal_selection_1_locus.list cd ../autosomal_2 while read full_loc; do ln -s ../../fasta/$full_loc ./$full_loc done < ../../../autosomal_selection_2_locus.list cd ../autosomal_3 while read full_loc; do ln -s ../../fasta/$full_loc ./$full_loc done < ../../../autosomal_selection_3_locus.list cd ../z while read full_loc; do ln -s ../../fasta/$full_loc ./$full_loc done < ../../../Z_selection_locus.list ### For analyses of concatenated fasta, use http://www.abc.se/~nylander/catfasta2phyml/ cd ../ /path/Scripts/catfasta2phyml.pl -f -v autosomal_1/*.fa > 23ind160loc_autosomal1_int.fasta /path/Scripts/catfasta2phyml.pl -f -v autosomal_2/*.fa > 23ind160loc_autosomal2_int.fasta /path/Scripts/catfasta2phyml.pl -f -v autosomal_3/*.fa > 23ind160loc_autosomal3_int.fasta /path/Scripts/catfasta2phyml.pl -f -v z/*.fa > 23ind160loc_z_int.fasta # Create a sequential version as well (above interleaved) perl /path/Scripts/fa2oneline.pl 23ind160loc_autosomal1_int.fasta > 23ind160loc_autosomal1.fasta perl /path/Scripts/fa2oneline.pl 23ind160loc_autosomal2_int.fasta > 23ind160loc_autosomal2.fasta perl /path/Scripts/fa2oneline.pl 23ind160loc_autosomal3_int.fasta > 23ind160loc_autosomal3.fasta perl /path/Scripts/fa2oneline.pl 23ind160loc_z_int.fasta > 23ind160loc_z.fasta # Convert to phylip /path/Scripts/conversion_scripts/fasta2phylip.pl 23ind160loc_autosomal1.fasta > 23ind160loc_autosomal1.phy /path/Scripts/conversion_scripts/fasta2phylip.pl 23ind160loc_autosomal2.fasta > 23ind160loc_autosomal2.phy /path/Scripts/conversion_scripts/fasta2phylip.pl 23ind160loc_autosomal3.fasta > 23ind160loc_autosomal3.phy /path/Scripts/conversion_scripts/fasta2phylip.pl 23ind160loc_z.fasta > 23ind160loc_z.phy