################################################################################################################ All commands and development notes for the project are provided. This is in the interests of transparency, Also because the sophi pipeline is simple as possible and does not include commands for everying (e.g. ortholog binning). Everything described in the manuscript, can be found in here somewhere. Obviously, a lot of this document will be undecipherable, contact me if you need something specific. Note, this text file has Linux line breaks (so dont open with Windows Notepad, or it will make even less sense). DC 2016-07-29. ################################################################################################################ potential titles Construction and Utility of a Species-Level Insect Tree-of-Life by Integrating Genomic and DNA Barcode Data Partitions Construction and Utility of a Species-Level Tree-of-Life for the Insects; Integrating Genomes and DNA Barcodes Construction of a Species-Level Tree-of-Life for the Insects (class:Insecta) and Utility in Taxonomic Profiling Construction of a Tree-of-Life for the Insects (class:Insecta) Integrating Genomic and Species-Rich Sequence Data, and Utility in Taxonomic Profiling ################################# NOTES on EPA...... about EPA and LRT, required for determining significant placnecemtn in epa rather than entropy or classification files, each ambiguous, # use jplace output of raxml, this has named fields, with each described in Matsen plos 1 paper.:: 'The likelihood is the likelihood of the tree with the placement attached, which may be calculated from an alignment with columns masked out that do not appear in the read. For that reason, the log likelihood of the placement may be better (closer to zero) than the log likelihood of the reference tree on the full-length alignment. The like_weight_ratio is the ratio of that placement’s likelihood to that of the other alternate placements for that read. ' # $basic_process = 0 perl /home/douglas/usr_scripts/newick/process_raxmlEPA_outtree.pl RAxML_labelledTree.$queryfile.$refs.raxmlEPA archeopteryx doesnt shpw node labels. java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file RAxML_labelledTree.invDQ.Apoidea.raxmlEPA java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* QBombus_terrestris_953274475 I54* 0.330790 0.330790 QBombus_terrestris_953274475 I53 0.330771 0.661561 QBombus_terrestris_953274475 I52 0.330209 0.991770 entropy scores, correct: QBombus_cryptarum_815930634 0.000004 incorrect:QOsmia_bicornis_953274628 0.000018 so , entropy needs to be <0.00001 TF: 'I think the value 2 is a conventional empirical cutoff value to get a rough confidence interval. That means you reject models with likelihood less than 1/100 of the maximum (-2 log likelihood means something like 1/100 likelihood). a colleague gave me this command for getting the confidence interval for threshold time. but im not sure it looks right to me, i dont know where the value 2 comes from. is this correct, or what is the right way to get confidence interval for threshold time and number of species. i include the version of the script im using. many thanks, Douglas.' res <- gmyc(tree1, method="single") res$threshold.time[(max(res$likelihood)-res$likelihood)<=2] "Certain assumptions[3] must be met for the statistic to follow a chi-squared distribution, and often empirical p-values are computed." from jplace file, correct with LWR 0.999999: {"p":[[110, -12020.565494, 0.999999, 0.015008, 0.009975]], "n":["QBombus_cryptarum_815930634"]}, incorrect with LWR 0.999994: {"p":[[218, -12249.805916, 0.999994, 0.078403, 0.216786]], "n":["QOsmia_bicornis_953274628"]}, "edge_num", "likelihood", "like_weight_ratio", "distal_length", "pendant_length" # based on very quick attempt at phylo placemtn with cytb, will need to try partitioned raxml model, # with reoptimized threholds, # since all placements seem lower LWR and pedndetn length The pendant_length is the branch length for the placement edge, and distal_length is the length from the distal (away from the root) side of the reference tree edge to the placement attachment location ################################# to do. also, figure out a way to add new taxa to a backbone tree, as to retain deep level relationships for all. and try complete species tree for combined orders, see how long. build tree under various gb settings, describe each. get deep level trees for each order, then check relationships. aiming for broadly reasonable trees, free of major error. to improve trees, set 8 running, scan outputs for lnl, best one is starting tree for 8 more etc. need a histogram, number of genes vs number of species, then plot 3 points , nuclear and mito genome, and barcode, box plot like a matrix, genes as columns, species in Y, then 3 green boxs for my matrices, and box for whole # need equivelent methods for search genomes and transcriptomes. # endopterygota contains all the hyperdiverise insect orders # consider neoptera at later iteraction # manuscript may be a methods description enabling a malaise insect monitoring pipeline. # this reference made an alignment of arthropd 18s and 28s, a;though i cannot find it in supplement # Can comprehensive background knowledge be incorporated into substitution models to improve phylogenetic analyses? A case study on major arthropod relationships perl ~/usr_scripts/insert_family_name_into_newick.pl t1 key_Oct2013_Arthropoda # good tree for all insects is v difficult, may need to restrict to holometabola # final tree search could be done seperatly for orders. would make constrains easier. # need to make sure all taxa on backbone tree are present in barcode data. # im sure monophyly constraints will be outside memory of tnt also ... # will build order at a time ... # emial stamatakis, ask how to more flexible constraints in tree search # # # *** this study was started mid Dec 2014 *** # first submission to SB: 20th September 2015 # review one (reject with resubmission encouraged): 3rd november 2015 # second submission to SB after major revision: 12th Feb 2016. # review two (reject with resubmission encouraged): 2nd April 2016 # third submission to SB after revision: 19th April 2016 # review three (accept with major revision): 1st June 2016 # forth submission to SB after revision: 14th July 2016 # # 20 apr 2016 Data Files sent to dryad for supplement: Data package title: Data from: Construction of a Species-Level Tree-of-Life for the Insects and Utility in Taxonomic Profiling Provisional DOI: doi:10.5061/dryad.27114 Data files: SOPHI_v1.02 SOPHI_v1.02_PDF perl_scripts.tar pipeline_files.tar accessory_scripts.tar transcriptomes_supermatrix.nex mitogenome_supermatrix.nex specieslevel_supermatrix.nex species_level_tree_HD species_level_tree species_level_tree.processed constraints # benchmark, compare classifications to sap (used by YU), claident, bagpipe dist. porter 2014 use bayesian RDP. # get pplacer working, compare raxml epa, bagpipe dist, claident . gives four. # would be nice to have rdp and sap also. pipeline names: s o p h (hi·er·ar·chi·cal) i e Sequence Organization for Phylogenetic of Information Scalable Organization Phylogenetics Heirachical Inforamtion Systematized Partitioned Heirachically Information SOPHIe (Scalable Open-source PHylo-Informatics) Sequence Organization for Phylogenetics on HIErarchical structured data structurally Optimized PHyloInformatics Scaling Out PhyloInformatics Structurally Optimized Pipeline for Hierarchical InferEnce SOPHIE (Structurally Optimized PHyloInformatics): A scalable phyloinformatics pipeline integrating disparate data structures. https://sourceforge.net/projects/sophi/ CORRECTIONS 3 notes: 'Coestimation methods can have excellent accuracy but are too compu-tationally intensive to use on data sets with hun- dreds of genes (24). summary metthlds are widely used but accuracy is variable 'concatenating genes with high bootstrap sup- port reduces incongruence and improves resolution' typical way to adress is to generte gene trees which are summarized useing coalescent methdos. possible solution, bin gene trees then discard superalignment which are enriched for biased loci, statristical binning: https://www.ideals.illinois.edu/handle/2142/55319 the inference of species trees from gene trees is widely understaood and applied, although it needs to incorporate the measurments of various bias which were made, and just for the sake of consistency would be nice to retain the concatenation approach for the final tree, as you say this can be by an additional ortholog filtering step. further it would be nice to better describe the alternaitve topologies present and these relate to biased orthoogs, taking these into account, the solution i tried was clustering gene trees into supergene trees, then seeing if any of these alternative topologies were the result of being made by orthologs mostly deemed baising, retaining data not so for a final superamtrix analysis. one other point, the reviewer mentioned orhtolog selection as to maximizing likelihood, although i could figure out how that would be apprporiate in this context, since those type of analyses usually require fixed alignments. concatneation or summarization of individual orthologs can generate topologies conflicting with all gene trees, recent work instead allows for initial characterization of topologcail variation (binning of gene tree). herein, after binning, supergene-trees (and equivelent single gene alignments) enriched for orthologs displaying evidence of bias are discarded, then the species tree inferred on a conatentated supoarematrix of the remaining ortholgos. ################################################################################################################ # *** NUCLEAR GENOMES *** # IGNORE this section ... cd /home/douglas/scripted_analyses/COIevol/genomics/ # Xin Zhou's new insect data seems accessible via nucleotide. has no translation nor link to protein. # also translated sequenes seem a tiny minority of the entries in INV division. # thus stick to original translated blast using DNA data. # mined data used by XZ was transcriptome shotgun assemblys, such as produced by ngs # given disparate genomic data, not centraly archived, consider just using 1KITE transcriptomes. # need to format id lines, so species name primary, so can retrive .... erm ... tar -xvzf taxdump.tar.gz perl ~/usr_scripts/parse_ncbi_tax_database.pl 50557 # parse ncbi flatfiles, make a single fasta format database # put divisions into appropriate variable: gbinv #$print_genome_taxa = 1 # line 789: print MAIN_OUT ">$gi_number" , "_$current_taxid perl ~/usr_scripts/create_fasta_database_from_genbank_flatfiles.pl # consolidate file of taxonomic information, and fasta database, # create a file with just species name and accession as fasta ID: # need to remove all taxa except genomic ones # $parse_binomial_labelled_only =1; perl ~/usr_scripts/parse_taxon_from_fastafile.pl inv key_Oct2013_Insecta #$lowerlength_limit= 240; #$upper_length_limit= 1000000; perl ~/usr_scripts/preprocess_fasta_database.pl inv.parsed #makeblastdb-2.2.28+-64bit -in inv.parsed.ng.rr -dbtype nucl -parse_seqids -max_file_sz 600000000 ######################################### # IGNORE ABOVE, and as of nov 2015, ignore this secion # instead, # all refseq insects from: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/all/ # Acyrthosiphon_pisum Apis_mellifera Apis_dorsata Apis_florea # Bombus_impatiens Bombus_terrestris Bombyx_mori Ceratitis_capitata Diaphorina_citri # Megachile_rotundata Microplitis_demolitor Musca_domestica # Nasonia_vitripennis Tribolium_castaneum # for holometabola, rm 1 apidae , Bombus_terrestris. and only need 1 hemiptera outgroup. rm Diaphorina_citri # Neoptera; Taxonomy ID: 33340; most winged insects, thoes that can flex wings over abdomen cd /home/douglas/scripted_analyses/COIevol/genomics/ rm endop_refseq_genera_proteins cat Acyrthosiphon_pisum Apis_mellifera Bombyx_mori Ceratitis_capitata Megachile_rotundata Microplitis_demolitor Musca_domestica Nasonia_vitripennis Tribolium_castaneum > endop_refseq_genera_proteins # parse the definition lines of fasta sequences, and make just species name and GI number # output is named endop_refseq_genera_proteins.b rm endop_refseq_genera_proteins.b perl /home/douglas/usr_scripts/parse_ncbi_deflines_fasta.pl endop_refseq_genera_proteins rm endop_refseq_genera_proteins.b.* makeblastdb-2.2.28+-64bit -in endop_refseq_genera_proteins.b -dbtype prot -parse_seqids -max_file_sz 600000000 # this program can be run to do 2 different things, i) do blast searches ii) look at blast results. # ii) doesnt currently work for this database # does this: #blastp-2.2.28+-64bit -task blastp -db endop_refseq_genera_proteins.b -out gene.411847.blastout2 -evalue 1e-6 -query /home/douglas/usr_files/annotated_reference_sequences/insecta_hmmer3-2/insecta_hmmer3-2/fa_dir/411847.fa -num_threads 1 -max_target_seqs 100000 -outfmt '6 sseqid sstart send length evalue pident sframe' # 1 argument: $database_name = $ARGV[0]; # $output_file_gene_id = 2; =sequential numbers starting from 1 # $datatype = 2; # $blast_translate = 0; # $evalue = "1e-10"; rm gene*blastout perl ~/usr_scripts/blast_core_orthologs.pl endop_refseq_genera_proteins.b # parse hits arguments: # 0:db 1:uclust_out 2:seq_ident_cutoff 3:trim 4:seq_length_lim 5:parse_which blastdbcmd # $extract_using_gi = 0; # $dbtype = "prot" # species filter options: $identified_species_only = 1; # $filter_by_MRS= 0;# 0 = take longest sequence # $trim_off_accession = 1; for i in {1..1579};do echo "$i"; rm gene.$i.blastout.retreive* perl ~/usr_scripts/parse_hits.pl endop_refseq_genera_proteins.b gene.$i.blastout 75 1 200 2 blastdbcmd-2.2.28+-64bit perl ~/usr_scripts/species_filter.pl gene.$i.blastout.retreived 1 done ### NOW progress to transcriptomes section.... ####################################################################################################### # IGNORE for i in {1..600};do echo "$i"; rm gene.$i.clo clustalo-1.1.0 --infile=gene.$i.blastout3.retreived.ID_filtered -outfile=gene.$i.clo --seqtype=Protein --outfmt=fa --threads=2 --log=clustallogfile done # hack: #@input_array = @ARGV; for my $k2(1..10) {my $string = "gene." . $k2 . ".clo"; push @input_array , $string} # optnios: #$remove_accession = 0 perl /home/douglas/usr_scripts/concatenate_v2.pl perl ~/usr_scripts/format_conversion.pl current_supermatrix2 insect_coreortho_sm.phy fasta phylip #Humphaplotropis_culaishanensis_(nomen_nudum) # no clear outgroup. so reroot at HEMIPTERA after its finished. -> Diaphorina_citri, Acyrthosiphon_pisum # no bootstrap # -F avoid final optimization # misof used BLOSUM62 model in initiial tests, then appears LG and some model test later. raxmlHPC-8.1.2 -F -s insect_coreortho_sm.phy -n insect_coreortho2 -m PROTCATGTR -c 25 -p 12345 # bootstrap mpirun -n 3 raxmlHPC-7.2.8-ALPHA-MPI -s insect_coreortho_sm.phy -n insect_coreortho_bs -m PROTCATGTR -c 25 -f a -x 12345 -p 12345 -# 100 # coleoptera position receives low support. the rest are high. perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_result.insect_coreortho key_Oct2013_Arthropoda perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_bipartitions.insect_coreortho_bs key_Oct2013_Arthropoda java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* #HEMIPTERA_ PSYLLIDAE_ Diaphorina_citri #HYMENOPTERA_ APIDAE_ Bombus_terrestris HYMENOPTERA_ PTEROMALIDAE_ Nasonia_vitripennis HYMENOPTERA_ BRACONIDAE_ Microplitis_demolitor HYMENOPTERA_ MEGACHILIDAE_ Megachile_rotundata HYMENOPTERA_ APIDAE_ Apis_mellifera LEPIDOPTERA_ BOMBYCIDAE_ Bombyx_mori DIPTERA_ TEPHRITIDAE_ Ceratitis_capitata DIPTERA_ MUSCIDAE_ Musca_domestica COLEOPTERA_ TENEBRIONIDAE_ Tribolium_castaneum HEMIPTERA_ APHIDIDAE_ Acyrthosiphon_pisum # *** END OF NUCLEAR GENOMES (OLD STUFF) *** ################################################################################################################ ################################################################################################################ # NOV 2015 ...... START # new approach, define putative orths based on RBH (not 1-1) of primer taxa , # search all transcriptomes in 3 outgroups, then refine each putative orth using tree based paralog remover # *** TRANSCRIPTOMES NOV 2015 *** cd ~/scripted_analyses/COIevol/corrections/transcriptomes/ perl ~/usr_scripts/taxon_parse_genbank_summary_results.pl ~/scripted_analyses/COIevol/corrections/transcriptomes/insecta_bioproject_summary.txt 50557 # [23] 7 transcriptomes downloaded from shotgun assembly archive at www.ncbi.nlm.nih.gov/Traces/wgs/ file_prefix bioproject Tax GATH01 PRJNA219530 Bittacus, mecoptera, GATG01 PRJNA219543 Corydalus cornutus ,Megaloptera GAYP01 PRJNA219547 Ctenocephalides felis, Siphonaptera GAXW01 PRJNA219555 Euroleon nostras, Neuroptera GAZH01 PRJNA219567 Inocellia crassicornis, Raphidioptera GASS01 PRJNA219594 Platycentropus radiatus, Trichoptera GAZM01 PRJNA219603 Stylops melittae, Strepsiptera HEMIPTERA_ APHIDIDAE_ Acyrthosiphon_pisum HYMENOPTERA_ APIDAE_ Apis_mellifera LEPIDOPTERA_ BOMBYCIDAE_ Bombyx_mori DIPTERA_ TEPHRITIDAE_ Ceratitis_capitata HYMENOPTERA_ BRACONIDAE_ Microplitis_demolitor HYMENOPTERA_ MEGACHILIDAE_ Megachile_rotundata DIPTERA_ MUSCIDAE_ Musca_domestica HYMENOPTERA_ PTEROMALIDAE_ Nasonia_vitripennis COLEOPTERA_ TENEBRIONIDAE_ Tribolium_castaneum # NOV 2015: neoptera; insecta, 28 incl one OUTGROUP Ephemera Machilis Atelura Folsomia Calopteryx Zorotypus Apachyus Leuctra Tetrix Grylloblatta Haploembia Timema Mantis Blaberus Mastotermes Thrips Menopon Inocellia Corydalus Euroleon Stylops Platycentropus Ctenocephalides Bittacus Bombyx Musca Tribolium Apis perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile test1 -output test1.tax -node 6960 -fasta Folsomia from Collembola , but no high level name assigned rank of order Atelura (basal insecta) from Zygentoma, but same problem neoptera to insecta additions: XXXX GAUK01000334.1| TSA: Ephemera danica; Ephemeroptera XXXX GAUM01000299.1| TSA: Machilis hrabei ; Archeognatha XXXX GAYJ01000246.1| TSA: Atelura formicaria ;Zygentoma ****# this causees problems try something else OUTGROUP: Collembola; Accession: PRJNA211850 ID: 211850; Folsomia candida strain:Berlin strain # a diplura : TSA: Thermobia OUTGROUP: GAYM01000268.1| TSA: Calopteryx splendens ;odonata GAYA01000373.1| TSA: Zorotypus caudelli ; Zoraptera GAUW01000437.1| TSA: Apachyus charteceus ; Dermaptera GAUF01000164.1| TSA: Leuctra sp.;Plecoptera GASQ01000191.1| TSA: Tetrix subulata ; orthoptera GAWP01000257.1| TSA: Grylloblatta bifratrilecta; Grylloblattodea (order) GAZA01000371.1| TSA: Haploembia palaui ; Embioptera GAVX01000368.1| TSA: Timema cristinae ; Phasmatodea GASW01000321.1| TSA: Mantis religiosa ; Mantodea GAYD01000287.1| TSA: Blaberus atropos ; Blattodea GAZE01000438.1| TSA: Mastotermes darwiniensis ; Isoptera GAXC01000389.1| TSA: Thrips palmi; Thysanoptera GAWR01000100.1| TSA: Menopon gallinae ;Psocodea GAZH01000426.1| TSA: Inocellia crassicornis ;Raphidioptera GATG01000226.1| TSA: Corydalus cornutus ; Megaloptera GAXW01000331.1| TSA: Euroleon nostras ;neuroptera GAZM01000001.1| TSA: Stylops melittae ; Strepsiptera GASS01000001.1| TSA: Platycentropus radiatus; Trichoptera GAYP01000001.1| TSA: Ctenocephalides felis; Siphonaptera GATH01000305.1| TSA: Bittacus pilicornis ; mecoptera # NEXT! download transcriptimes for the regular endops, then .. TSA; Transcriptome Shotgun Assembly. try search BIOPROJECT Bombyx AND transcriptome AND refseq; click on transcript, download fasta Accession: PRJNA205630 ID: 205630; Bombyx mori (domestic silkworm) Accession: PRJNA210139 ID: 210139; Musca domestica strain:aabys (house fly) Accession: PRJNA15718 ID: 15718; Tribolium castaneum (red flour beetle) Accession: PRJNA232132 ID: 232132; Apis dorsata (giant honeybee) mantophasmatodea is tiny order, transcriptome data is unprocessed... ignore HEMIPTERA_Acyrthosiphon bioP PRJNA29489 goes the wrong place! refseq even try this instead Accession: PRJNA251515 ID: 251515; Diaphorina citri (Asian citrus psyllid) http://www.ncbi.nlm.nih.gov/bioproject/PRJNA251515 # need multiple hemiptera and two dips # jeez, finally sumbled on where i got those from: http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=GAYY02 Neoptera; Paraneoptera; Hemiptera; Sternorrhyncha; Psylliformes; Psylloidea; Triozidae; Acanthocasuarina GAYY02: (Hemiptera; Sternorrhyncha; Psylliformes; Psylloidea; Triozidae) Acanthocasuarina muellerianae; GAUN02: ( Hemiptera; Euhemiptera; Clypeorrhyncha; Cercopoidea; Cercopidae) Cercopis vulnerata; GAUC01: ( Hemiptera; Sternorrhyncha; Aleyrodiformes; Aleyrodoidea; Aleyrodidae; Aleyrodinae) Bemisia tabaci GAXZ01: ( Diptera; Nematocera; Psychodomorpha; Trichoceroidea; Trichoceridae) Trichocera saltator ; GAXB02: ( Orthopteroidea; Mantophasmatodea; Tanzaniophasmatidae) Tanzaniophasma sp; GAUO01: ( Hemiptera; Neohemiptera; Heteroptera; Gerromorpha; Gerroidea; Veliidae) Velia caprai # need occas:GAXJ02 campodea (diplura):GAYN02, pogonognathellus (collembola):GATD02, acerentomon (protura):GAXE02 # four outgroups werer selceted from 3 major orders of non-insect hexapods, # and 2 distinct taxa from the presumed sister taxa diplura. # de novo inference of NEOPTERA orthologs from 5 species" # proteomes, download in fasta format: Neoptera; Orthopteroidea; Dictyoptera; Isoptera; Termopsidae; Termopsinae; Termopsini; Zootermopsis nevadensis: http://www.ncbi.nlm.nih.gov/protein?linkname=bioproject_protein&from_uid=203242 Neoptera; Paraneoptera; Hemiptera; Heteroptera; Euheteroptera; Neoheteroptera; Panheteroptera; Pentatomomorpha; Pentatomoidea; Pentatomidae; Pentatominae; Halyomorpha http://www.ncbi.nlm.nih.gov/protein?linkname=bioproject_protein&from_uid=298780 [Apis florea] http://www.ncbi.nlm.nih.gov/protein?linkname=bioproject_protein&from_uid=86991 [Tribolium castaneum] http://www.ncbi.nlm.nih.gov/protein?linkname=bioproject_protein&from_uid=12540 [Papilio xuthus] http://www.ncbi.nlm.nih.gov/protein?linkname=bioproject_protein&from_uid=291600 Kukalová-Peck J (1991) Fossil history and the evolution of hexapod structures. In: Naumann ID (chi fed.). The Insects of Australia. A Textbook for Students and Research Workers. Melbourne: Melbourne University Press, 141–179 Phylogeny of Basal Hexapod Lineages and Estimates of Divergence Times JEROME C. REGIER,2004 The Phylogeny of the Extant Hexapod Orders.Ward C. Wheeler, etl al Cladistics 17, 113–169 (2001 # SHOULD BE .PROT! rm *.b taxa=(Zootermopsis Halyomorpha Apis Tribolium Papilio) for i in ${!taxa[*]}; do tax=${taxa[$i]};echo "number:$i gene:$tax" #$id_format = 2, gives GI number only, this ID passes through many subsequenet steps. perl ~/usr_scripts/parse_ncbi_deflines_fasta.pl $tax.prot done # reciprocal blasts. for i in ${!taxa[*]}; do tax=${taxa[$i]};echo "number:$i gene:$tax" makeblastdb-2.2.28+-64bit -in $tax.prot.b -dbtype prot -parse_seqids done # notes on blast options, recommended for getting rhb's: # -use_sw_tback Compute locally optimal Smith-Waterman alignments? # soft masking saves computiation on db while still doing propper alignment: Schaffer et al., 2001) # soft filtering of low information segments and SW alignment detects more ortholgs # "F=s" => \$opt_F; if ($filter_string =~ /m/) {$retval .= "-soft_masking true "; # from moreno: -F ‘‘m S’’ -s T is equiv to -soft_masking true -seg yes -use_sw_tback # can run these 4 at the same time! rm *.blast_sufix rm *.ORTH* taxa2=(Halyomorpha Apis Tribolium Papilio) for i in ${!taxa2[*]}; do tax=${taxa2[$i]};echo "number:$i gene:$tax" # reciprocal blasts: blastp-2.2.28+-64bit -task blastp -db Zootermopsis.prot.b -query $tax.prot.b -out q.$tax.s.Zootermopsis.blast_sufix -max_target_seqs 1 -evalue 1e-10 -soft_masking true -seg yes -use_sw_tback -outfmt '6 qseqid sseqid evalue pident length' blastp-2.2.28+-64bit -task blastp -db $tax.prot.b -query Zootermopsis.prot.b -out q.Zootermopsis.s.$tax.blast_sufix -max_target_seqs 1 -evalue 1e-10 -soft_masking true -seg yes -use_sw_tback -outfmt '6 qseqid sseqid evalue pident length' in1=q.$tax.s.Zootermopsis.blast_sufix;in2=q.Zootermopsis.s.$tax.blast_sufix;out1=Zootermopsis.$tax.ORTH # script to get 1-1 RTH perl ~/scripted_analyses/treeoflife/scripts/reciprocal_best_hits.pl $in1 $in2 $out1 done ls -l *.ORTH.1-1 #perl ~/scripted_analyses/treeoflife/scripts/universal_orthologs.pl Zootermopsis.Apis.ORTH Zootermopsis.Papilio.ORTH Zootermopsis.Halyomorpha.ORTH Zootermopsis.Tribolium.ORTH # or 1-1, more stringent: #perl ~/scripted_analyses/treeoflife/scripts/universal_orthologs.pl Zootermopsis.Apis.ORTH.1-1 Zootermopsis.Papilio.ORTH.1-1 Zootermopsis.Halyomorpha.ORTH.1-1 Zootermopsis.Tribolium.ORTH.1-1 # jut to get a nice number!number of univ orthologs:1497: perl ~/scripted_analyses/treeoflife/scripts/universal_orthologs.pl Zootermopsis.Apis.ORTH.1-1 Zootermopsis.Halyomorpha.ORTH.1-1 Zootermopsis.Tribolium.ORTH.1-1 # bit of a puzzle. number of orthlog seems simple function of how many species used to infer them: # orthologs:1081 from inpur files:(Zootermopsis.Apis.ORTH.1-1 Zootermopsis.Papilio.ORTH.1-1 Zootermopsis.Halyomorpha.ORTH.1-1 Zootermopsis.Tribolium.ORTH.1-1) # orthologs:1659 inpur files:(Zootermopsis.Apis.ORTH.1-1 Zootermopsis.Halyomorpha.ORTH.1-1 Zootermopsis.Tribolium.ORTH.1-1) # orthologs:1970; inpur files:(Zootermopsis.Apis.ORTH.1-1 Zootermopsis.Halyomorpha.ORTH.1-1) # save as InsectPutativeOrthologs # $remove_or_retain = 2; # $regex = 1 rm Ciona.prot.b.m_rm perl ~/usr_scripts/remove_fasta_entries.pl Zootermopsis.prot.b InsectPutativeOrthologs mv Zootermopsis.prot.b.m_rm InsectPutativeOrthologs.fas # *** due to uncertainty about my own orthologs, ight be safer to used established set. perl ~/scripted_analyses/COIevol/scripts/reformat_core_ortholog_set.pl # set of orthologs is now inferred. get homologs from all insect orders ...... ################################################################################################## # finding homologs from each ome, to build a superamtrix out of # in principle should be able to get top hit for each species, #t thus its strange to mix all data in one search file. # and hard to look and interpret # instead go though each species seperatly, # then for each gene just parse one hit. # in other words, for each ortholog you need to go to each species seperatly and get the seq which is most similar # SKIP notes on refining taxon list: # transcriptomes from lesser orders: taxa=(GAYA01.1.fsa_nt GATH01.1.fsa_nt GAYP01.1.fsa_nt GAZM01.1.fsa_nt GASS01.1.fsa_nt GAXW01.1.fsa_nt GATG01.1.fsa_nt GAWR01.1.fsa_nt GAZH01.1.fsa_nt GAXC01.1.fsa_nt GAYD01.1.fsa_nt GASW01.1.fsa_nt GAZA01.1.fsa_nt GAWP01.1.fsa_nt GAVX01.1.fsa_nt GASQ01.1.fsa_nt GAUW01.1.fsa_nt GAZE01.1.fsa_nt GAUF01.1.fsa_nt GAYM01.1.fsa_nt) # main endops: taxa=(Acyrthosiphon Apis Bombyx Musca Tribolium) # basal insecta: taxa=(GAUK01.1.fsa_nt GAUM01.1.fsa_nt GAYJ01.1.fsa_nt Folsomia) # some problem seqs, switch: Thermobia outgroup instead of Folsomia; Diaphorina instead of Acyrthosiphon # yet more: taxa=(GAYY02.1.fsa_nt GAUN02.1.fsa_nt GAUC01.1.fsa_nt GAUO01.1.fsa_nt GAXB02.1.fsa_nt GAXZ01.1.fsa_nt) # rmed: HEMIPT_VELIID_Velia_caprai; HEMIPT_TRIOZI_Acanthocasuarina_muellerianae # figuring out exactly what currently have, and should have ((HEMIPTERA_CERCOPIDAE_Cercopis_vulnerata,(((PHTHIRAPTERA_MENOPONIDAE_Menopon_gallinae,THYSANOPTERA_THRIPIDAE_Thrips_palmi),((((ODONATA_CALOPTERYGIDAE_Calopteryx_splendens,EPHEMEROPTERA_EPHEMERIDAE_Ephemera_danica),(ARCHAEOGNATHA_MACHILIDAE_Machilis_hrabei,NICOLETIIDAE_Atelura_formicaria)),(((PLECOPTERA_LEUCTRIDAE_Leuctra_sp,ORTHOPTERA_TETRIGIDAE_Tetrix_subulata),((MANTODEA_MANTIDAE_Mantis_religiosa,(ISOPTERA_MASTOTERMITIDAE_Mastotermes_darwiniensis,BLATTODEA_BLABERIDAE_Blaberus_atropos)),((GRYLLOBLATTODEA_GRYLLOBLATTIDAE_Grylloblatta_bifratrilecta,MANTOPHASMATODEA_TANZANIOPHASMATIDAE_Tanzaniophasma_sp),(PHASMATODEA_Timema_cristinae,EMBIOPTERA_OLIGOTOMIDAE_Haploembia_palaui)))),(DERMAPTERA_APACHYIDAE_Apachyus_charteceus,ZORAPTERA_ZOROTYPIDAE_Zorotypus_caudelli))),(HYMENOPTERA_APIDAE_Apis_dorsata,((COLEOPTERA_TENEBRIONIDAE_Tribolium_castaneum,(RAPHIDIOPTERA_INOCELLIIDAE_Inocellia_crassicornis,(NEUROPTERA_MYRMELEONTIDAE_Euroleon_nostras,MEGALOPTERA_CORYDALIDAE_Corydalus_cornutus))),((LEPIDOPTERA_BOMBYCIDAE_Bombyx_mori,STREPSIPTERA_STYLOPIDAE_Stylops_melittae),(((TRICHOPTERA_LIMNEPHILIDAE_Platycentropus_radiatus,MECOPTERA_BITTACIDAE_Bittacus_pilicornis),SIPHONAPTERA_PULICIDAE_Ctenocephalides_felis),(DIPTERA_MUSCIDAE_Musca_domestica,DIPTERA_TRICHOCERIDAE_Trichocera_saltator))))))),HEMIPTERA_ALEYRODIDAE_Bemisia_tabaci)),HEMIPTERA_VELIIDAE_Velia_caprai,HEMIPTERA_TRIOZIDAE_Acanthocasuarina_muellerianae); # post review 1:GAXJ02.1.fsa_nt GAYN02.1.fsa_nt GATD02.1.fsa_nt GAXE02.1.fsa_nt # renamed to Occasjapyx Campodea Pogonognathellus Acerentomon, ad added to the list below #[Cercopis]+Bemisia both Hemiptera, Trichocera+[Musca] both Diptera # FINAL, complete list: taxa=(Acerentomon Apachyus Apis Atelura Bemisia Blaberus Bombyx Bittacus Calopteryx Campodea Corydalus Ctenocephalides Ephemera Euroleon Grylloblatta Haploembia Inocellia Leuctra Machilis Mantis Mastotermes Menopon Occasjapyx Pogonognathellus Platycentropus Stylops Tanzaniophasma Tetrix Timema Tribolium Thrips Trichocera Zorotypus) rm *.b rm Parse_NCBI_deflines_LOG # just process files. # some transcriptomes dont have gi's, so a unique number (9999\d+) is given to each transcript. for i in ${!taxa[*]}; do tax=${taxa[$i]};echo "number:$i gene:$tax" #rm $tax.b # WARNING ... rm command within loop like this has been known to delete files named as all strings within loop! #$id_format = 1 need different format to that above! perl ~/usr_scripts/parse_ncbi_deflines_fasta.pl $tax done # open LOG in excel to get sum transcripts:4,354,499 from 33 taxa # ~1000 orths:univORTHneoptera.fas; 1500: InsectPutativeOrthologs.fas # failure, go back to InsectaCoreOrthologs # maybe instead ... get ~5 hits for each putative orth, alignm them get tree then screen with ptp # need to account for ID formats used by ptp # compare this blasting up to 5 and taking the one if it reciprocal best hits ... # this can take at least 3 hours, perhaps split into 2 or 3. # remove blast DB's: rm *.b.nhr *.b.nin *.b.nog *.b.nsd *.b.nsi *.b.nsq # each outfile: rm *.insecta_ortholog_hits # get top hits for each ortholog; # -db_gencode 5 (insect mt) 1 (standard) for i in ${!taxa[*]}; do tax=${taxa[$i]};echo "number:$i gene:$tax" makeblastdb-2.2.28+-64bit -in $tax.b -dbtype nucl -parse_seqids tblastn-2.2.28+-64bit -db $tax.b -query InsectaCoreOrthologs -out $tax.insecta_ortholog_hits -evalue 1e-20 -num_threads 1 -max_target_seqs 1 -db_gencode 1 -outfmt '6 qseqid sseqid evalue pident length sseq' done ######################################## SKIP # may not be required .... # Post Rev1; reciprocity criteia in transcriptomes also: makeblastdb-2.2.28+-64bit -in InsectPutativeOrthologs.fas -dbtype prot -parse_seqids # delete interim and outfiles for following script: rm *.hits_fasta *.reciprocating *.recip_hits # loop through all tax again, will discard hits that dont reciprocate#tax=Occasjapyx for i in ${!taxa[*]}; do tax=${taxa[$i]};echo "number:$i gene:$tax" tblastnOUT=$tax.insecta_ortholog_hits perl ~/scripted_analyses/COIevol/scripts/reciprocal_blast.pl $tblastnOUT done # about one or 2 cercent are removed. log file is provided rm *.hits_fasta *.recip_hits # old notes: #Folsomia.insecta_ortholog_hits Acyrthosiphon.insecta_ortholog_hits #outgroup still strnage, done use one, rm Thermobia_domestica #and previous hemipteran in strange place Diaphorina_citri ######################################## rm insectaNUCL* cat *insecta_ortholog_hits > insecta_ortholog_hits.all # $dont_print_duplicates =1; $format = 3 perl ~/scripted_analyses/treeoflife/scripts/parse_ortholog_results.pl insecta_ortholog_hits.all insectaNUCL ######################################## SKIP # dont need this right now: list_ortholog_files=(insectaNUCL*) # neoptera, 3/4 is 36; for i in ${!list_ortholog_files[*]} do current_file=${list_ortholog_files[$i]};count_rows=`wc -l $current_file`;result_string=$(echo $count_rows | sed 's/\s.*//') if [ "52" -gt $result_string ]; then echo $i is $current_file not many species have this protein;rm $current_file fi done ######################################## # align each putative ortholog rm insectaNUCL*.clo list_ortholog_files2=(insectaNUCL*) no_cpus=$(cat /proc/cpuinfo | grep processor | wc -l);count=0 for i in ${!list_ortholog_files2[*]} do current_file=${list_ortholog_files2[$i]};echo current file $i is $current_file /usr/local/bin/mafft --localpair --maxiterate 100 $current_file > $current_file.clo & let count+=1 [[ $((count%no_cpus)) -eq 0 ]] && wait done # here should filter low overlap sequences... # build a single gene tree for each putative ortholog rm *.orth_ft list_ortholog_files2=(insectaNUCL*.clo) no_cpus=$(cat /proc/cpuinfo | grep processor | wc -l);count=0 for i in ${!list_ortholog_files2[*]} do current_file=${list_ortholog_files2[$i]};echo current file $i is $current_file fasttree_2.1.7 -slow -gamma $current_file > $current_file.orth_ft & let count+=1 [[ $((count%no_cpus)) -eq 0 ]] && wait done ################################# # PARALOG SCREENING; screen putative orthologs with phylotreepruner # phylotreepruner requires tree for each putative ortholog. raxml boot is very slow. do fasttree above # test:list_ortholog_files2=(insectaNUCL.WP_412626.clo) # first remove PTP outputs: rm *.clo_pruned.fa list_ortholog_files2=(insectaNUCL*.clo) for i in ${!list_ortholog_files2[*]} do current_file=${list_ortholog_files2[$i]};echo current file $i is $current_file java -cp /home/douglas/software/phylotreepruner/src_and_wrapper_scripts/ PhyloTreePruner $current_file.orth_ft 5 $current_file 0.5 u done # no interesting output from this program, it just prunes the tree and fasta file. # keep folder clean as poss: rm insectaNUCL*.clo.orth_ft # software gives no clue of the result, have to look myself: list_ortholog_files2=(insectaNUCL*.clo) rm PTP_check_results for i in ${!list_ortholog_files2[*]} do current_file=${list_ortholog_files2[$i]};echo current file $i is $current_file perl ~/scripted_analyses/COIevol/scripts/check_phylotreepruner_output.pl $current_file done ################################# # BIAS SCREEN 1; phylobayes for compositional heterogeneity; new_alignment_length:73232 # Reduced Matrix-1 (RM1) was formed as to reduce the impacts of compositional heterogeneity (Nesnidal et al. ) # run on a reduced dataset. # -s save parameters; -f overwrite outputs; -x 1 10 save freqeuncy and stopping # -comp : compositional homogeneity test (Foster, 2004); [ ]-x 0.25 1 60 rm phylobayesOUT* *.phy *.orth_ft_pruned list_ortholog_files2=(insectaNUCL*.clo) for i in ${!list_ortholog_files2[*]} do filename=${list_ortholog_files2[$i]};echo current file $i is $filename;filename_pruned=$filename'_pruned.fa' perl ~/usr_scripts/format_conversion.pl $filename_pruned $filename_pruned.phy fasta phylip fasttree_2.1.7 -slow -gamma $filename_pruned > $filename_pruned.orth_ft_pruned ~/software/phylobayes/phylobayes4.1b/data/pb -d $filename_pruned.phy -T $filename_pruned.orth_ft_pruned -ncat 1 -dgam 1 -s -f -x 1 60 phylobayesOUT.$filename ~/software/phylobayes/phylobayes4.1b/data/ppred -comp phylobayesOUT.$filename done # 1495 outputs requred:phylobayesOUT.*.clo_sample.ppht rm phylobayesOUT.*.log phylobayesOUT.*.chain phylobayesOUT.*.currenttree phylobayesOUT.*.monitor rm phylobayesOUT.*.param phylobayesOUT.*.run phylobayesOUT.*.trace phylobayesOUT.*.treelist # outputs files have this string: rm insectaNUCL.*.clo_pruned.fa.RM1 # zs -1.2=67366; -1.1=73232 ; -1.0=79769 # perl ~/usr_scripts/parse_phylobayes_results.pl zs -1.1 # 2016-06: perl ~/usr_scripts/parse_phylobayes_results.pl pv 0.05 # script scans directory for all these files (ls phylobayesOUT.*.clo_sample.ppht), # reads the overall p-value and z-scores, then if within the user threhold, # copies the matching alignment into a new file name: # ($file =~ s/phylobayesOUT\.(.+clo)_sample\.ppht/$1$string/;system("cp $file $file.RM1");) # so now you can easily scan for this subset in the local directory # 201606: #count_heterogeneous (removed):197; #count_homogeneous:1372; #percent removed:12.5 alignments=(insectaNUCL.*.RM1) rm current_supermatrix2 insectaNUCL.smatrix.RM1 partitionfile2 insectaNUCL.partitionfile.RM1 charsetfile2 insectaNUCL.charsetfile.RM1 perl ~/usr_scripts/concatenate_v2.pl -missing_data_char ? -remove_accession 2 -matrices ${alignments[*]} mv current_supermatrix2 insectaNUCL.smatrix.RM1;mv partitionfile2 insectaNUCL.partitionfile.RM1;mv charsetfile2 insectaNUCL.charsetfile.RM1 # 1271 parttions (85%) rm insectaNUCL.smatrix.RM1.epu perl ~/usr_scripts/alignment_processing/exclude_parsimony_uninformative.pl -in insectaNUCL.smatrix.RM1 -out insectaNUCL.smatrix.RM1.epu -data_type prot -remove_invar -remove_ambig perl ~/usr_scripts/format_conversion.pl insectaNUCL.smatrix.RM1.epu insectaNUCL.smatrix.RM1.epu.phy fasta phylip rm insectaNUCL.*.clo_pruned.fa.RM1 insectaNUCL.*.clo_pruned.fa.phy ################################# # BIAS SCREEN 2 , branchlength heterogeneity RM2; new_alignment_length:73051 # calculate branch length varience and total tree length for each partition , and throw some genes. # results file appendded to with each gene: # if fastrtee: rm all_branch_results # if raxml rm all_branch_results_raxml list_ortholog_files2=(insectaNUCL*.clo) for i in ${!list_ortholog_files2[*]} do filename=${list_ortholog_files2[$i]};echo current file $i is $filename; filename_pruned=$filename'_pruned.fa';fasttree_filename=$filename_pruned.orth_ft_pruned # rm table_for_R R_Variance_OUT # perl ~/usr_scripts/newick/calculate_branchlength_variance.pl $fasttree_filename # 2016-06: recaluacte using raxml trees current_raxml=/home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/raxml_201606/RAxML_bipartitions.$filename_pruned.sed rm table_for_R R_Variance_OUT perl ~/usr_scripts/newick/calculate_branchlength_variance.pl $current_raxml done # results in file:all_branch_results rm blh_filtered_files # IN R: t1<-read.table("all_branch_results", header=T, row.names=1) t1<-read.table("/home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/all_branch_results_raxml", header=T, row.names=1) boxplot(t1[,2], ylab = "Branch-length variance", cex.lab = 1.4, ylim=c(0, 0.2)) # outliers > 0.112 length((1:length(t1[,2]))[t1[,2]>=0.112]) # 167 # example 20th, 40th, 60th percentile quantile(t1[,2], c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) # 10% 20% 30% 40% 50% 60% #0.003253877 0.007455152 0.011961814 0.017990158 0.026135755 0.036912844 # 70% 80% 90% #0.048577596 0.069691944 0.111240770 # # 30%=0.011961814=57011 ; 35%=0.014816516=70881; 35.5%=0.015205025=71944; 35.7%=0.015320058=73051;36%=0.015428028=74334; 50%=0.026135755=118210; new_filenames<- attr(t1[(1:length(t1[,2]))[t1[,2]<0.015320058],], "row.names") write(new_filenames, file = "blh_filtered_files",ncolumns = 1, append = FALSE, sep = " ") # 2016-06: new_filenames<- attr(t1[(1:length(t1[,2]))[t1[,2]<0.112],], "row.names") write(new_filenames, file = "blh_filtered_files_201606",ncolumns = 1, append = FALSE, sep = " ") rm *.RM2 # $TOL_version = 3; perl ~/scripted_analyses/COIevol/scripts/filename_processing.pl blh_filtered_files_201606 alignments=(insectaNUCL*.clo_pruned.fa.RM2) #$remove_accession = 2 perl ~/usr_scripts/concatenate_v2.pl ${alignments[*]} mv current_supermatrix2 insectaNUCL.smatrix.RM2; mv partitionfile2 insectaNUCL.partitionfile.RM2; mv charsetfile2 insectaNUCL.charsetfile.RM2 rm insectaNUCL.smatrix.RM2.epu perl ~/usr_scripts/alignment_processing/exclude_parsimony_uninformative.pl insectaNUCL.smatrix.RM2 insectaNUCL.smatrix.RM2.epu perl ~/usr_scripts/format_conversion.pl insectaNUCL.smatrix.RM2.epu insectaNUCL.smatrix.RM2.epu.phy fasta phylip # delete these to reduce folder size. they can be remade easily rm insectaNUCL*.clo_pruned.fa.RM2 ################################# # BIAS SCREEN 3; matrix reduction with Mare: RM3; new_alignment_length:73730 # info file gives you stuf to write in the paper. #The taxon weighting option in MARE was set to a value that retained all taxa (option 2t 1⁄4 2). This allowed direct #comparisons with the unreduced datasets. #-d [alpha]; weighting of information content in MARE (default=3.0). Generally higher #weightings lead to smaller subsets of taxa and partitions with higher information content. # outputs of phylotreepruner paralog removal: alignments=(insectaNUCL*.clo_pruned.fa) #$remove_accession = 2; $missing_data_char = "?" perl ~/usr_scripts/concatenate_v2.pl ${alignments[*]} mv current_supermatrix2 insectaNUCL.smatrix8; mv partitionfile2 insectaNUCL.partitionfile8; mv charsetfile2 insectaNUCL.charsetfile8 mare_v01.2 insectaNUCL.charsetfile8 insectaNUCL.smatrix8 -t 2 -d 1.9 > MARE_LogScreenTEST # OLD:-d 0.1=199,220 ; 0.2=164,690; 0.3=137,275; 0.4=118,127; 1.0=62,634; 1.2=54,034; 1.4=46,499; 1.5=43,784; 1.55=41998 ; 1.6=40,730 #-d 1.55=:85070; 1.8=76812; 1.9=73730 # 201606: -d=1.9 gives 450 orths; -d=1.0 gives 671; 0.5 gives 931; 0.25 1160; 0.1 1370 # there will be a new folder with results.. mv ~/scripted_analyses/COIevol/corrections/transcriptomes/results/insectaNUCL.smatrix8_reduced ~/scripted_analyses/COIevol/corrections/transcriptomes/ perl ~/usr_scripts/alignment_processing/exclude_parsimony_uninformative.pl insectaNUCL.smatrix8_reduced insectaNUCL.smatrix8_reduced.epu jpeg("MARE_optimization.jpg" , width = 400 , height = 350) # , res=300 plot(t1[,3], xlab= "Orthologs removed", ylab = "MARE Optimality Criterion", cex.lab= 1.4) dev.off() perl ~/usr_scripts/format_conversion.pl insectaNUCL.smatrix8_reduced.epu insectaNUCL.smatrix.RM3.epu.phy fasta phylip ################################# # BIAS SCREEN 4; Gblocks RM4; new_alignment_length:73466 # outputs of phylotreepruner paralog removal: alignments=(insectaNUCL*.clo_pruned.fa) #$remove_accession = 2; $missing_data_char = "?" perl ~/usr_scripts/concatenate_v2.pl ${alignments[*]} mv current_supermatrix2 insectaNUCL.smatrix8; mv partitionfile2 insectaNUCL.partitionfile8; mv charsetfile2 insectaNUCL.charsetfile8 #-b1= Minimum Number Of Sequences For A Conserved Position; (50% of the number of sequences + 1) #-b2= Minimum Number Of Sequences For A Flank Position ; (85% of the number of sequences) #-b3= Maximum Number Of Contiguous Nonconserved Positions (8) #-b4= Minimum Length Of A Block; (10) # b5=allow gap positions with half; # Gblocks_0.91b insectaNUCL.smatrix8 -t=p -b1=19 -b2=21 -b3=8 -b4=14 -b5=a # -b1=17 -b2=17 -b3=18 -b4=4 -b5=a:101388 perl ~/usr_scripts/alignment_processing/exclude_parsimony_uninformative.pl insectaNUCL.smatrix8-gb insectaNUCL.smatrix8-gb.epu perl ~/usr_scripts/format_conversion.pl insectaNUCL.smatrix8-gb.epu insectaNUCL.smatrix.RM4.epu.phy fasta phylip ################################# # 201606: filtered for miltiple biases, filter individually, the parse orthologs which are present for all of .RM1 .RM2 .RM3 rm *.RMmulti perl ~/scripted_analyses/COIevol/scripts/multiple_bias_filter.pl ls *.RMmulti | wc -l alignments=(insectaNUCL*.clo_pruned.fa.RMmulti) perl ~/usr_scripts/concatenate_v2.pl -missing_data_char ? -remove_accession 2 -matrices ${alignments[*]} mv current_supermatrix2 insectaNUCL.smatrix.RMmulti; mv partitionfile2 insectaNUCL.partitionfile.RMmulti; mv charsetfile2 insectaNUCL.charsetfile.RMmulti rm insectaNUCL.smatrix.RMmulti.epu perl ~/usr_scripts/alignment_processing/exclude_parsimony_uninformative.pl -in insectaNUCL.smatrix.RMmulti -out insectaNUCL.smatrix.RMmulti.epu -data_type prot -remove_invar -remove_ambig perl ~/usr_scripts/format_conversion.pl insectaNUCL.smatrix.RMmulti.epu insectaNUCL.smatrix.RMmulti.epu.phy fasta phylip ################################# # and unfiltered : perl ~/usr_scripts/alignment_processing/exclude_parsimony_uninformative.pl -in insectaNUCL.smatrix8 -out insectaNUCL.smatrix8.epu -data_type prot -remove_invar -remove_ambig perl ~/usr_scripts/format_conversion.pl insectaNUCL.smatrix8.epu insectaNUCL.smatrix8.epu.phy fasta phylip ################################# # 4 searches for single tree: raxmlHPC-8.1.2 -F -s insectaNUCL.smatrix.RM1.epu.phy -n insectaNUCL.RM1 -m PROTCATBLOSUM62 -c 4 -p 12345 & raxmlHPC-8.1.2 -F -s insectaNUCL.smatrix.RM2.epu.phy -n insectaNUCL.RM2 -m PROTCATBLOSUM62 -c 4 -p 12345 & wait raxmlHPC-8.1.2 -F -s insectaNUCL.smatrix.RM3.epu.phy -n insectaNUCL.RM3 -m PROTCATBLOSUM62 -c 4 -p 12345 & raxmlHPC-8.1.2 -F -s insectaNUCL.smatrix.RM4.epu.phy -n insectaNUCL.RM4 -m PROTCATBLOSUM62 -c 4 -p 12345 & wait # then 4 boots mpirun -n 4 raxmlHPC-7.2.8-ALPHA-MPI -s insectaNUCL.smatrix.RM1.epu.phy -n insectaNUCL.boot.RM1 -m PROTCATBLOSUM62 -c 4 -f a -x 12345 -p 12345 -# 250 mpirun -n 4 raxmlHPC-7.2.8-ALPHA-MPI -s insectaNUCL.smatrix.RM2.epu.phy -n insectaNUCL.boot.RM2 -m PROTCATBLOSUM62 -c 4 -f a -x 12345 -p 12345 -# 250 mpirun -n 4 raxmlHPC-7.2.8-ALPHA-MPI -s insectaNUCL.smatrix.RM3.epu.phy -n insectaNUCL.boot.RM3 -m PROTCATBLOSUM62 -c 4 -f a -x 12345 -p 12345 -# 250 mpirun -n 4 raxmlHPC-7.2.8-ALPHA-MPI -s insectaNUCL.smatrix.RM4.epu.phy -n insectaNUCL.boot.RM4 -m PROTCATBLOSUM62 -c 4 -f a -x 12345 -p 12345 -# 250 boot_results=/home/douglas/scripted_analyses/COIevol/corrections/results/RAxML_bootstrap.insectaNUCL.boot.RM1 # single thorough search is demanding, just summarize the bootstrap trees instead. sudo sumtrees.py --min-clade-freq=0.0 --log-frequency=10 --to-newick --replace --support-as-labels --burnin=0 --output=sumtreesRM1 $boot_results perl ~/usr_scripts/newick/process_raxmlEPA_outtree.pl sumtreesRM1 perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile sumtreesRM1.procd -output sumtreesRM1.procd.tax -node 6960 -newick java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* # results, relationships within paraneoptera and endopterygota were identical in all analyses. #perl ~/usr_scripts/format_conversion.pl $current_file $current_file.phy fasta phylip #raxmlHPC-8.1.2 -s $current_file.phy -n $current_file.PTPintree -m PROTCATBLOSUM62 -c 4 -f a -x 12345 -p 12345 -# 100 # outgroups, neop:Calopteryx_splendens; insecta: Folsomia_candida (crap), Thermobia_domestica instead # -o Thermobia_domestica # file is a little too big for model testing, rm outgroups and some similar-ish taxa just to get model: # Musca_domestica Cercopis_vulnerata Occasjapyx_japonicus Euroleon_nostras Mastotermes_darwiniensis Pogonognathellus_sp Acerentomon_sp #$regex = 0;$remove_or_retain = 1; perl ~/usr_scripts/remove_fasta_entries.pl insectaNUCL.smatrix8 rm_list_for_modeltest perl ~/usr_scripts/format_conversion.pl insectaNUCL.smatrix8.m_rm insectaNUCL.smatrix8.m_rm.phy fasta phylip # ****** # crahses compiuter, run on cluster or thin the data set a bit, rm some of the similar things .. # NEED TO BOOTSRAP: #For protein models, there is an option to automatically determine the best fitting model: raxmlHPC-8.1.2 -F -s insectaNUCL.smatrix8.m_rm.phy -n insectaNUCL_PGA.m_rm -m PROTGAMMAAUTO -c 4 -p 12345 #Automatic protein model assignment algorithm:Partition: 0 best-scoring AA model: BLOSUM62 likelihood -15126333.123473 # and need to bootstrap ... PROTCATBLOSUM62 # these are beyond my laptops, use cluster... fasttree_2.1.7 -log fasttree_logfile -nosupport -gtr -gamma -nt -constraints ftcons < endop3geneE1 > endop3geneE1.ftd # ****** #@ranksarray = (" order" , " family"); perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile RAxML_result.insectaNUCL8 -output insectaNUCL8.tax -node 6960 -newick # problem with outgroup folsomia, and hemoptera. # no basal looks great, try a few additional taxa in derived endop .... # need removing, musca domestica, cercopis, velia, acanthocasaurina # order result, for constraining mgenomes: ((ARCHAEOGNATHA,NICOLETIIDAE),((ODONATA,EPHEMEROPTERA),(((DERMAPTERA,ZORAPTERA),((PLECOPTERA,ORTHOPTERA),((MANTODEA,(ISOPTERA,BLATTODEA)),((GRYLLOBLATTODEA,MANTOPHASMATODEA),(PHASMATODEA,EMBIOPTERA))))),(((PHTHIRAPTERA,THYSANOPTERA),HEMIPTERA),(HYMENOPTERA,((COLEOPTERA,(RAPHIDIOPTERA,(NEUROPTERA,MEGALOPTERA))),((LEPIDOPTERA,STREPSIPTERA),(DIPTERA,(SIPHONAPTERA,(TRICHOPTERA,MECOPTERA)))))))))); #seems missing grom mtgenome alignment:Nicoletiidae, Zoraptera, Grylloblattodea #but is in the DB:Zygentoma; Nicoletiidae; Atelura. # Neoptera; Neoptera incertae sedis; Zoraptera; # size measured by number of informative sites after concat is the most accuate menthod in maffft, ali-score incerease singla to noise ration 'Nevertheless, whole mitochondrial genomes have been shown inappropriate to target the relationships among insect orders, mostly due to base compositional biases and unequal rates of nucleotide substitution across groups [52].' Cameron S. L., Beckenbach A. T., Dowton M., Whiting M. F. 2006. Evidence from mitochondrial genomics on interordinal relationships in insects. Arth. Syst. Phyl. 64, 27–34 polyneoptera has some studies on interordinal, endops have very few. # resulkts: cd /home/douglas/scripted_analyses/COIevol/corrections/results/ /home/douglas/scripted_analyses/COIevol/corrections/results/RAxML_bipartitions.insectaNUCL.smatrix.epu.RM1 cd ~/scripted_analyses/COIevol/corrections/transcriptomes/ # end of OMES NOV 2015 ################################################################################################################ # ################################################################################################################ # corrections June 2016 # conflicting singla was suggesed based on inferencce using matrices in which certain baising factors were removed (SUPPL FIG X). Thus a futher filtering step was conducted on the topological space from the ortholog matrix. the gene trees built from all orthologs were 'binned' according to X et al. # the software creates roughly equally sized bins countaining groups of gene trees . # the size of bins is determined by the threshold used for determination of conflict. # a pair of gene trees regarded conflicting where there is an incompativle grouping with support >X. # i set a strinent threshold of 100% support to return a managable number of bins/ suoergene trees. # support were SH based (), calcuated authomatically by fasttree. the orthologs with each bin were checked against each other for variation in the incidene of biasing characteristics, which would suggest artifactual topologies. orthologs were removed based on the resukts of this analysis. splitting / binning gene trees according to fully supported (100%) conflict gave 24 bins. there was no evidence from this test that the topolopgical variation between bins corresponded different levels of branch length varience (p = 0.749) or information content or orthologs (p = 0.540). however the analysis was suggestive of different levels of compostinal heterogeneity between varying supergene trees (p = 0.057). hence, orthologs were filtered where the compostional heterogenieyt test was signifiact (p <= 0.05). 197 (12.5 percent) heterogeneous orthologs were removed. constructuon of 4 trees in which these biasing elements are individually removed moslty congruent, apart from a small number of cases in the Polyneoptera although highly supported by bootstrap. (orthoptera and dermaptera kind of in the same place), partriculary, plecoptera and zoraptera assumbing varouis posiitons d genreted at he level or supermatrix rather than ortholog while counterintutitve , while conflicting topologies can be highly supported in variosuly constructed phylogenomic supertmatrivces (Garrison) analysis of topological variation in the dataset did not give convincing evidence that these incongruences were the result of the bias themselvs. supergene trees binned according to fully (100) or partially supported conflicts did not show consistent blah blah, since bootstap appeared not to capture incongruence acreoss orthologs in this case, a seperate measure was applied, internode uncertaintly gives support from gene trees fr nodes in a complete tree. a and c could be artifactual, due to includion of orthologs containing b l var .... no particular evidence that the topological variation was due to these,. still three primary trees were generated , i) one unfiltered, and ii) a second filtered at the level of ortholog; orthologs showing compositrional heterogenetiy tsest at p=value <0.05, orthologs outlying for branchlength legth varience (> X) or with MARE inforamtion content of $filename.sed done rm insectaNUCL*.clo_pruned.fa # need a tree corresponding to modified IDs, just get new ones: alignments=(insectaNUCL*.clo_pruned.fa.sed) rm insectaNUCL*.clo_pruned.fa.sed.ft for i in ${!alignments[*]} do filename=${alignments[$i]};echo current file $i is $filename; fasttree_2.1.7 -slow -gamma $filename > $filename.ft done # proprotion to percentage; rm insectaNUCL*.sed.ft.int for i in ${!alignments[*]} do filename=${alignments[$i]};echo current file $i is $filename; cat $filename.ft | sed -e 's/[)][0][\.][0][0][0-9]/)0/g' | sed -e 's/[)][0][\.][0]\([1-9]\)[0-9]/)\1/g' | sed -e 's/[)][0][\.]\([1-9][0-9]\)[0-9]/)\1/g' | sed -e 's/[)][1][\.][0][0][0]/)100/g' > $filename.ft.int done # raxml is preferable rm insectaNUCL*.clo_pruned.fa.sed.phy for i in ${!alignments[*]} do filename=${alignments[$i]};echo current file $i is $filename; perl ~/usr_scripts/format_conversion.pl $filename $filename.phy fasta phylip done # make bash script for i in ${!alignments[*]} do filename=${alignments[$i]}; echo mpirun -np 20 /home/wei/software/RAxML-MPI/raxmlHPC-MPI-SSE3 -D -e 1.0 -s $filename.phy -n $filename -m PROTCATBLOSUM62 -c 4 -f a -x 12345 -p 12345 -# 250 >> sub20 done # add this #PBS -N raxml_20160621 #PBS -l nodes=1:ppn=20 #PBS -q blade #PBS -l walltime=240:00:00 #PBS -d /home/wei/DC/insect_TOL_analysis/ #PBS -j oe # manually make this folder:alignments.zip scp /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/alignments.zip wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/alignments.zip scp /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/sub20 wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/sub20 ssh -l wei 10.0.0.181 # wei789 cd /home/wei/DC/insect_TOL_analysis/ unzip alignments.zip mv ./alignments/insect* . qsub sub20 # 170945.admin sed -e 's/-np 20/-np 12/' sub20 > sub21 sed -e 's/ 250/ 100/' sub21 > sub22 sed -e 's/ppn=20/ppn=12/' sub22 > sub23 #170946.admin # how many finished: ls RAxML_bipartitions.*insectaNUCL*.clo_pruned.fa.sed | wc -l # log on, mkdir raxml_201606 mv RAxML_*.insectaNUCL.WP_*.clo_pruned.fa.sed ./raxml_201606/ gzip -r raxml_201606 tar -zcvf raxml_201606.tar.gz raxml_201606/ # logout scp wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/raxml_201606.tar.gz . tar -zxvf raxml_201606.tar.gz # or this might be easier: cd /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/raxml_201606/ scp wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/raxml_201606/RAxML_bipartitions.insectaNUCL.*.clo_pruned.fa.sed . # follow instructions in the binning readme (X et al 20XX) # on creating folder and input file structure # -o 'Stylops_melittae|717937013' alignments=(insectaNUCL*.clo_pruned.fa.sed) for i in ${!alignments[*]} do filename=${alignments[$i]};echo current file $i is $filename; # option 1, raxml # mkdir /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/genes_dir_raxml/$filename # cp $filename /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/genes_dir_raxml/$filename/$filename.fasta raxml_tree=/home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/raxml_201606/RAxML_bipartitions.$filename cp $raxml_tree /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/genes_dir_raxml/$filename/RAxML_bipartitions.tree # option two, try the fasttree trees, these have support values # mkdir /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/genes_dir/$filename # cp $filename /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/genes_dir/$filename/$filename.fasta # cp $filename.ft.int /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/genes_dir/$filename/fasttree done # RAxML_bipartitions.tree # export BIN_HOME=/home/douglas/software/binning_orthologs/code export BINNING_HOME=/home/douglas/software/binning_orthologs/github/binning-master # tried simpler filenames, with / without branchlengths. although i notice the true trees in their test data are rooted. # ft: $BINNING_HOME/makecommands.compatibility.sh genes_dir 50 pairwise_out_ft_50 fasttree # rax: # $BINNING_HOME/makecommands.compatibility.sh genes_dir_raxml 95 pairwise_out_rax_95 RAxML_bipartitions.tree chmod +x commands.compat.50.genes_dir ./commands.compat.50.genes_dir cd /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/pairwise_out_ft_50/ ls| grep -v ge|sed -e "s/.50$//g" > genes_ft_50 python $BINNING_HOME/cluster_genetrees.py genes_ft_50 50 # this indicates distribution, count whats in each bin: wc -l bin.*.txt # whereas this gives number of bins: ls bin.*.txt | wc -l # 50=663; 25=689; 95=117; 99=39; 100=24 # inputs are parse_phylobayes_LOG all_branch_results MARE_LogScreen # makes R table # also need to specify binning output direcroy in the script rm sumtrees_commands gene_tree_bin.* # fasttree branchlength var: # bl_file=/home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/all_branch_results # raxml branchlength var: bl_file=/home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/all_branch_results_raxml pb_file=/home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/parse_phylobayes_LOG mr_file=/home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/MARE_LogScreen perl ~/scripted_analyses/insect_TOL_analysis/scripts/parse_ortholog_bias_scores.pl $pb_file $bl_file $mr_file r_table_201606_raxml_100 chmod +x sumtrees_commands ./sumtrees_commands cd /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/pairwise_out_rax_100/ rm gene_tree_bin.*.sum.procd* for val in {0..16} do echo $val perl ~/usr_scripts/newick/process_raxmlEPA_outtree.pl gene_tree_bin.$val.sum perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile gene_tree_bin.$val.sum.procd -output gene_tree_bin.$val.sum.procd.tax -node 6960 -newick done # fi gtree here # IN R: # 1= phylobayse p val; 2= branch length varience; 3=mare information # ft: # rax: t1<-read.table("r_table_201606_raxml_100", header=F, row.names=1) # high z-score = heterogeneity, low z-score = homogeneity, http://www.nature.com/articles/srep20528 # comparing list of values:high significance = high z-score; not signofocant = low (minus) zscore. # pphylobayes pval; branchlength var, mare bin_id<-t1[,4] phylobayes<-t1[,1] branchlen<-t1[,2] mare<-t1[,3] # tiff(filename = "fig6.tiff", width = 3200 , height = 6400, units="px", res=600, compression = "jpeg") # jpeg("fig6.jpeg" , width = 10*600 , height = 14*600, res=600) jpeg("ortholog_analysis2.jpg" , width = 1400 , height = 1300 , res=200) # png("ortholog_analysis.png", 600, 600) par(mar=c(3, 6, 3, 3)) par(mfrow=c(3,1)) boxplot(phylobayes ~ t1[,4], ylab = "PhyloBayes p-value", cex.lab = 1.4) boxplot(branchlen ~ t1[,4], ylab = "Branch-length variance", ylim = c(0,0.19), cex.lab = 1.4) boxplot(mare ~ t1[,4], ylab = "MARE information content", cex.lab = 1.4) dev.off() t2<-t1 # low val = bad; signif pval; hetero t2[,1]<-rank(t1[,1], na.last = TRUE, ties.method = "random") # low val = good t2[,2]<-rank(-t1[,2], na.last = TRUE, ties.method = "random") # low val = bad (probably); inforkatino content t2[,3]<-rank(t1[,3], na.last = TRUE, ties.method = "random") # these give bad == low rank sum_ranks<-t2[,1]+t2[,2]+t2[,3] # high number = good orthologs # 11, 13, 21, 22 points(c(12, 14, 22, 23),c(-0.01, -0.01, -0.01, -0.01),col="red", pch=1, cex=3) draw.circle(12,-0.1,1,nv=100,border=NULL,col="red",lty=1,lwd=1) rect(11.9, -0.1, 12.1, -0.09, density = NULL, angle = 45, col = "red", border = NULL, lty = par("lty"), lwd = par("lwd")) # find bins enriched for orthologs showing signfacnt compostional hetergeneity # for (val in 0:16) # for (val in c(1, 5, 8, 11, 15)) # delete retain_these_names, then for (val in c(0, 2, 3, 4, 6, 7, 9, 10, 12, 13, 14, 16)) { indices <- (1:length(bin_id))[bin_id == val]; bin_size <- length(indices) current_ranks <- sum_ranks[indices]; sum_current_ranks <- sum(current_ranks) # test5<-length(current_phylobayes [current_phylobayes <= 0.1]) / length(current_phylobayes) print(c(val, bin_size, sum_current_ranks)) new_filenames<- attr(t2[indices,], "row.names") write(new_filenames, file = "retain_these_names",ncolumns = 1, append = TRUE, sep = " ") } # retain orthologs from select bins: cd /home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/ cd /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/ rm *.RMbin # $TOL_version = 3; retainfile=/home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/pairwise_out_rax_100/retain_these_names perl ~/scripted_analyses/COIevol/scripts/filename_processing.pl $retainfile alignments=(insectaNUCL*.fa.sed.RMbin) perl ~/usr_scripts/concatenate_v2.pl -missing_data_char ? -remove_accession 2 -matrices ${alignments[*]} mv current_supermatrix2 insectaNUCL.smatrix.RMbin; mv partitionfile2 insectaNUCL.partitionfile.RMbin; mv charsetfile2 insectaNUCL.charsetfile.RMbin rm insectaNUCL.smatrix.RMbin.epu perl ~/usr_scripts/alignment_processing/exclude_parsimony_uninformative.pl -in insectaNUCL.smatrix.RMbin -out insectaNUCL.smatrix.RMbin.epu -data_type prot -remove_invar -remove_ambig perl ~/usr_scripts/format_conversion.pl insectaNUCL.smatrix.RMbin.epu insectaNUCL.smatrix.RMbin.epu.phy fasta phylip # rm 5: 193963, 200433, 205781, 207862, 207562 [1] 0 93 226192 ;[1] 1 92 205781 *;[1] 2 92 212541 ; [1] 3 92 251054 ;[1] 4 92 227113 ;[1] 5 92 193963 *;[1] 6 92 233987; [1] 7 92 215969 ;[1] 8 93 207562 *;[1] 9 93 215991 ;[1] 10 93 219766; [1] 11 92 200433 *;[1] 12 93 225122 ;[1] 13 92 211140 ;[1] 14 92 208752; [1] 15 92 207862 *;[1] 16 92 231767 # 'Analysis of variance tells you if the means of all four categories are the likely the same # or if at least one mean likely differs from the others (rejection of the null hypothesis)' fit<-aov(bin_id ~ phylobayes+branchlen+mare); summary(fit) # fit<-kruskal.test( phylobayes ~ bin_id); bl on fattree, bins ft #ft 100 phylobayes 1 100 100.27 2.089 0.149; branchlen 1 32 31.61 0.659 0.417; mare 1 0 0.08 0.002 0.967 # ft 95 phylobayes 1 295 294.8 0.267 0.605; branchlen 1 587 586.6 0.532 0.466; mare 1 301 301.1 0.273 0.601 # ft 70 phylobayes 1 3731 3731 0.107 0.743; branchlen 1 50168 50168 1.444 0.230; mare 1 11074 11074 0.319 0.572 # ft 50 phylobayes 1 19841 19841 0.528 0.468; branchlen 1 52264 52264 1.390 0.239; mare 1 0 0 0.000 0.999 # branchlength var valuated from raxml gne e trees: # binned raxml tree # 50 phylobayes 1 1308 1307.9 0.523 0.470; branchlen 1 1736 1736.3 0.694 0.405; mare 1 967 966.8 0.386 0.534 # 70 phylobayes 1 177 177.2 0.537 0.464; branchlen 1 595 594.8 1.803 0.180; mare 1 502 501.8 1.521 0.218 # 95 phylobayes 1 70 70.30 1.159 0.2819 ; branchlen 1 208 208.32 3.434 0.0641 .; mare 1 0 0.41 0.007 0.9348 100 phylobayes 1 2 2.02 0.084 0.772; branchlen 1 3 3.40 0.142 0.707; mare 1 42 42.24 1.759 0.185 # next go up to bias screen one section and make the supermatix according to specifications just inferred cp /home/douglas/scripted_analyses/insect_TOL_analysis/data/sub12 sub16 cp /home/douglas/scripted_analyses/insect_TOL_analysis/data/sub12 sub21 # then edit ... # then transfer: scp /home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/insectaNUCL.smatrix.RM1.epu.phy wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/insectaNUCL.smatrix.RM1.epu.phy scp /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/sub16 wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/sub16 # and unfiltered: scp /home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/insectaNUCL.smatrix8.epu.phy wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/insectaNUCL.smatrix8.epu.phy scp /home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/sub21 wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/sub21 # also : insectaNUCL.smatrix.RMmulti.epu.phy; insectaNUCL.smatrix.RMbin.epu.phy scp /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/insectaNUCL.smatrix.RMbin.epu.phy wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/insectaNUCL.smatrix.RMbin.epu.phy ssh -l wei 10.0.0.181 # wei789 cd /home/wei/DC/insect_TOL_analysis/ qsub sub12 qstat -Q qstat -f 170886.admin # jusr under 3 days to run on 10 processors # copy results to local directory rm RAxML_*insectaNUCL.smatrix.RM1.201606* scp wei@10.0.0.181:/home/wei/DC/insect_TOL_analysis/RAxML_*insectaNUCL.smatrix.RM1.201606* . cat RAxML_boot*insectaNUCL.smatrix.RM1.201606* > insectaNUCL.smatrix.RM1.boot sumtrees.py --min-clade-freq=0.0 --log-frequency=10 --to-newick --replace --support-as-labels --burnin=0 --output=sumtreesRM1 insectaNUCL.smatrix.RM1.boot perl ~/usr_scripts/newick/process_raxmlEPA_outtree.pl sumtreesRM1 # sumtreesRM1.procd RAxML_bipartitions.insectaNUCL.smatrix8.epu, RAxML_bipartitions.insectaNUCL.smatrix.RMmulti.epu infile=IC_.tax_complete_bootless IC_Score.RMmulti.IC10 sumtreesRMbin.procd RAxML_bipartitions.insectaNUCL.smatrix.RMmulti.epu perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile $infile -output $infile.tax -node 6960 -newick # version 1.4.2 crashes sometimes, but will print .jpg java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* # insectaNUCL.smatrix.RMbin.epu.phy # fasttre: cat insectaNUCL*.clo_pruned.fa.sed.ft.int > test19 # cat /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/raxml_201606/RAxML_bipartitions.insectaNUCL.*.clo_pruned.fa.sed > test20 # for gene trees, try multi-RM orthologs only, and try with/w-o boots. # and try only those gene trees with complete taxa cd /home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/ all_alignments=(insectaNUCL.*.clo_pruned.fa) rm *.tax_complete for i in ${!all_alignments[*]} do filename=${all_alignments[$i]};echo current file $i is $filename; test_path=/home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/$filename.RMmulti if [ -f $test_path ]; then echo ORTHOLOG NOT FILTERED retained_gene_tree=/home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/raxml_201606/RAxML_bipartitions.$filename.sed # cp $retained_gene_tree /home/douglas/scripted_analyses/COIevol/corrections/retained_genes/$filename.sed.RMmulti grep ">" $test_path | wc -l count_rows=`grep ">" $test_path | wc -l`; if [ $count_rows -gt "32" ]; then echo FULL;cp $retained_gene_tree /home/douglas/scripted_analyses/COIevol/corrections/retained_genes/$filename.sed.RMmulti.tax_complete fi fi done cd /home/douglas/scripted_analyses/COIevol/corrections/retained_genes/ cat insectaNUCL.*.clo_pruned.fa.sed.RMmulti > raxml_IC_in # try without boot values, i dont see any difference. # also not obvously better or worse using only taxonomically compelte orthologs, with/o boots all_alignments=(insectaNUCL.*.clo_pruned.fa.sed.RMmulti) all_taxcomplete_alignments=(insectaNUCL.*.clo_pruned.fa.sed.RMmulti.tax_complete) rm *.bootless for i in ${!all_taxcomplete_alignments[*]} do filename=${all_taxcomplete_alignments[$i]};echo current file $i is $filename;cat $filename | sed -e 's/[)][0-9]*/)/g' > $filename.bootless done cat insectaNUCL.*.clo_pruned.fa.sed.RMmulti.bootless > raxml_IC_in.bootless cat insectaNUCL.*.clo_pruned.fa.sed.RMmulti.tax_complete > raxml_IC_in.tax_complete cat insectaNUCL.*.clo_pruned.fa.sed.RMmulti.tax_complete.bootless > raxml_IC_in.tax_complete_bootless # wont be used:sumtreesRM1.procd; RAxML_bipartitions.insectaNUCL.smatrix.RMmulti.epu best_tree=/home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/RAxML_bestTree.insectaNUCL.smatrix.RMmulti.epu # with boots:raxml_IC_in, withuot:raxml_IC_in.bootless raxmlHPC-8.2.4 -f i -t $best_tree -z raxml_IC_in.tax_complete_bootless -n RMmulti.IC10.tax_complete_bootless -m PROTCATBLOSUM62 /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.2/lib/figtree.jar $* final tree: /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/AAA.RAxML_bipartitions.insectaNUCL.smatrix.RMmulti.epu.tax.orders cd /home/douglas/scripted_analyses/COIevol/corrections/nuclear_gene_trees/ END OF corrections June 2016 ################################################################################################################ # ################################################################################################################ # old endoptertgota stuff, ignore........ misof gets: (((((MECOP,SIPHON),DIP),(LEP,TRICHOP)),((COL,STREP),((NEUROP,MEGALOP),RAPHID)))HYM) (STREP,(DIP,((SIPH,(MEGALOP,(((MECOP,TRICH),RAPH),NEURO))),(HYM,(COL,LEP)))),HEMI); (((((Trichoptera,Lepidoptera),(Diptera,(Siphonaptera,Mecoptera))),((Raphidioptera,(Megaloptera,Neuroptera)),(Coleoptera,Strepsiptera))),Hymenoptera),Hemiptera); # 6mar2015: (STREP,(DIP,((MECOP,TRICH),((((MEGALOP,RAPH),NEURO),(HYM,((HYM_Megachile,((LEP,COL),HYM_Apis)),HYM_Nasonia))),SIPH))),HEM); # as above but with colrm: # ((HYM,((((SIPHON,MECOP),DIP),(TRICH,LEP)),((RAPHID,(MEGALOP,NEUROP)),(STREP,COLEOP)))),HEMIP); # in the working folder are only the files for the taxa selected, other insects are in the sub folder: # /transcriptomes/non_holometabola_data # put transcriptome data for eaxh taxa, into a single file. rm endop_transcriptome* gunzip GA*01.1.fsa_nt.gz cat GA*01.1.fsa_nt > endop_transcriptomes gzip GA*01.1.fsa_nt # output file is input.b, this has been deleted already perl /home/douglas/usr_scripts/parse_ncbi_deflines_fasta.pl endop_transcriptomes # big file takes up too much space rm endop_transcriptomes # blast db for nucleotide makeblastdb-2.2.28+-64bit -in endop_transcriptomes.b -dbtype nucl -parse_seqids -max_file_sz 600000000 # options: # $output_file_gene_id = 2 # $datatype = 1 # $blast_translate = 1 # $evalue = "1e-10"; # does this: # tblastn-2.2.28+-64bit -db transcriptomes.b -out gene.1.blastout -db_gencode 1 -evalue 1e-6 -query /home/douglas/usr_files/annotated_reference_sequences/insecta_hmmer3-2/insecta_hmmer3-2/fa_dir/411847.fa -num_threads 1 -max_target_seqs 100000 -outfmt '6 sseqid sstart send length evalue pident sframe sseq' rm gene*blastout perl ~/usr_scripts/blast_core_orthologs.pl endop_transcriptomes.b # here you will use parse_blastout instead of parse_hits, # unlike previously, the sequence is in the blast output file for i in {1..1579};do echo "$i"; rm gene.$i.blastout.retreive* perl ~/usr_scripts/parse_blastout.pl gene.$i.blastout 200 75 perl ~/usr_scripts/species_filter.pl gene.$i.blastout.retreived 1 done ################################################################################################################ ################################################################################################################ old stuff, ignore # homologs now processed. alignment and tree building ... cd /home/douglas/scripted_analyses/COIevol/transcriptomes/ # here, combine with genome data. # # misof used mafft with L-INS-i /usr/local/bin/mafft --localpair --maxiterate 100 #/usr/local/bin/mafft --retree 2 --maxiterate 2 --op 4 --ep 0.123 mamDB.blastout.retreived.up1ps > mamDB.blastout.retreived.up1ps.mafft.retr2maxit2op4ep0.123 rm gene.*.combined # probably are none of these, they should be deleted after each loop rm gene.*.clo for i in {1..1579};do echo "$i"; fileA=/home/douglas/scripted_analyses/COIevol/genomics/gene.$i.blastout.retreived.ID_filtered fileB=/home/douglas/scripted_analyses/COIevol/transcriptomes/gene.$i.blastout.retreived.ID_filtered cat $fileA $fileB > gene.$i.combined rm gene.$i.clo #clustalo-1.1.0 --infile=gene.$i.combined -outfile=gene.$i.clo --seqtype=Protein --outfmt=fa --threads=2 --log=clustallogfile2 /usr/local/bin/mafft --localpair --maxiterate 100 gene.$i.combined > gene.$i.clo rm gene.$i.combined done # hack concatenate_v2: #####@input_array = @ARGV; #@input_array =(); #for my $k2(1..1579) # { # my $string = "gene." . $k2 . ".clo"; # if(-e $string){push @input_array , $string;print ""}else{print "alignment $string not found\n"}; # }; # optnios: #$remove_accession = 0 # check script. last seq in file is not processed same as others..... seems ok when checked mar2015 # outputs: current_supermatrix2, partitionfile2 # try with this, to make gblocks remove all the missing data columns: # missing_data_char = "-"; # change back to "N" afterwards!!! rm current_supermatrix2 rm partitionfile2 perl /home/douglas/usr_scripts/concatenate_v2.pl rm endop.all_omes mv current_supermatrix2 endop.all_omes # ignore this bit .... nees to remove these after tree is inferred. # dont want these (dont overlap with mtgenomes). delete from alignment: # Zoraptera (Zorotypus), Trichoptera (Platycentropus), Strepsiptera (Stylops), # Raphidioptera (Inocellia), Phthiraptera (Menopon), # Ephemeroptera (Ephemera), Grylloblattodea (Grylloblatta) #perl /home/douglas/usr_scripts/remove_fasta_entries.pl transcriptomes.fs rmfile #$remove_or_retain = 2 #Apis Bombyx Tribolium Musca Bittacus (Mecoptera) Ctenocephalides (Siphonaptera) Euroleon (Neuroptera) #Corydalus (Megaloptera) #trichoptera, strepsiptera missing #perl /home/douglas/usr_scripts/remove_fasta_entries.pl transcriptomes.fs retainfile # liberal gblock settings $max_contig_nonconserved (-b3=) = 10; $min_length_block (-b4=)= 5; # stringent: 8;$min_length_block = 10; # 1. Minimum Number Of Sequences For A Conserved Position: . 9 # at time of writing i was not of aware of standard command line inputs for gblocks # hence the cumbersome stuff for the settings. although i think they do have this. rm endop.all_omes-gb* #Gblocks_0.91b endop.all_omes -t=p -b2=15 -b1=12 -b3=8 -b4=10 -b5=h # 2 made, dialoowing gaps, half or all. curiously the one not allowing any, makes a longer alignment. use that one Gblocks_0.91b endop.all_omes -t=p -b5=h Gblocks_0.91b endop.all_omes -t=p -b5=a b2=Minimum Number Of Sequences For A Flank Position b1=Minimum Number Of Sequences For A Conserved Position b3=Maximum Number Of Contiguous Nonconserved Positions b4=Minimum Length Of A Block b5=Allowed Gap Positions n,h,a # old settings: #Gblocks alignment: 325032 positions (30 %) in 9017 selected block(s) # 2:16, 1:12, 3:10, 4:5 # Gblocks alignment: 106276 positions (32 %) in 461 selected block(s) # gives tree, hymeoptera not mp # ((STREPSIPTERA_Stylops_melittae:0.06189539367502596562,((DIPTERA_Musca_domestica:0.00961616714285768240,DIPTERA_Ceratitis_capitata:0.01706240774637286101):0.02818617264565642655,((MECOPTERA_Bittacus_pilicornis:0.02800603160281030929,TRICHOPTERA_Platycentropus_radiatus:0.04857137309213292731):0.04239469094751244227,((((MEGALOPTERA_Corydalus_cornutus:0.01963220428361820910,RAPHIDIOPTERA_Inocellia_crassicornis:0.02171077103343165005):0.00582898357994953223,NEUROPTERA_Euroleon_nostras:0.02112863777032552084):0.00653658619233268758,(HYMENOPTERA_Microplitis_demolitor:0.01533025300056667199,((HYMENOPTERA_Megachile_rotundata:0.00369454309960003460,((LEPIDOPTERA_Bombyx_mori:0.37007178625628561752,COLEOPTERA_Tribolium_castaneum:0.25421445966873151834):0.32962822903023680787,HYMENOPTERA_Apis_mellifera:0.05132314613213448395):0.29718435958204703073):0.77803261589467187509,HYMENOPTERA_Nasonia_vitripennis:0.01105433747263853336):0.00754719113757916504):0.01675792807345378652):0.00778876053531215619,SIPHONAPTERA_Ctenocephalides_felis:0.01788275935490130780):0.00370122494273565899):0.01658749739301467255):0.02207205408963975810):0.01338376277503407555,HEMIPTERA_Acyrthosiphon_pisum:0.01338376277503407555); # as above but with colrm: # (((HYM_Microplitis,(HYM_Nasonia,(HYM_Megachile,HYM_Apis))),((((SIPHON,MECOP),DIP),(TRICH,LEP)),((RAPHID,(MEGALOP,NEUROP)),(STREP,COLEOP)))),HEMIP); # 2:15, 1:12, 3:8, 4:10 but with gaps and missing data all as "-" durin g concatenate # Gblocks alignment: 13251 positions (4 %) in 97 selected block(s) ((HYM,((((RAPH,(MEGALOP,NEUROP)),COL),STREP),((DIP,(SIPHON,MECOP)),(TRICHOP,LEP)))),HEM); # then with few remaining gaps after globocks, turned back to N. just strep next to col now: ((HYM,(((DIP,(SIPHON,MECOP)),(TRICHOP,LEP)),((RAPHID,(MEGA,NEUROP)),(STREP,COL)))),HEMI); # complete alignment: # -b2=15 -b1=12 -b3=8 -b4=10 # 853530 positions Gblocks alignment: 42870 positions (5 %) in 261 selected block(s) #misof gets: #(((((MECOP,SIPHON),DIP),(LEP,TRICHOP)),((COL,STREP),((NEUROP,MEGALOP),RAPHID)))HYM) ((Hymenoptera,(((Diptera,(Siphonaptera,Mecoptera)),(Trichoptera,Lepidoptera)),((Raphidioptera,(Megaloptera,Neuroptera)),(Strepsiptera,Coleoptera)))),Hemiptera); # better not to use this, gblocks is oublished, widely used , and can do something similar #perl ~/usr_scripts/alignment_processing/rm_alignment_columns.pl endop.all_omes4-gb #cutoff for column removed:0.01 percent gapped #16 entries. length:115132 #115133 columns, of which 100737 removed. 14396 remaining # complete data: perl ~/usr_scripts/format_conversion.pl endop.all_omes.final endop.all_omes.final.phy fasta phylip raxmlHPC-8.1.2 -F -s endop.all_omes.final.phy -n endop.all_omes.final.complete -m PROTCATBLOSUM62 -c 4 -p 12345 -o Acyrthosiphon_pisum perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_result.endop.all_omes.final.complete key_Oct2013_Arthropoda # gblocks under mostly default, rm gapped posiiotns: perl ~/usr_scripts/format_conversion.pl endop.all_omes-gba endop.all_omes-gba.phy fasta phylip raxmlHPC-8.1.2 -s endop.all_omes-gba.phy -n endop.all_omes.gba -m PROTCATBLOSUM62 -c 4 -p 12345 -o Acyrthosiphon_pisum #$which_rank = 3;#$labels_used = 1; #1=binomial. perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_bestTree.endop.all_omes.gba key_Oct2013_Arthropoda # (early iteration had topology errors for unknown reason) # finish here #perl ~/usr_scripts/format_conversion.pl endop.all_omes-gb endop.all_omes-gb.phy fasta phylip #perl ~/usr_scripts/format_conversion.pl endop.all_omes4-gb.colrm endop.all_omes4-gb.colrm.phy fasta phylip # -F avoid final optimization # PROTCATBLOSUM62 quicker than PROTCATGTR? # "-f E": execute very fast experimental tree search #raxmlHPC-8.1.2 -f E -F -s endop.all_omes.phy -n endop.all_omes -m PROTCATBLOSUM62 -c 4 -p 12345 -o Acyrthosiphon rm RAx*endop.all_omes raxmlHPC-8.1.2 -s endop.all_omes-gb.phy -n endop.all_omes -m PROTCATBLOSUM62 -c 4 -p 12345 -o Acyrthosiphon_pisum #raxmlHPC-8.1.2 -s endop.all_omes4-gb.colrm.phy -n endop.all_omes4.gb_colrm -m PROTCATBLOSUM62 -c 4 -p 12345 -o Acyrthosiphon_pisum #$which_rank = 2; # 1=family, 2=order, 3=order+family #$labels_used = 1; #1=binomial. 2=genus only rm RAxML_bestTree.endop.all_omes.fams #perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_result.endop.all_omes4 key_Oct2013_Arthropoda perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_bestTree.endop.all_omes key_Oct2013_Arthropoda java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* # *** END OF TRANSCRIPTOMES *** ##################################################################################################################### ##################################################################################################################### # *** MITO-GENOMES *** almost universally titled as mitochondrion, complete genome. complete mitochondrial genome is found elsewhere in entry only 9 endops are titled "complete mitochondrial genome" , no refseqs search nucl for: ("complete" [TITLE] AND "genome" [TITLE] AND ("mitochondrial" [TITLE] OR "mitochondrion" [TITLE])) AND endopterygota [organism] ################################################################################################################# cd ~/scripted_analyses/COIevol/corrections/mtgenomes/ #WARNING, somewhere the species names are getting truncated .... OCT 2015: search nucl for: ("complete" [TITLE] AND "genome" [TITLE] AND ("mitochondrial" [TITLE] OR "mitochondrion" [TITLE])) AND insecta [organism] insects (1893) moths (569) butterflies (271) more... (298) flies (374) Drosophilidae (110) more... (264) grasshoppers &c. (248) grasshoppers (200) more... (48) bugs (222) beetles (123) hymenopterans (91) termites (60) Neuroptera (31) Megaloptera (22) dragonflies & damselflies (22) walking sticks (20) mayflies (14) stoneflies (13) Archaeognatha (13) roaches (12) lice (11) more... (48) this order does not have a complete mtgenome, but has a partial one DQ241796.1 GI:82792109 Orthopteroidea; Grylloblattodea; Grylloblattidae; Grylloblatta or less stringent, complete or partial ("genome" [TITLE] AND ("mitochondrial" [TITLE] OR "mitochondrion" [TITLE])) AND insecta [organism] 2273 items saved as insecta_partial_mtgenome_summary ignore above, you wish to keep complete genomes, then add partial genomes if they are from a genus not present in the completes. so: ("partial" [TITLE] AND "genome" [TITLE] AND ("mitochondrial" [TITLE] OR "mitochondrion" [TITLE])) AND insecta [organism] insecta_mtgenome_partialONLY_summary save summary file then parse: Hexapoda:6960 #$level = 3;# =genus perl ~/usr_scripts/taxon_parse_genbank_summary_results.pl insecta_summary.txt 6960 # 600 genus level flatfile saved as insecta_mtgenome_genus # or incl partial genomes: perl ~/usr_scripts/taxon_parse_genbank_summary_results.pl insecta_partial_mtgenome_summary 6960 # 816 genus level for complete plus partial genomes # or ONLY partials: perl ~/usr_scripts/taxon_parse_genbank_summary_results.pl insecta_mtgenome_partialONLY_summary 6960 # 259 at genus level saved as insecta_partial_mtgenomes.gb # read file of partial genomes, see what genera are there file1=/home/douglas/scripted_analyses/COIevol/corrections/mtgenomes/insecta_mtgenome_partialONLY_summary.taxon_parse_genbank_summary_results_LOG perl ~/scripted_analyses/COIevol/scripts/add_partial_mtgenomes.pl $file1 insecta_mtgenome_genus OUTFILENAME # genera with a partial genome and no complete genome: partials=(Cardiochiles Ischnusia Leptomyxotermes Sipyloidea Cephalonomia Tipula Jugositermes Orophyllus Haliplus Polistes Limnichidae Coridius Urodus Ceutorhynchus Ips Peltodytes Deporaus Parocyusa Necydalis Promirotermes Labiotermes Teredus Monomachus Epiphyas Harmonia Histeromerus Brachycerus Dysstroma Diceroprocta Serritermes Liophloeus Cephalotermes Hydroporus Cordulia Diachasmimorpha Platystomos Philotrypesis Tenthredo Noteridae Ethmia Procubitermes Hellula Alucita Orsodacne Embiratermes Disteniazteca Ancistrotermes Bittacomorphella Rhopalapion Glyptotermes Acanthotermes Acronicta Micropterix Diadromus Donacia Diplonychus Euurobracon Styloperla Aphelocheirus Daktulosphaira Termitogeton Sericostoma Rubiconia Larinus Chromatomyia Eumacrocentrus Micrambina Homolobus Cavitermes Enicospilus Strophosoma Basidentitermes Cryptotermes Parrhinotermes Eoxenos Glossotermes Dermestes Ameletus Sumalia Therophilus Rhodoneura Zeuzera Monocellicampa Hypomedon Caetetermes Propylea Apoderus Pollenia Stigmella Tettigades Euryomma Prorhinotermes Duplidentitermes Sigara Anthonomus Pambolus Prioninae Sitona Emmelina Potamanthus Stenurella Polyphylla Grylloblatta Oegoconia Trigonotylus Pedetontinus Glaphyrus Asphondylia Noctua Pselaphanus Astalotermes Abraeinae Odontotermes Capitonius Copelatus Endomychus Orestes Ceratosolen Exema Himacerus Meloidae Tychus Camponotus Leptopus Cionus Kateretes Astrotischeria Syntermes Phaedon Teslasena Xyletinus Sphaerotermes Triraphis Synacanthotermes Endrosis Sinocapritermes Tineola Pterocomma Macrocentrus Sigalphus Magicicada Arescus Doa Kyklioacalles Quercusia Hodotermopsis Hypera Polydrusus Silvanus Bradysia Enicmus Eudonia Bacillus Orthognathotermes Trinodes Vesperus Byrrhus Anoplotermes Apatania Dolichorhinotermes Aphidius Epermenia Brachytemnus Phanerotoma Perimede Doydirhynchus Coenonympha Barynotus Gaeana Amalotermes Drepana Osphya Aphantopus Inachis Salpingus Vespula Polyrhachis Ichneutes Spilopyra Ateuchotermes Apatelopteryx Pericapritermes Platypus Postsubulitermes Nanophyes Euplectus Phyllonorycter Meru Imatidium Pennisetia Timema Aderitotermes Teloganodidae Nargus Cheilomenes Leptocorisa Lacosoma Promecognathus Constrictotermes Otiorhynchus Neohirasea Cameraria Hylobius Rugitermes Ephemerella Acanthormius Silvestritermes Thremma Mesocapnia Phyllium) # $how_many_to_print = 1; () rm *.partial_mt for i in ${!partials[*]}; do tax=${partials[$i]};echo "number:$i genus:$tax" perl ~/usr_scripts/fetch_entries_from_genbank_flatfile.pl insecta_partial_mtgenomes.gb $tax.partial_mt $tax done cat *.partial_mt > all.partial_mts rm *.partial_mt # finally get 3 outgroups: Diplura:NC_022674.1; (Protura) NC_026666.1 / 769837779 ; (Collembola,Entomobryomorpha) NC_010533.1 / 171473634 NC_022674.1,NC_026666.1,NC_010533.1 save as mt_outgroups.gb cat insecta_mtgenome_genus all.partial_mts mt_outgroups.gb > insecta_genus.complete_and_partial_MT_OG # try method used for Dikarya mtgenomes, # name parse several refseqs, quickly curate, then use these to blast all ("complete" [TITLE] AND "genome" [TITLE] AND ("mitochondrial" [TITLE] OR "mitochondrion" [TITLE])) AND insecta [organism] AND refseq #$level = 1;# =order perl ~/usr_scripts/taxon_parse_genbank_summary_results.pl insecta_refseq_mtgenome_summary.txt 6960 save as insecta_refseq_mtgenome_order #output: 228 fams found ;543 gen found ;27 orderss found # $how_many_to_print = 1; () taxa=(Archaeognatha Blattodea Coleoptera Dermaptera Diptera Ephemeroptera Hemiptera Hymenoptera Isoptera Lepidoptera Mantodea Mantophasmatodea Mecoptera Megaloptera Neuroptera Odonata Orthoptera Phasmatodea Phthiraptera Plecoptera Psocoptera Raphidioptera Siphonaptera Strepsiptera Thysanoptera Trichoptera Zoraptera) rm *refseq_insect_mt for i in ${!taxa[*]}; do tax=${taxa[$i]};echo "number:$i gene:$tax" perl ~/usr_scripts/fetch_entries_from_genbank_flatfile.pl insecta_refseq_mtgenome_order $tax.refseq_insect_mt $tax done # $id_format = 5; $organize_outfiles_by_genename = 1; $dont_print_duplicate_seq = 0;$prefered_field_for_parsing_gene_labels = 1 rm InsMT* for i in ${!taxa[*]}; do tax=${taxa[$i]};echo "number:$i gene:$tax" perl ~/usr_scripts/parse_translations_from_genbank_flatfile.pl -in $tax.refseq_insect_mt -out $tax.refseq_insect_mt.prot -out_suffx InsMT done # manually remove some badly annotated files # $organize_outfiles_by_genename = 0; $id_format = 4 (arbitrary sequence count as ID, protein IDs can be repeated thus not useful); # from here, try complete mtgenomes, or compelte and partial #perl ~/usr_scripts/parse_translations_from_genbank_flatfile.pl -in insecta_mtgenome_genus -out insecta_mtgenome_genus.prot -out_suffx NULL #perl ~/usr_scripts/parse_translations_from_genbank_flatfile.pl -in insecta_genus.complete_and_partial_MT -out insecta_genus.complete_and_partial_MT.prot -out_suffx NULL rm insecta_genus.complete_and_partial_MT_OG.prot perl ~/usr_scripts/parse_translations_from_genbank_flatfile.pl -in insecta_genus.complete_and_partial_MT_OG -out insecta_genus.complete_and_partial_MT_OG.prot -out_suffx NULL rm insecta_genus.complete_and_partial_MT_OG.prot.p* makeblastdb-2.2.28+-64bit -in insecta_genus.complete_and_partial_MT_OG.prot -dbtype prot -parse_seqids rm *InsMT_hits InsMT_proteins=(InsMT*) for i in ${!InsMT_proteins[*]} do current_file=${InsMT_proteins[$i]} echo current file $i is $current_file blastp-2.2.28+-64bit -task blastp -db insecta_genus.complete_and_partial_MT_OG.prot -query $current_file -out $current_file.InsMT_hits -max_target_seqs 10000 -evalue 1e-10 -outfmt '6 qseqid sseqid evalue pident length sseq' done cat *InsMT_hits > InsMT.allhits # retrieve hits for each ortholog, then MSA BETTER TO USE protein id UNIQUE strings for each scaffold segment rm InMT* # string added to start of each ortholog ID, for each gene includes retreived fasta and alignment # $format = 3 (binomial | arbitrary_unique_ID); $dont_print_duplicates = 1; with multiple queries for each gene, there will be multiple hits for each subject-gene. perl ~/scripted_analyses/treeoflife/scripts/parse_ortholog_results.pl InsMT.allhits InMT # ALIGNMENT (slowish) no_cpus=$(cat /proc/cpuinfo | grep processor | wc -l);count=0 rm InMT*.clo split_query_files=(InMT*) for i in ${!split_query_files[*]} do current_file=${split_query_files[$i]} echo current file $i is $current_file /usr/local/bin/mafft --localpair --maxiterate 100 $current_file > $current_file.clo & let count+=1 [[ $((count%no_cpus)) -eq 0 ]] && wait done # SINGLE GENE TREES (quite quick) rm InMT*.clo.SGT alignment_files=(InMT*.clo) for i in ${!alignment_files[*]} do filename=${alignment_files[$i]};echo current file $i is $filename; fasttree_2.1.7 -slow -gamma $filename > $filename.SGT done # PARALOG SCREENING (quick) rm *.clo_pruned.fa for i in ${!alignment_files[*]} do current_file=${alignment_files[$i]};echo current file $i is $current_file java -cp /home/douglas/software/phylotreepruner/src_and_wrapper_scripts/ PhyloTreePruner $current_file.SGT 5 $current_file 0.5 u done pruned_alignments=(InMT*.clo_pruned.fa) rm current_supermatrix2 InMT.smatrix partitionfile2 InMT.partitionfile charsetfile2 InMT.charsetfile #$missing_data_char = "?"; $remove_accession = 2 perl ~/usr_scripts/concatenate_v2.pl ${pruned_alignments[*]} mv current_supermatrix2 InMT.smatrix; mv partitionfile2 InMT.partitionfile; mv charsetfile2 InMT.charsetfile perl ~/usr_scripts/format_conversion.pl InMT.smatrix InMT.smatrix.phy fasta phylip # cosntrain mtgenome tree using random samples of the 1000 nuclear ome boot trees cat ~/scripted_analyses/COIevol/corrections/results/RAxML_bootstrap.insectaNUCL.boot.RM* > ~/scripted_analyses/COIevol/corrections/mtgenomes/insectaNUCL.all_boots sort -R -o insectaNUCL.all_boots.rand insectaNUCL.all_boots # each boot tree in a different file: # wrapper script, does some minor processing, and for each ome backbone replicate, # runs the script for placing mtgenome data into this:COIevol/scripts/parse_topology_for_constraints ... # runs slowly! unfortunately ... inefficient since it reads the taxonomy db each time. rm backbone_constrained.BOOT* # previous:$supermatrix InMT.smatrix **** perl ~/usr_scripts/process_backbone_tree_for_setting_constraints.pl -newick -node 6960 -boot_trees insectaNUCL.all_boots.rand -supermatrix XXX -output backbone_constrained # bootstrap the mtgenome alignment ... rm InMT.smatrix.phy.BS* raxmlHPC-8.1.2 -f j -s InMT.smatrix.phy -n NULL -m PROTCATMTART -b 12345 -# 1000 # does not use much memory, can do several at once. no_cpus=$(cat /proc/cpuinfo | grep processor | wc -l);count=0 rm RAxML*InsMito_B* for i in {1..250} do raxmlHPC-8.1.2 -D -e 1.0 -F -s InMT.smatrix.phy.BS$i -n InsMito_B$i -m PROTCATMTART -p 12345 -g backbone_constrained.BOOT$i & let count+=1 [[ $((count%no_cpus)) -eq 0 ]] && wait done # done a regular raxml search for each mtgenome boot aligment, constrained by booted ome topology, # combine trees to single file and get consensus rm InsMito.all_raxml_boots InsMito_sumtrees InsMito_sumtrees.procd InsMito_sumtrees.procd.tax cat RAxML_result.InsMito_B* > InsMito.all_raxml_boots sumtrees.py --min-clade-freq=0.0 --log-frequency=10 --to-newick --replace --support-as-labels --burnin=0 --output=InsMito_sumtrees InsMito.all_raxml_boots perl ~/usr_scripts/newick/process_raxmlEPA_outtree.pl InsMito_sumtrees # draw a version of the tree which has taxonomic names on the internal nodes, # so you can see placement errors, these will be long deep level taxon strings, near the tips. #$my_node_labells = 0;$trucacte_node_taxnames = 1; perl ~/scripted_analyses/COIevol/scripts/draw_mtgenome_tree.pl InsMito_sumtrees.procd #@ranksarray = (" order", " family"); $process_backbone_tree_terminal_IDs = 2; perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile InsMito_sumtrees.procd -output InsMito_sumtrees.procd.tax -node 6960 -newick java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file InsMito_sumtrees.procd.tax # UNconstrained mtgenome: rm InsMito_UNconstr.all_raxml_boots InsMito_UNconstr.sumtrees InsMito_UNconstr.sumtrees.procd InsMito_UNconstr.sumtrees cat RAxML_result.InsMito_UNconstr_B* > InsMito_UNconstr.all_raxml_boots sumtrees.py --min-clade-freq=0.0 --log-frequency=10 --to-newick --replace --support-as-labels --burnin=0 --output=InsMito_UNconstr.sumtrees InsMito_UNconstr.all_raxml_boots perl ~/usr_scripts/newick/process_raxmlEPA_outtree.pl InsMito_UNconstr.sumtrees perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile InsMito_UNconstr.sumtrees.procd -output InsMito_UNconstr.sumtrees.procd.tax -node 6960 -newick # ALSO require single tree with branchlengths perl /home/douglas/usr_scripts/newick/add_branchlengths.pl RAxML_result.InsectMT_BOOT1 cat RAxML_result.InsectMT_BOOT* > InsectMT_BOOTall #raxmlHPC-8.1.2 -f b -s InMT.smatrix.ord.phy.BS1 -n InsectMT_BOOTbipart -m PROTCATMTART -p 12345 -t RAxML_result.InsectMT_BOOT1.with_branchlengths -z InsectMT_BOOTall #"-f b": draw bipartition information on a tree provided with "-t" based on multiple trees # (e.g., from a bootstrap) in a file specified by "-z" raxmlHPC-8.1.2 -s InMT.smatrix.ord.phy -n InMT.smatrix.ord2 -m PROTCATMTART -c 4 -p 12345 -g mtprotein_treesearch_constraints perl ~/usr_scripts/parse_ncbi_tax_database.pl 6960 # $which_rank = 1;$labels_used = 2 perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_bestTree.InMT.smatrix.ord key_Oct2013_Hexapoda perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile RAxML_bipartitions.InsectMT_BOOTbipart -output RAxML_bipartitions.InsectMT_BOOTbipart.tax -node 6960 -newick #seems missing grom mtgenome alignment:Nicoletiidae, Zoraptera, Grylloblattodea #but is in the DB:Zygentoma; Nicoletiidae; Atelura. # Neoptera; Neoptera incertae sedis; Zoraptera; perl ~/usr_scripts/format_conversion.pl InMT.smatrix InMT.smatrix.phy fasta phylip raxmlHPC-8.1.2 -F -s InMT.smatrix.phy -n InMT -m PROTCATMTREV -c 4 -p 12345 perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile RAxML_result.InMT.smatrix.ord -output A.RAxML_result.InMT.smatrix.ord.tax -node 6960 -newick java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* cd ~/scripted_analyses/COIevol/corrections/mtgenomes/ # *** END oF MITOGENOMES NOV2015 *** ################################################################################################################# send to file, genbank full AND 1 hemiptera! , manually cat this to the file Acyrthosiphon pisum [organism] AND ("mitochondrion" [TITLE] AND "complete genome" [TITLE]) saved as mtcomplete_endop.gb and put in folder /COIevol/mtgenomes/ # *** somewhere, remove sub-species names. and Phloeomyzus and Eriosoma [doesnt matter for endopterygota] cd /home/douglas/scripted_analyses/COIevol/mtgenomes/ # outputs, inv and prot rm inv* prot perl ~/usr_scripts/create_fasta_database_from_genbank_flatfiles.pl # consider genus level ...# do you need 5 papilio's etc... # advantage might be, any monophylyl constraint, even small. is going to improve and speed up the barcode tree. # or ... maybe combine with nuclear genome data, making genus level chimeras to optimize phylogenetic information. # select $trim_off_accession = 0; perl /home/douglas/usr_scripts/species_filter.pl inv 1 # dont need to do this:remove underscore NC_ manually from BOTH files # make sure you have this line:my @zipped_genbank_flatfiles = ("mtcomplete_endop"); #$excised_seq_length_limit= 60;#$accession_or_gi= 2; # 1=accession; 2=gi number#$parse_protein = 1; rm nameparsed.* perl ~/usr_scripts/multiple_sequence_splitter.pl inv.ID_filtered mtgenes=(ATP6 CO1 CO2 CO3 CYTB NAD1 NAD2 NAD3 NAD4 NAD4L NAD5 NAD6) # misof used mafft. whats good enough for a science paper, must work well rm *.clo for i in ${!mtgenes[*]}; do gene=${mtgenes[$i]};echo "number:$i gene:$gene" /usr/local/bin/mafft --localpair --maxiterate 100 nameparsed.$gene > $gene.clo done # check problem sequences: hemipterans Phloeomyzus and Eriosoma .... these two mostly missing data # CONCATENATE, MAKE MiTo-GENOME SUPERMATRIX. # ** manually change DNA to MTREV in the partition file after concatenation # $remove_accession = 1; # $genus_level = 1; rm current_supermatrix2 mtprotein rm partitionfile2 mtprotein_partitions perl /home/douglas/usr_scripts/concatenate_v2.pl ATP6.clo CO1.clo CO2.clo CO3.clo CYTB.clo NAD1.clo NAD2.clo NAD3.clo NAD4.clo NAD4L.clo NAD5.clo NAD6.clo mv current_supermatrix2 mtprotein mv partitionfile2 mtprotein_partitions # manually check the names, binomials only, easier on this file: #perl ~/usr_scripts/format_conversion.pl mtprotein mtprotein.phy fasta phylip # four name problems, conveniently near the start of the file # *** manually RM from mtprotein: Chaetosoma, Priasilpha, Trachypachus, Ceratitis_capitata, Antheraea_pernyi, Troides_aeacus #Humphaplotropis_culaishanensis_(nomen_nudum) # mar2015, new all_omes tree: # ((HYMENOPTERA,(((DIPTERA,(SIPHONAPTERA,MECOPTERA)),(TRICHOPTERA,LEPIDOPTERA)),((RAPHIDIOPTERA,(MEGALOPTERA,NEUROPTERA)),(STREPSIPTERA,COLEOPTERA)))),HEMIPTERA); in new tree, seems exactly the same as found in fasta protein mtgenomes file! COLEOPTERA DIPTERA HEMIPTERA HYMENOPTERA LEPIDOPTERA MECOPTERA MEGALOPTERA NEUROPTERA RAPHIDIOPTERA SIPHONAPTERA STREPSIPTERA TRICHOPTERA # parse genome tree. gives constraints for mtgenome search #nucl_genome_tree=/home/douglas/scripted_analyses/COIevol/genomics/RAxML_result.insect_coreortho2 # not actually used, write tre in script. nucl_genome_tree=/home/douglas/scripted_analyses/COIevol/transcriptomes/RAxML_bestTree.endop.all_omes.orders # check why some stuff is being thrown out by this script..... # corrected 26mar2015 ... reads up to date ncbi tax db instead of out of date processed tax file. mtgenome_fastafile=mtprotein rm mtprotein.ord mtprotein.phy perl ~/scripted_analyses/COIevol/scripts/parse_topology_for_constraints.pl $mtgenome_fastafile 50557 # that script also tells you how many of each order in the file: #order in fasta file:Coleoptera count:43; Diptera count:94; Hemiptera count:1; Hymenoptera count:30 #Lepidoptera count:200; Mecoptera count:3; Megaloptera count:9; Neuroptera count:14; Raphidioptera count:1 #Siphonaptera count:1; Strepsiptera count:1; Trichoptera count:2 perl ~/usr_scripts/format_conversion.pl mtprotein.ord mtprotein.phy fasta phylip contraints=mtprotein_treesearch_constraints # -q mtprotein_partitions;# -F avoid final optimization # PROTCATBLOSUM62 quicker than PROTCATGTR? # "-f E": v fast search, does NOT work with constraints # partition model seems to slow things a lot. # -o Archaeognatha_Pedetontus , Hemiptera_Acyrthosiphon_pisum # quick, and doesnt do what we need... raxmlHPC-8.1.2 -D -F -e 1.0 -s mtprotein.phy -n mtprot_quick_constr -m PROTCATMTREV -c 4 -p 12345 -g $contraints # no bootstrap rm RAx*mtprot_constr.6_tax_rmMAY # phy_input output! raxmlHPC-8.1.2 -s mtprotein.phy -n mtprot_constr.6_tax_rmMAY -m PROTCATMTREV -c 12 -p 12345 -g $contraints -o Hemiptera_Acyrthosiphon_pisum #$which_rank= 1;# 1=family,$labels_used = 1; perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_bestTree.mtprot_constr.6_tax_rm key_Oct2013_Arthropoda # no bootstrap, partitioned: rm RAx*mtprot_constr_partit raxmlHPC-8.1.2 -s mtprotein.phy -n mtprot_constr_partit -m PROTCATMTREV -q mtprotein_partitions -c 12 -p 12345 -g $contraints -o Hemiptera_Acyrthosiphon_pisum # bootstrap , without parttioin -q mtprotein_partitions .... is unexpectidly slow, even just the final search rm RAx*mtprot_bs_constr mpirun -n 2 raxmlHPC-7.2.8-ALPHA-MPI -s mtprotein.phy -n mtprot_bs_constr -m PROTCATMTREV -c 12 -f a -x 12345 -p 12345 -# 100 -g $contraints -o Hemiptera_Acyrthosiphon_pisum # gblocked. partition file wont work with this rm mtprotein.ord-gb* Gblocks_0.91b mtprotein.ord -t=p # -t=p means type == protein; -b2=15 -b1=12 -b3=8 -b4=10 -b5=h perl ~/usr_scripts/format_conversion.pl mtprotein.ord-gb mtprotein.ord-gb.phy fasta phylip rm RAx*mtprot_constr.ord.GB raxmlHPC-8.1.2 -s mtprotein.ord-gb.phy -n mtprot_constr.ord.GB -m PROTCATMTREV -c 12 -p 12345 -g $contraints -o Hemiptera_Acyrthosiphon_pisum # already has order names. but you could put in family names also # $which_rank = 1 # RAxML_bestTree.mtprot_constr.6_tax_rm RAxML_bestTree.mtprot_constr_partit RAxML_bestTree.mtprot_constr.ord.GB ....mtprot_bs_constr besttree=RAxML_bestTree.mtprot_constr.ord.GB perl ~/usr_scripts/insert_family_name_into_newick.pl $besttree key_Oct2013_Arthropoda java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* # COL: rm Priasilpha_obscura, Chaetosoma_scaritides, Trachypachus_holmbergi(adephaga) DIP: Ceratitis_capitata (Brachycera) nested within the Anopheles (Nematocera), with notable large branch. LEP: Antheraea_pernyi, Troides_aeacus. (Obtectomera) nested within the Tortricidae clade. # gblocked tree is quite different from others. # partitioned and non-partitioned are very similar expect for some deep level DIP relatinshpis # TEST, tried to improve mtgenome tree, basal coleoptera have mistakes. # manually make file with colopetera and strep only # take from mtprotein.phy # constraints dont apply for mtgenome analysis within order. # basic and partitioned, gblocked tried , all had same error, so ust removed the offending taxa??? # *** END OF MITO GENOMES *** ################################################################################################################ # *** BARCODING *** ################################################################################################################ # NOV 2015 # define species dense markers using CZ-2014 cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ # silly problem , this db has lowercase dna sequence, which confuses some software later (prank im looking at you) perl ~/usr_scripts/capitalize_uppercase_dna_seqs.pl inv.parsed.ng.rr # sample database, from this sample we will infer a set of species dense homologs taxkey=/home/douglas/scripted_analyses/how_many_insects/corrections/corrections2/results_files/insecta.taxkey # $sample_this_db; $number_per_subfamily; $taxonomic_or_random_sampling; $outfile; $keyfile rm sample_db_OUT* perl ~/usr_scripts/sample_db.pl inv.parsed.ng.rr.uc 4 1 sample_db_OUT $taxkey # group the sample into homologs, by blast all against all followed by marcov clustering # 2 per = 1091 printed; 1 per = :549 makeblastdb-2.2.28+-64bit -in sample_db_OUT -dbtype nucl -parse_seqids blastn-2.2.28+-64bit -task blastn -db sample_db_OUT -query sample_db_OUT -out sample_db_OUT.AAAblastout -dust no -strand both -evalue 1e-10 -max_target_seqs 900000 -outfmt '6 qseqid sseqid qlen slen length pident' perl ~/usr_scripts/blastout_to_mcl.pl sample_db_OUT.AAAblastout mcxload -abc sample_db_OUT.AAAblastout.mcl_in --stream-mirror -o sample_db_OUT.AAAblastout.mcl_in_binary --write-binary -write-tab seq.dict mcl sample_db_OUT.AAAblastout.mcl_in_binary -o sample_db_OUT.AAAblastout.mcl_out -scheme 1 -I 1.4 mcxdump -imx sample_db_OUT.AAAblastout.mcl_out -o sample_db_OUT.AAAblastout.mcl_clusters -tabr seq.dict --dump-rlines --no-values perl ~/usr_scripts/extract_mcl_clusters.pl sample_db_OUT sample_db_OUT.AAAblastout.mcl_clusters # get the names of each sequence # do this for genes 2:10 or whatever rm FGN_screenout numnbers=(2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20) for i in ${!numnbers[*]} do current_file=${numnbers[$i]} echo current file $i is $current_file perl ~/usr_scripts/fetch_gene_names.pl sample_db_OUT.AAAblastout.mcl_clusters.cl$current_file 800 >> FGN_screenout done # manually remove differentky named sequesnece, these are identified in FGN_screenout, # then rename files with the gene names # also, you need to show hidden files in the folder or it will try align: ...sters.cl2~.clo # using these:12S 16S 18S 28S CYTB EF1a RP2 wnt # ALIGN SPECIES DENSE PROFILE SEQS # needs more careful alignment than loop with mafft # check alignments with bioedit and trim. # 18S: # 18s looks crap , use blastlaign check if they are actually homologs. # seems a few are not, maybe due to a couple of steps of partial overlap in similarity clustering /usr/local/bin/mafft --localpair --maxiterate 100 sample_db_OUT.AAAblastout.mcl_clusters.18S > sample_db_OUT.AAAblastout.mcl_clusters.18S.mafft clustalo-1.1.0 --infile=sample_db_OUT.AAAblastout.mcl_clusters.18S --outfile=sample_db_OUT.AAAblastout.mcl_clusters.18S.clo --outfmt=fa --threads=2 --force --log=clustalo_logfile fsa --fast sample_db_OUT.AAAblastout.mcl_clusters.18S > sample_db_OUT.AAAblastout.mcl_clusters.18S.fsa perl ~/usr_scripts/change_tobycode.pl sample_db_OUT.AAAblastout.mcl_clusters.18S perl ~/software/blastalign/BlastAlign2 -i sample_db_OUT.AAAblastout.mcl_clusters.18S.recoded -s 500 perl ~/usr_scripts/format_conversion.pl sample_db_OUT.AAAblastout.mcl_clusters.18S.recoded.phy sample_db_OUT.AAAblastout.mcl_clusters.18S.recoded.fas phylip fasta perl ~/usr_scripts/insert_tobycode_to_fastafile.pl sample_db_OUT.AAAblastout.mcl_clusters.18S.recoded.fas sample_db_OUT.AAAblastout.mcl_clusters.18S.recode_key sample_db_OUT.AAAblastout.mcl_clusters.18S.blastalign # tried a few alignments and checked them, parnk is preferble # prank guide tree step is very slow, use raxml tree from clo alignment as the guide tree: perl ~/usr_scripts/format_conversion.pl sample_db_OUT.AAAblastout.mcl_clusters.18S.clo sample_db_OUT.AAAblastout.mcl_clusters.18S.clo.phy fasta phylip # wrong command, you only need RAxML_result, not thorough optimization raxmlHPC-7.2.8-ALPHA -D -e 1.0 -c 4 -s sample_db_OUT.AAAblastout.mcl_clusters.18S.clo.phy -n sample_db_OUT.AAAblastout.mcl_clusters.18S.clo_tree -m GTRCAT -p 12345 prank -d=sample_db_OUT.AAAblastout.mcl_clusters.18S -o=sample_db_OUT.AAAblastout.mcl_clusters.18S.prank -t=RAxML_result.sample_db_OUT.AAAblastout.mcl_clusters.18S.clo_tree -DNA # the un tax added file above,. need to manually trimmed and saved as: sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed # need to remove some overwhelmingly gappy columns: #$trim_terminal_gaps_only = 0;$column_removal_cutoff(0.98 perl ~/usr_scripts/alignment_processing/rm_alignment_columns.pl sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed # 28S: clustalo-1.1.0 --infile=sample_db_OUT.AAAblastout.mcl_clusters.28S --outfile=sample_db_OUT.AAAblastout.mcl_clusters.28S.clo --outfmt=fa --threads=2 --force --log=clustalo_logfile /usr/local/bin/mafft --localpair --maxiterate 100 sample_db_OUT.AAAblastout.mcl_clusters.28S > sample_db_OUT.AAAblastout.mcl_clusters.28S.mafft fsa --fast sample_db_OUT.AAAblastout.mcl_clusters.28S > sample_db_OUT.AAAblastout.mcl_clusters.28S.fsa perl ~/usr_scripts/change_tobycode.pl sample_db_OUT.AAAblastout.mcl_clusters.28S perl ~/software/blastalign/BlastAlign2 -i sample_db_OUT.AAAblastout.mcl_clusters.28S.recoded -s 500 perl ~/usr_scripts/format_conversion.pl sample_db_OUT.AAAblastout.mcl_clusters.28S.recoded.phy sample_db_OUT.AAAblastout.mcl_clusters.28S.recoded.fas phylip fasta perl ~/usr_scripts/insert_tobycode_to_fastafile.pl sample_db_OUT.AAAblastout.mcl_clusters.28S.recoded.fas sample_db_OUT.AAAblastout.mcl_clusters.28S.recode_key sample_db_OUT.AAAblastout.mcl_clusters.28S.blastalign #$trim_terminal_gaps_only = 0;$column_removal_cutoff(0.98 perl ~/usr_scripts/alignment_processing/rm_alignment_columns.pl sample_db_OUT.AAAblastout.mcl_clusters.28S.blastalign perl ~/usr_scripts/format_conversion.pl sample_db_OUT.AAAblastout.mcl_clusters.28S.clo sample_db_OUT.AAAblastout.mcl_clusters.28S.clo.phy fasta phylip #raxmlHPC-7.2.8-ALPHA -D -e 1.0 -c 4 -s sample_db_OUT.AAAblastout.mcl_clusters.28S.clo.phy -n sample_db_OUT.AAAblastout.mcl_clusters.28S.clo_tree -m GTRCAT -p 12345 raxmlHPC-8.1.2 -F -s sample_db_OUT.AAAblastout.mcl_clusters.28S.clo.phy -n sample_db_OUT.AAAblastout.mcl_clusters.28S.clo_tree -m GTRCAT -c 4 -p 12345 prank -d=sample_db_OUT.AAAblastout.mcl_clusters.28S -o=sample_db_OUT.AAAblastout.mcl_clusters.28S.prank -t=RAxML_result.sample_db_OUT.AAAblastout.mcl_clusters.28S.clo_tree -DNA # crashes # 16S: clustalo-1.1.0 --infile=sample_db_OUT.AAAblastout.mcl_clusters.16S --outfile=sample_db_OUT.AAAblastout.mcl_clusters.16S.clo --outfmt=fa --threads=2 --force --log=clustalo_logfile /usr/local/bin/mafft --localpair --maxiterate 100 sample_db_OUT.AAAblastout.mcl_clusters.16S > sample_db_OUT.AAAblastout.mcl_clusters.16S.mafft perl ~/usr_scripts/format_conversion.pl sample_db_OUT.AAAblastout.mcl_clusters.16S.clo sample_db_OUT.AAAblastout.mcl_clusters.16S.clo.phy fasta phylip raxmlHPC-7.2.8-ALPHA -D -e 1.0 -c 4 -s sample_db_OUT.AAAblastout.mcl_clusters.16S.clo.phy -n sample_db_OUT.AAAblastout.mcl_clusters.16S.clo_tree -m GTRCAT -p 12345 prank -d=sample_db_OUT.AAAblastout.mcl_clusters.16S -o=sample_db_OUT.AAAblastout.mcl_clusters.16S.prank -t=RAxML_result.sample_db_OUT.AAAblastout.mcl_clusters.16S.clo_tree -DNA # also crashes. # 12S clustalo-1.1.0 --infile=sample_db_OUT.AAAblastout.mcl_clusters.12S --outfile=sample_db_OUT.AAAblastout.mcl_clusters.12S.clo --outfmt=fa --threads=2 --force --log=clustalo_logfile /usr/local/bin/mafft --localpair --maxiterate 100 sample_db_OUT.AAAblastout.mcl_clusters.12S > sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft prank -d=sample_db_OUT.AAAblastout.mcl_clusters.16S -o=sample_db_OUT.AAAblastout.mcl_clusters.16S.prank -t=RAxML_result.sample_db_OUT.AAAblastout.mcl_clusters.16S.clo_tree -DNA # CYTB clustalo-1.1.0 --infile=sample_db_OUT.AAAblastout.mcl_clusters.CYTB --outfile=sample_db_OUT.AAAblastout.mcl_clusters.CYTB.clo --outfmt=fa --threads=2 --force --log=clustalo_logfile /usr/local/bin/mafft --localpair --maxiterate 100 sample_db_OUT.AAAblastout.mcl_clusters.CYTB > sample_db_OUT.AAAblastout.mcl_clusters.CYTB.mafft # EF1a clustalo-1.1.0 --infile=sample_db_OUT.AAAblastout.mcl_clusters.EF1a --outfile=sample_db_OUT.AAAblastout.mcl_clusters.EF1a.clo --outfmt=fa --threads=2 --force --log=clustalo_logfile /usr/local/bin/mafft --localpair --maxiterate 100 sample_db_OUT.AAAblastout.mcl_clusters.EF1a > sample_db_OUT.AAAblastout.mcl_clusters.EF1a.mafft # RP2 clustalo-1.1.0 --infile=sample_db_OUT.AAAblastout.mcl_clusters.RP2 --outfile=sample_db_OUT.AAAblastout.mcl_clusters.RP2.clo --outfmt=fa --threads=2 --force --log=clustalo_logfile /usr/local/bin/mafft --localpair --maxiterate 100 sample_db_OUT.AAAblastout.mcl_clusters.RP2 > sample_db_OUT.AAAblastout.mcl_clusters.RP2.mafft # wnt clustalo-1.1.0 --infile=sample_db_OUT.AAAblastout.mcl_clusters.wnt --outfile=sample_db_OUT.AAAblastout.mcl_clusters.wnt.clo --outfmt=fa --threads=2 --force --log=clustalo_logfile /usr/local/bin/mafft --localpair --maxiterate 100 sample_db_OUT.AAAblastout.mcl_clusters.wnt > sample_db_OUT.AAAblastout.mcl_clusters.wnt.mafft #28s bad; 16s ok; 12s mafft ok; cytb mafft good; ef1a mafft good; rp2 mafft good; wnt mafft good # ok, aligmment at this level is very awkward .... # save time, use the 3 i have already, and add the extra 6 ... # sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm # sample_db_OUT.AAAblastout.mcl_clusters.16S.mafft.colrm # sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft # sample_db_OUT.AAAblastout.mcl_clusters.EF1a.mafft # sample_db_OUT.AAAblastout.mcl_clusters.RP2.mafft # sample_db_OUT.AAAblastout.mcl_clusters.wnt.mafft #$trim_terminal_gaps_only = 0;$column_removal_cutoff(0.98 perl ~/usr_scripts/alignment_processing/rm_alignment_columns.pl sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft all processed and checked: sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.16S.mafft.colrm.processed sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas sample_db_OUT.AAAblastout.mcl_clusters.EF1a.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.RP2.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.wnt.mafft.processed 12S 16S 18S EF1a RP2 wnt # got to here in specie dense, NOV 2015 ################################################################################################################ #cd /home/douglas/scripted_analyses/COIevol/database/ cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ # cannot login with username # feb 2015 release # 10/20/2015 release # actually seems to be: October 15 2015; NCBI-GenBank Flat File Release 210.0 rm gbinv*.seq.gz # better not to do:wget ftp://ftp.ncbi.nih.gov/genbank/gbinv*.seq.gz, # in case downloading is interupted, easier to restart as below for i in {1..132};do echo "downloading file $i"; wget ftp://ftp.ncbi.nih.gov/genbank/gbinv$i.seq.gz done # unzip, retain names.dmp, nodes.dmp, others can be deleted tar xvzf taxdump.tar.gz #$parse_protein = 0;$output_ID_format = 1; @genbank_divisions "gbinv # 5 hours running time for insecta! might need to re-write this script at some point. rm inv perl ~/usr_scripts/create_fasta_database_from_genbank_flatfiles.pl # Endopterygota = 33392, insects=50557, bilataria=33213; Hexapoda:6960 #$parse_species_only = 1; perl ~/usr_scripts/parse_ncbi_tax_database.pl 50557 # run as to get species named only #$parse_binomial_labelled_only =1;#$id_format = 3;3= full species name, underscore, accession # run this on insects! dont choose higher rank or doing too much # takes 30 minutes ... rm inv.parsed perl ~/usr_scripts/parse_taxon_from_fastafile.pl inv key_Oct2013_Insecta rm inv # huge file # run twice, lower limit 200, upper 32000. # primate cytb are about 410 bp. one of the 28s profile is short:210. rm inv.parsed.ng inv.parsed.ng.rr perl ~/usr_scripts/preprocess_fasta_database.pl inv.parsed rm inv.parsed.genome inv.parsed.ng inv.parsed.ng.sorted inv.parsed.ng.sorted.nr inv.parsed.ng.uclust_log # make COI profile: # previous profile i had was a sparse one, their paper has a very dense insect profile availalbe, # described as profile 2 (appendix B) in the paper, including eight of the largest orders of insects. # they also include 'test taxa'! these should be ommited when making the tree .... # seems all strepsiptera have 1 or 2 AA deletion in COI # endop PROFILE sequences were taken from the hebert insect profile # and attitoinal barcodes were download for represettatives of orders not included. # align them. (just once) ~ 4 deleted for being a bit short, or bad quality cd /home/douglas/usr_files/ # mafft uses too much memory /usr/local/bin/mafft --localpair --maxiterate 100 ENDOcdP_COIprofile.unaligned > ENDOP_COIprofile.mafft # works fine: muscle -in ENDOP_COIprofile.unaligned -out ENDOP_COIprofile.unaligned.muscle -maxiters 2 -diags1 -sv -maxmb 800 -gapopen -800 -gapextend -40 cp /home/douglas/usr_files/ENDOP_COIprofile.muscle /home/douglas/scripted_analyses/COIevol/analysis/ENDOP_COIprofile.muscle # too much memory demand and time using the above coi profile, reduce .... # should be 6 seqs per major order, 3 per minor order. cd /home/douglas/scripted_analyses/COIevol/analysis/ /home/douglas/scripted_analyses/COIevol/analysis/ENDOP_COIprofile.muscle.reduced #perl ~/usr_scripts/format_conversion.pl ENDOP_COIprofile.muscle ENDOP_COIprofile.muscle.phy fasta phylip # **** NOV 2015 **** cd /home/douglas/usr_files/ all insecta prfiel sequences, 100: /home/douglas/usr_files/hebert2003_InsectaProfileOnly muscle -in hebert2003_InsectaProfileOnly -out H03InsProf.muscle -maxiters 2 -diags1 -sv -maxmb 800 -gapopen -800 -gapextend -40 # 100 profile (only) seqs opened in bioedit, # trimmed to 624 bp, ~5 rm, 2 badly aligned, 1 short, 2 with single base deletion. # Blast COI, computationally demanding. #cd /home/douglas/scripted_analyses/COIevol/analysis/ cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ # takes too much memory, split into 3 db's using makeblastdb # although not too slow, ~1 hour for all 3 rm inv.parsed.ng.rr.* makeblastdb-2.2.28+-64bit -in inv.parsed.ng.rr -dbtype nucl -parse_seqids -max_file_sz 60000000 # -perc_identity 50 might be too liberal. # if you blast an endop order with coi from different endop order, it will be about PI=80 coi_profile=H03InsProf.muscle.fas # old endop one:ENDOP_COIprofile.muscle rm invCOI.blastout invCOI.blastout00 invCOI.blastout01 invCOI.blastout02 blastn-2.2.28+-64bit -task blastn -db inv.parsed.ng.rr.00 -query $coi_profile -out invCOI.blastout00 -word_size 10 -perc_identity 60 -dust no -strand both -evalue 1e-10 -num_threads 1 -max_target_seqs 200000 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' blastn-2.2.28+-64bit -task blastn -db inv.parsed.ng.rr.01 -query $coi_profile -out invCOI.blastout01 -word_size 10 -perc_identity 60 -dust no -strand both -evalue 1e-10 -num_threads 1 -max_target_seqs 200000 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' blastn-2.2.28+-64bit -task blastn -db inv.parsed.ng.rr.02 -query $coi_profile -out invCOI.blastout02 -word_size 10 -perc_identity 60 -dust no -strand both -evalue 1e-10 -num_threads 1 -max_target_seqs 200000 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' rm invCOI.blastout cat invCOI.blastout00 invCOI.blastout01 invCOI.blastout02 >invCOI.blastout # parse blast output, takes over 2 hours # $extract_using_gi = 0; $dbtype = "nucl" rm invCOI.blastout.retreived perl ~/usr_scripts/parse_hits.pl inv.parsed.ng.rr invCOI.blastout 60 1 500 2 blastdbcmd-2.2.28+-64bit # NOV15:224378 retreived ############################### # SKIP this: # remove heberts 'test' sequences # do this before species filtering, if there conspecfics then these need keeping for test #/home/douglas/scripted_analyses/COIevol/files/hebert_endop_test_gi_numbers #$remove_or_retain = 1 # hcak these lines. #my $current_idCOPY = $current_id; #$current_idCOPY =~ s/.+_(\d+)$/$1/; #if($rm_these{$current_idCOPY} == 1) {$rm_these{$current_id}=1}; # only 25 . hardly seems worth it., plus, it wouldnt suprise me if one i removed turned out to be in the mtgenome tree. #perl /home/douglas/usr_scripts/remove_fasta_entries.pl invCOI.blastout.retreived /home/douglas/scripted_analyses/COIevol/files/hebert_endop_test_gi_numbers ############################### # NOV15: still using the existing CYTB and 28s profiles, to save time # align CYTB profile and Blast. a lot quicker. # primate cytb are about 410 bp. rm cytb.blastout #/usr/local/bin/mafft --localpair --maxiterate 100 primate_diet_cytb > primate_diet_cytb.mafft cytb_queries=/home/douglas/scripted_analyses/COIevol/analysis/primate_diet_cytb.mafft blastn-2.2.28+-64bit -task blastn -db inv.parsed.ng.rr -query $cytb_queries -out cytb.blastout -word_size 10 -perc_identity 40 -dust no -strand both -evalue 1e-6 -num_threads 1 -max_target_seqs 200000 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' rm cytb.blastout.retreived perl ~/usr_scripts/parse_hits.pl inv.parsed.ng.rr cytb.blastout 60 1 300 2 blastdbcmd-2.2.28+-64bit # remove the cytbs from the mammal diet study... they are only labeled arthropod environ sample, so shouldnt be in ################ # 28s profile # to make this i used the 28s from QYs XSBN bees, just the first sequence in the file (1125930_S-0717C207.D2-3549F), # then did manual blast of each endopterygota order, and downloaded one sequence each. # one of the 28s profile is short:210. #/usr/local/bin/mafft --localpair --maxiterate 100 /home/douglas/scripted_analyses/COIevol/files/insect28sProfile_unaligned > 28s_InsectsProfile.mafft rm 28S.blastout 28S.blastout.retreived cp /home/douglas/scripted_analyses/COIevol/analysis/28s_InsectsProfile.mafft 28s_queries blastn-2.2.28+-64bit -task blastn -db inv.parsed.ng.rr -query 28s_queries -out 28S.blastout -word_size 10 -perc_identity 40 -dust no -strand both -evalue 1e-6 -num_threads 1 -max_target_seqs 200000 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' perl ~/usr_scripts/parse_hits.pl inv.parsed.ng.rr 28S.blastout 60 1 200 2 blastdbcmd-2.2.28+-64bit # i made consensus sequence from the 28s profile, was suspicously short, # this is after great difficulty aligning 28s with the original profile, # so, i will make a new profile, see if there is more luck, # will start with the seqs from old 'strepsiptera problem' paper in SB. #$how_many_to_print = 1000; $case_sensitive = 0 # cant log onto gb at the mo, search local files ... rm *.parsed1 for i in {1..132};do echo "$i"; gunzip gbinv$i.seq.gz perl ~/usr_scripts/fetch_entries_from_genbank_flatfile.pl gbinv$i.seq gbinv$i.parsed1 'strepsiptera problem' [ -s gbinv$i.parsed1 ] || rm gbinv$i.parsed1 gzip gbinv$i.seq done ################## # new markers, #12s=250; 16s=360; 18s=800; ef1a=450; rp2=700; wnt=350 phylo_markers=(sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.16S.mafft.colrm.processed sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas sample_db_OUT.AAAblastout.mcl_clusters.EF1a.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.RP2.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.wnt.mafft.processed) length_cutoffs=(250 360 800 450 700 350) for i in ${!phylo_markers[*]} do current_file=${phylo_markers[$i]};current_cutoff=${length_cutoffs[$i]}; echo file $i is $current_file cutoff $current_cutoff rm $current_file.blastout $current_file.blastout.retreived blastn-2.2.28+-64bit -task blastn -db inv.parsed.ng.rr -query $current_file -out $current_file.blastout -word_size 10 -perc_identity 40 -dust no -strand both -evalue 1e-6 -num_threads 1 -max_target_seqs 200000 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' perl ~/usr_scripts/parse_hits.pl inv.parsed.ng.rr $current_file.blastout 60 1 $current_cutoff 2 blastdbcmd-2.2.28+-64bit done ################## # species filtering #$identified_species_only = 1; $trim_off_accession = 0; rm *.blastout.retreived.ID_filtered all_markers=(invCOI 28S cytb sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.16S.mafft.colrm.processed sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas sample_db_OUT.AAAblastout.mcl_clusters.EF1a.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.RP2.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.wnt.mafft.processed) all_markers=(sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas) for i in ${!all_markers[*]};do current_file=${all_markers[$i]}; echo file $i is $current_file; perl ~/usr_scripts/species_filter.pl $current_file.blastout.retreived 1; done # 123465 seqs in your input file. #33164 printed_to_output # 11mar: # coi,155457 seqs, 38819 printed. cytb: 13360 seqs, 5253 printed. 28s: 14510 seqs, 10904 printed # NOV15: # BLast parse log?: # invCOI,223846,49791; cytb,6682067,18782; 28s,210975,17802; # 12s,419016,10994 ; 16s,6070012,26945 ; 18s,5672793,9768 # ef1a,3287285,17247 ; rp2,76410,1349 ; wnt,556009,11447 # species filter log: # invCOI,223846,49791; 28S,17803,13401 ; cytb, 18783,7268; 12S,10995,7511, # 16S, 26946, 16319 ; 18S,9769, 8195 ; EF1a, 17248, 11013 ; RP2, 1350,950 ; wnt, 11448, 7953 ; # leaf labels in their tree are non-standrard, # so get genus and species names, and find family name for each. ##perl ~/scripted_analyses/COIevol/scripts/parse_misof_tree.pl /home/douglas/scripted_analyses/COIevol/misof_data/supermatrixB_aa_Threshold-75-ConsensusTree.consensus75_pruned.tre invCOI.blastout.retreived.ID_filtered # before alignment, INSERT OUTGROUPs # actually this probably doesnt need doing for each gene, # in fact, just get a COI, it goes into file invCOI.blastout.retreived.ID_filtered, # blast one of these in this profile file, against genbank # /home/douglas/scripted_analyses/COIevol/corrections/species_dense/H03InsProf.muscle.fas # Collembola; Entomobryomorpha; Isotomoidea; Isotomidae; Anurophorinae; Cryptopygus antarcticus # Diplura; Dicellurata; Japygoidea; Japygidae; Japyginae; Occasjapyx japonicus # Protura; Acerentomata; Acerentomidae; Acerentomon microrhinus gb|GQ215461.1|:45-615 Cryptopygus antarcticus (Collembola) gb|JN990600.1|:1452-2075 Occasjapyx japonicus (Diplura) gb|JQ728012.1|:1933-2550 Acerentomon microrhinus (Protura) >gb|GQ215461.1|:45-615 Cryptopygus antarcticus antarcticus haplotype H62 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial CATTTAGAATATTAATCCGCTTAGAGTTAGGACAACCTGGGTCATTTATTGGGGACGACCAAATTTACAACGTAATAGTGACCGCGCATGCTTTCGTAATAATTTTCTTCATAGTTATGCCAATTATAATTGGGGGGTTTGGTAATTGACTAATCCCTTTAATAATTGGAGCCCCCGATATAGCTTTCCCCCGAATAAATAATATAAGTTTTTGACTTCTACCTCCATCTCTTATTTTATTATTATCAGGAGGGTTAGTTGAAAGAGGGGCCGGGACTGGATGGACAGTTTACCCCCCTCTTTCTGCCGGTATTGCTCACGCTGGGGCGTCAGTAGATCTTTCAATTTTTAGTCTACATCTAGCTGGGGCGTCTTCAATTCTAGGTGCCGTAAATTTTATTACTACTATTATTAATATACGTTCGTCTGGTATAACATGAGATCGCACACCATTATTTGTATGGTCTGTATTTTTAACCGCAATTTTACTTTTACTTTCTCTCCCAGTATTAGCAGGGGCTATTACTATACTTCTCACCGATCGTAACTTAAATACATCTTTTTTCGACCC >gb|JN990600.1|:1452-2075 Occasjapyx japonicus mitochondrion, complete genome GGGGCCTGATCTGCCATATTAGGAACTGCCCTTAGAATGCTAATCCGGGCCGAACTAGGTCAACCGGGAAGTCTCATTGGAGATGACCAAATCTACAACGTAATTGTAACAGCCCACGCCTTCATCATAATTTTCTTTATGGTAATGCCCATTATAATTGGAGGGTTCGGAAATTGACTTGTTCCACTAATGTTAGGTGCCCCCGATATGGCCTTCCCACGGCTAAACAACATAAGATTCTGACTCCTTCCCCCCTCTCTAACCCTGCTCCTAGCGGGGAGTGCTGTGGAAAATGGAGCCGGAACTGGATGAACAGTCTATCCTCCCCTCGCTTCCAATATTGCCCATGCCGGAGCTTCCGTAGACCTAACCATTTTCTCCCTTCATTTAGCTGGGGCATCCTCAATCCTAGGAGCCATTAACTTCATTACCACTGTTATTAATATACGAACACAGGGAATAACCATAGAACGAATGCCACTATTTGTATGAGCGGTCTTCATTACAGCCTTCCTACTACTATTATCACTTCCCGTACTGGCCGGTGCCATTACAATACTGCTCACAGACCGTAATTTAAACACTTCATTCTTTGATCCAGCAGGAGGGGGAGACCCCATTCTA >gb|JQ728012.1|:1933-2550 Acerentomon microrhinus mitochondrion, complete genome TGATCTGGAATATTAGGTCTTTCCATAAGTATTTTAATTCGGAGAGAGCTTTCTAGCCCTGGTAGTTTAATTAGAGATGATCATATTTTTAATGTTGTGGTTACCTCACATGCGTTAATTATAATTTTTTTTATGGTGATGCCTATTTTAATTGGAGGGTTTGGGAATTGGTTGATCCCTTTAATGCTAGGTAGCCCAGACATAGCCTTCCCTCGGATAAATAATTTAAGATTTTGATTATTGCCTCCATCTTTATTTATACTATTAATAAGAAGAATAGTTGAGTCTGGAGTTGGGGCAGGATGGACCTTGTACCCTCCTCTTTCTGATAAATTAAGTCATTCTGGGGGTTCGGTAGATTTGTCAATTTTTTCTTTACATTTGGCCGGGGCTTCTTCTATTTTGGGGGCTATTAATTTTATTACAACAATTTCTAATATACGAAGATTTAATTTAAAATGATTAAAAATTACATTATTTTCTTGATCAGTTTTTATTACTGCTTTTTTGCTTCTTTTATCTTTACCTGTATTAGCTGGGGCTATTACTATATTGTTAATAGACCGAAATATAAATTCTTCTTTTTTTAACCCAAGAGGTGGGGGGGATCCTGTATTA #clustalo-1.2.0 --infile=invCOI.blastout.retreived.ID_filtered.ord --outfile=invCOI.blastout.retreived.ID_filtered.ord.clo --outfmt=fa --threads=4 --force --log=clustalo_logfile #clustalo-1.1.0 --profile1=hbp.clo --in=invCOI.blastout.retreived.ID_filtered.ord -outfile=clustalout --seqtype=DNA --outfmt=fa --threads=2 --log=clustallogfile # single clarey COI. clustalO converts profile into hmm, # so better to use an alignment rather than single seq. plus you dont want to consider the non-barcode seq. # it adds to computation, memory, possilbility of misalignment. #clustalo-1.1.0 --profile1=coi --in=invCOI.blastout.retreived.ID_filtered.half -outfile=clout --seqtype=DNA --outfmt=fa --threads=2 --log=clustallogfile #clustalo-1.1.0 --profile1=ENDOP_COIprofile.muscle --in=invCOI.blastout.retreived.ID_filtered.half -outfile=clout --seqtype=DNA --outfmt=fa --threads=2 --log=clustallogfile # alignment still sucks. perl ~/usr_scripts/format_conversion.pl clout clout.fs fasta fasta # crashes: tried template hbp.clo, ENDOP_COIprofile.muscle. seems to suggest the aligned candidate is too short relative to new ones. clarey coi # ********************* # go further below, look for box ... # so we go with this one for COI, # its mostly fine, no obvious misalignment, just some sequence jump out the main fragment.... and quick, and low memory pynast --pairwise_alignment_method muscle -i invCOI.blastout.retreived.ID_filtered -t coi --min_pct_id=60 --min_len=500 --fasta_out_fp=py3 # ******************** # takes no extra memory. the 5 prime end a bit of sequence is lost on many taxa, # but i dont care, just trim 20 bp in or something .... as long as no blatent errors (which you get with others) # 2nd method to align reads to reference alignment: # *** took just seconds to run, and absolutly perfect! mar2015: not so fast, actually theres big error. # just swap '.' for '-' in output file # search=kmer,suffix, blast, align=needleman, gotoh. # sina, fails on most sequences. #convert it aligned fasta references to arb format like this: #/home/douglas/software/sina/sina-1.2.11/sina -i ENDOP_COIprofile.muscle -o ENDOP_COIprofile.muscle.arb --prealigned --intype fasta #then align: #/home/douglas/software/sina/sina-1.2.11/sina -i invCOI.blastout.retreived.ID_filtered -o invCOI.blastout.retreived.ID_filtered.sina --ptdb ENDOP_COIprofile.muscle.arb --intype fasta --outtype fasta # align COI # earlier profile (template) was hbp.clo, worked great. but try ENDOP_COIprofile.muscle.reduced # complains if candidates are longer than templates... maybe discards those? not many though. # bigger template has loads of frameshifts. # i notice # try using mothur with blast, will need to install old blast in the specified path. # last option, could write script to detect ,mothur mistakes, take these out and re-align. # perhaps seperatly align each order then profile align together. # this is preffered if doign seperate alignment, than e.g. random samples # The ksize parameter allows you to specify the kmer size for finding most similar template to candidate. The default is 8. # its v quick, kmer can be reduced # The match parameter allows you to specify the bonus for having the same base. The default is 1.0. # The mistmatch parameter allows you to specify the penalty for having different bases. The default is -1.0. # The gapopen parameter allows you to specify the penalty for opening a gap in an alignment. The default is -2.0. # so try giving greater penalty for gap and mismatches. # match=2.0, mismatch=-2.0, gapopen=-4.0, # misalignment results in pident of about 30, where it should be >~80 rm invCOI.blastout.retreived.align mothur "#align.seqs(candidate=invCOI.blastout.retreived.ID_filtered.half, template=ENDOP_COIprofile.muscle, search=kmer, align=needleman, threshold=0.7)" mothur "#align.seqs(candidate=invCOI.blastout.retreived.ID_filtered.half, template=ENDOP_COIprofile.muscle, match=2.0, mismatch=-2.0, gapopen=-4.0, search=kmer, align=needleman, threshold=0.7)" #It took 149 secs to align 38819 sequences. # try screen.seqs next # align CYTB rm cytb.blastout.retreived.align #mothur "#align.seqs(candidate=cytb.blastout.retreived.ID_filtered, template=primate_diet_cytb.mafft, search=kmer, align=needleman, threshold=0.5)" pynast --pairwise_alignment_method muscle -i cytb.blastout.retreived.ID_filtered -t primate_diet_cytb.mafft --min_pct_id=60 --min_len=300 --fasta_out_fp=cytb.blastout.retreived.align # above crashes, use longer ref alignment. # ************ clustalo-1.1.0 --profile1=primate_diet_cytb.mafft --in=cytb.blastout.retreived.ID_filtered -outfile=cytb.clo --seqtype=DNA --outfmt=fa --threads=2 --log=clustallogfile # ************ # clustalo alignment left side is good, right side is quite varialbe and unalignment at this scale, ok within genera, # so for the purpose of barcoding and tip level phylogenetic information, should be ok. #It took 10 secs to align 5253 sequences. # align 28S #It took 27 secs to align 10904 sequences. ############################################ # ALIGNMENT HERE! #COI mafft is ludicrously gappy #COI 'align' software contains many single 'complete-frameshifted' seqs pynast --pairwise_alignment_method muscle -i invCOI.blastout.retreived.ID_filtered -t /home/douglas/scripted_analyses/COIevol/analysis/coi --min_pct_id=60 --min_len=500 --fasta_out_fp=insectCOI.pynast perl ~/usr_scripts/format_conversion.pl insectCOI.pynast insectCOI.pynast2 fasta fasta # CYTB # cytb, some nice segments, some error segments, might be suitable for gblocks # CLomega new version, alignment is ok, apart from 3prime end clustalo-1.2.0 --profile1=/home/douglas/scripted_analyses/COIevol/analysis/primate_diet_cytb.mafft --in=cytb.blastout.retreived.ID_filtered -outfile=insectCYTB.clo --guidetree-out=insectCYTB.CLO_guidetree --seqtype=DNA --outfmt=fa --threads=4 --log=clustallogfile # 28S: # 28 S clustal O many micro-misalignments even beween congeners # clO new version, awful in parts. clustalo-1.2.0 --profile1=/home/douglas/scripted_analyses/COIevol/analysis/28s_InsectsProfile.mafft --in=28S.blastout.retreived.ID_filtered -outfile=insect28S.clo --guidetree-out=insect28S.CLO_guidetree --seqtype=DNA --outfmt=fa --threads=4 --log=clustallogfile /home/douglas/software/sina/sina-1.2.11/sina -i /home/douglas/scripted_analyses/COIevol/analysis/28s_InsectsProfile.mafft -o 28s_InsectsProfile.mafft.arb --prealigned --intype fasta /home/douglas/software/sina/sina-1.2.11/sina -i 28S.blastout.retreived.ID_filtered -o 28S.blastout.retreived.ID_filtered.sina --ptdb 28s_InsectsProfile.mafft.arb --intype fasta --outtype fasta rm 28S.blastout.retreived.align mothur "#align.seqs(candidate=28S.blastout.retreived.ID_filtered, template=/home/douglas/scripted_analyses/COIevol/analysis/28s_InsectsProfile.mafft, search=kmer, align=needleman, threshold=0.7)" # pynast crashes pynast --pairwise_alignment_method muscle -i 28S.blastout.retreived.ID_filtered -t /home/douglas/scripted_analyses/COIevol/analysis/28s_InsectsProfile.mafft --min_pct_id=70 --min_len=300 --fasta_out_fp=28S.blastout.retreived.ID_filtered.pynast # very fast:--retree 1 --maxiterate 0 --nofft --parttree #/usr/local/bin/mafft --retree 1 --maxiterate 0 --nofft --parttree --ep 0.123 28S.blastout.retreived.ID_filtered > 28S.blastout.retreived.ID_filtered.mafft # iternative refinment rejected for this many seqs (--maxiterate 10 ) # FFT-NS-2 (--retree 2), recompute after rough alignment, 2step progressive alignm,ents: rm 28S.blastout.retreived.ID_filtered.mafft /usr/local/bin/mafft --retree 2 --legacygappenalty 28S.blastout.retreived.ID_filtered > 28S.blastout.retreived.ID_filtered.mafft perl ~/usr_scripts/alignment_processing/rm_alignment_columns.pl 28S.blastout.retreived.ID_filtered.mafft # rubbish. actually looks more like two different alignments in same file perl ~/usr_scripts/consensus_sequence_fasta.pl /home/douglas/scripted_analyses/COIevol/analysis/28s_InsectsProfile.mafft 28s_InsectsProfile.mafft.consensus perl ~/usr_scripts/change_tobycode.pl 28S.blastout.retreived.ID_filtered cat 28S.blastout.retreived.ID_filtered.recoded 28s_InsectsProfile.mafft.consensus > blastalign_in perl ~/software/blastalign/BlastAlign2 -i blastalign_in -r consensus perl ~/usr_scripts/format_conversion.pl 28S.blastout.retreived.ID_filtered.recoded.phy 28S.blastout.retreived.ID_filtered.recoded.fas phylip fasta perl ~/usr_scripts/insert_tobycode_to_fastafile.pl 28S.blastout.retreived.ID_filtered.recoded.fas 28S.blastout.retreived.ID_filtered.recode_key 28S.blastout.retreived.ID_filtered.blastalign # curiously, blastalign from MRS or consensus, is indistinguishable # other marker with cl Om # 12s not so bad, random aligmments usually fall between taxonomic groups # 12s ATsuperrich, ok within group, random alidnged between groups # 16s not so bad, although very quick look # ef1a seems ok after quick look # wnt seems ok within groups new_markers=(sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.16S.mafft.colrm.processed sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas sample_db_OUT.AAAblastout.mcl_clusters.EF1a.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.RP2.mafft.processed sample_db_OUT.AAAblastout.mcl_clusters.wnt.mafft.processed) for i in ${!new_markers[*]};do current_file=${new_markers[$i]}; echo file $i is $current_file; clustalo-1.2.0 --profile1=$current_file --in=$current_file.blastout.retreived.ID_filtered -outfile=$current_file.blastout.retreived.ID_filtered.clo --guidetree-out=$current_file.CLO_guidetree --seqtype=DNA --outfmt=fa --threads=4 --log=clustallogfile done # 18s, try mafft # 18s, some bad erros in there, ever between congeners # 18s mafft col rm is passable, no horrible misalignments found. rm sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas.blastout.retreived.ID_filtered.mafft /usr/local/bin/mafft --retree 2 --legacygappenalty sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas.blastout.retreived.ID_filtered > sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas.blastout.retreived.ID_filtered.mafft # $column_removal_cutoff(0.99) perl ~/usr_scripts/alignment_processing/rm_alignment_columns.pl sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas.blastout.retreived.ID_filtered.mafft # passed! # try several on COI: rm invCOI.blastout.retreived.ID_filtered.CLO_out invCOI.blastout.retreived.ID_filtered.mafft invCOI.blastout.retreived.align clustalo-1.2.0 --profile1=H03InsProf.muscle.fas --in=invCOI.blastout.retreived.ID_filtered -outfile=invCOI.blastout.retreived.ID_filtered.CLO_out --seqtype=DNA --outfmt=fa --threads=4 --guidetree-out=invCOI.blastout.retreived.ID_filtered.CLO_guidetree --log=insCOI.CLO_logfile mothur "#align.seqs(candidate=invCOI.blastout.retreived.ID_filtered, template=H03InsProf.muscle.fas, match=2.0, mismatch=-2.0, gapopen=-4.0, search=kmer, align=needleman, threshold=0.7)" /usr/local/bin/mafft --retree 1 --maxiterate 0 --nofft --parttree --ep 0.123 invCOI.blastout.retreived.ID_filtered > invCOI.blastout.retreived.ID_filtered.mafft # this is just to assist checking alignment # $filter_level = 1; $fasta_ID_format =2 new_alignments=(insectCYTB.clo insect28S.clo sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft.processed.blastout.retreived.ID_filtered.clo sample_db_OUT.AAAblastout.mcl_clusters.16S.mafft.colrm.processed.blastout.retreived.ID_filtered.clo sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas.blastout.retreived.ID_filtered.clo sample_db_OUT.AAAblastout.mcl_clusters.EF1a.mafft.processed.blastout.retreived.ID_filtered.clo sample_db_OUT.AAAblastout.mcl_clusters.wnt.mafft.processed.blastout.retreived.ID_filtered.clo invCOI.blastout.retreived.ID_filtered.CLO_out invCOI.blastout.retreived.align invCOI.blastout.retreived.ID_filtered.mafft) perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile 28S.blastout.retreived.ID_filtered.consensus_blastalign -output 28S.blastout.retreived.ID_filtered.consensus_blastalign.taxlabs -node 6960 -fasta # $infile; $reference_codon_alignment;$muscle_profile_input;$codon_to_keep perl /home/douglas/usr_scripts/assign_codon_and_domain_positions.pl insectCOI.pynast2 /home/douglas/usr_files/coi_complete_codon_alignment output 123 # I2 Domain of COI. cp sample_db_OUT.AAAblastout.mcl_clusters.18S.prank.1.fas.trimmed.colrm.fas.blastout.retreived.ID_filtered.mafft insect18S.mafft cp sample_db_OUT.AAAblastout.mcl_clusters.12S.mafft.processed.blastout.retreived.ID_filtered.clo insect12S.clo cp sample_db_OUT.AAAblastout.mcl_clusters.16S.mafft.colrm.processed.blastout.retreived.ID_filtered.clo insect16S.clo cp sample_db_OUT.AAAblastout.mcl_clusters.EF1a.mafft.processed.blastout.retreived.ID_filtered.clo insectEF1a.clo cp sample_db_OUT.AAAblastout.mcl_clusters.wnt.mafft.processed.blastout.retreived.ID_filtered.clo insectWNT.clo cp 28S.blastout.retreived.ID_filtered.consensus_blastalign insect28S.blastalign # clustal omega profile aligmment, puts the profile sequences in the bottom of the file in lowercase, rm these! # RESULTS: alignments=(insectCOI.pynast2 insectCYTB.clo insect28S.blastalign insect12S.clo insect16S.clo insect18S.mafft insectEF1a.clo insectWNT.clo) rm *.taxlabs for i in ${!alignments[*]};do current_file=${alignments[$i]}; echo file $i is $current_file; perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile $current_file -output $current_file.taxlabs -node 6960 -fasta done cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ ############################################ # proifle members are in the clustalO outputs, you need to ** DELETE THESE ** manually. # problem seqs (v large bl): Fannia_pusio, Psilochaeta_pampiana # Hylaeus_dilatatus missing from hymenoptera seq file. # ....B, 11Mar:more careful building of barcode dataset to try get overlap with mtgenomes. rm current_supermatrix2 partitionfile2 endop3geneD endop3geneD.partitionfile # remove identical seqs before concatenation, # pain in the ass to return to coi later after the supermatrix was modified. # removing identical seqs. these are problematic during evolutionary placemnet. perl ~/usr_scripts/format_conversion.pl insectCOI.pynast2.fas insectCOI.pynast2.fas.phy fasta phylip raxmlHPC-8.1.2 -f c -s insectCOI.pynast2.fas.phy -m GTRCAT -n insectCOI.Rax1 perl ~/usr_scripts/format_conversion.pl insectCOI.pynast2.fas.phy.reduced insectCOI.pynast2.reducd phylip fasta #$remove_accession = 1;$missing_data_char = "?";# $required_data = "Y???????"; rm charsetfile2 current_supermatrix2 partitionfile2 partition_patterns perl /home/douglas/usr_scripts/concatenate_v2.pl insectCOI.pynast2.reducd insect16S.clo.fas insect28S.blastalign.fas insectEF1a.clo.fas insect18S.mafft.fas insectWNT.clo.fas insect12S.clo.fas insectCYTB.clo.fas mv current_supermatrix2 insect8geneA; mv partitionfile2 insect8geneA.partitionfile; mv charsetfile2 insect8gene.charsetfile; # manualy save screen LOG, useful info in there perl /home/douglas/usr_scripts/taxonomic_report.pl -seqfile insect8geneA -output insect8geneA.tax_report -node 6960 -fasta # OLD: D= supermatrix in which all entries have a COI. E= idenitical coi seqs rm # remove super long brnach taxa? or after treesearch #Schistocera_alutacea Petalura_ingentissima Degeeriella_regalis Fannia_pusio Lucilia_eximia Goniozus_jacintae Nipponaetes_haeussleri # to make an image in the paper, get species/gene presence absense info. perl /home/douglas/scripted_analyses/COIevol/scripts/make_matrix_histogram.pl cd /home/douglas/scripted_analyses/COIevol/corrections/figures/ R t1<-read.table("mmhLOG", header=F, row.names=1) #jpeg("FIG_matrix_Log_transformed.jpg" , width = 500 , height = 450) # , res=300 #tiff(filename = "FIG_matrix_Log_transformed.tiff", width = 3200 , height = 3000, units="px", res=600, compression = "jpeg") tiff(filename = "FIG_matrix.tiff", width = 3200 , height = 3000, units="px", res=600, compression = "jpeg") #png("species_level_tree3.png", 2000, 2000) #barplot(sort(t1[,1], 1:length(t1[,1]),decreasing=T), log="xy", xlab = "Locus", ylab = "Number of Species") # cannot truncate, this doesnt really show the boxes appropraitely barplot(sort(t1[,1], 1:length(t1[,1]),decreasing=T), xlab = "Locus", ylab = "Number of Species") dev.off() # NOW HAVE A SUPERMATRIX ######################################################################################################### cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ # delete from the supermatrix # the handful of taxa that are absent from the ncbi taxonomy database. # can do at species or genus level ... # looked at one: Napata is scientific name at sp level but synonym for genus perl ~/usr_scripts/rm_seqs_missing_from_taxonomyDB.pl -node 6960 -seqfile insect8geneA -output insect8geneB #fasta IDs present in tax DB:49338; fasta IDs absent in tax DB:20 # species in the alignment are now slightly different, # need to get a COI only alignment for this modified set, or the tree and coi alignment are not gonna match later # split into individual partitions according to charset file raxmlHPC-8.1.2 -f s -s insect8geneB.phy -m GTRCAT -n insectCOI.test -q insect8geneA.partitionfile perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.gene0_insectCOI_pynast2_reducd.phy insect8geneB.COIonly phylip fasta perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.gene1_insect16S_clo_fas.phy insect8geneB.16Sonly phylip fasta perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.gene2_insect28S_blastalign_fas.phy insect8geneB.28Sonly phylip fasta perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.gene3_insectEF1a_clo_fas.phy insect8geneB.EF1aonly phylip fasta perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.gene4_insect18S_mafft_fas.phy insect8geneB.18Sonly phylip fasta perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.gene5_insectWNT_clo_fas.phy insect8geneB.WNTonly phylip fasta perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.gene6_insect12S_clo_fas.phy insect8geneB.12Sonly phylip fasta perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.gene7_insectCYTB_clo_fas.phy insect8geneB.CYTBonly phylip fasta ############################################################################################### # SKIP: # if theres sp in the backbone tree which are absent in the new data, # these need pruning from the tree first # they make life awkward later. # $process_backbone_tree_terminal_IDs = 1; rm RAxML_bestTree.mtprot_constr.6_tax_rm.pruned perl ~/usr_scripts/prune_tree.pl -seqfile endop3geneD1 -output RAxML_bestTree.mtprot_constr.6_tax_rm.pruned -treefile RAxML_bestTree.mtprot_constr.6_tax_rm # not sure if this is req: backbone_tree=/home/douglas/scripted_analyses/COIevol/corrections/mtgenomes/InsMito_sumtrees.procd perl ~/usr_scripts/prune_tree.pl -seqfile insect8geneB -output mt_backbone.pruned -treefile $backbone_tree # data is just unweildy for constraining (with tnt) and treesearch .... # split sequences and trees by order. # $process_backbone_tree_terminal_IDs = 0; #rm *.order.nwk *.order.taxlist *.order.taxlist2 #perl ~/usr_scripts/split_by_order.pl -node 50557 -seqfile endop3geneC -treefile RAxML_result.mtprot_constr.pruned ## count taxa for each order on the backbone tree: ## Hemiptera:1, * Trichoptera:2,*Siphonaptera:0,Raphidioptera:0,Strepsiptera:0 ## Lepidoptera:199,Diptera:95, Coleoptera:44, Hymenoptera:30, Neuroptera:14, Megaloptera:9, Mecoptera:3 #order=Hymenoptera ##$remove_or_retain = 2 #rm endop3geneC.m_rm $order.fas #perl /home/douglas/usr_scripts/remove_fasta_entries.pl endop3geneC $order.order.taxlist2 #mv endop3geneC.m_rm $order.fas #### make tnt constrints file ### almost complete #perl ~/usr_scripts/infer_constraints.pl -node 50557 -seqfile $order.fas -treefile $order.order.nwk rm $order.phy RAxML*.$order perl ~/usr_scripts/format_conversion.pl $order.fas $order.phy fasta phylip # a few days: raxmlHPC-8.1.2 -s $order.phy -n $order -m GTRCAT -c 25 -g $order.order.nwk.rax_constraint -p 12345 # or, one or two days:. why no -F (avoid final optimization) raxmlHPC-8.1.2 -D -e 1.0 -s $order.phy -n $order -m GTRCAT -c 4 -g $order.order.nwk.rax_constraint -p 12345 # scrap tnt. took me ages to get a script to make the constraints. # and turns out tnt has very limited number of taxa allowed to float. # need to check taxa in fasta file is read in same order as nex file # TNT #rm tnttrees #perl ~/usr_scripts/format_conversion.pl $order.fas tnt_input.nex fasta nexus_concise ############################################################################################### Skip ALSO: # read mitogenome tree, infer constraints to be placed on tree search of barcode-type data # insecta:50557 endop:33392 neoptera (monophyletic wing-folding insects):33340 # Hexapoda:6960 # FULL DATASET, fasttree. # put on the backburner. turns out quite complex to make contraints. # might come back to it. # need the script to determine whether the taxon names are monophyloetic on the tree # if so, they can be constrained. this is probably not too hard, despite often nested labels. # also need to read relationships between taxa, which is more complex. # e.g COL next to STREP, these are two orders of ENDOP, # remaining orders of ENDOP need excluding from the monophy constraint. # were new members present of an order absent on the backbone. # these would need to float. # only easy to visualise at this rank... # # i) does ft accept non-comprehensive tree. ii) does it accept comprehensive non-fully-bifucating tree # if so, you can just input the raxml tree as a ft starting tree. # 3 OUTPUTS: $treefile.rax_constraint; $fas_file.$treefile.tnt; $fas_file.$treefile.ft # hacked to make order level constrinats: #perl ~/usr_scripts/infer_constraints.pl -node 50557 -seqfile endop3geneC -treefile RAxML_result.mtprot_constr.pruned # new script, prev was suible for making constraint tree, this is geared towards fasttree style #perl ~/usr_scripts/infer_flexible_constraints_from_backbone_tree.pl -node 50557 -seqfile endop3geneC -treefile RAxML_result.mtprot_constr.pruned ### 3rd time lucky ... 3rd major iteration of this algorithm. ## prev was problematix since obvious and deep level contraints were often not applied due to tip level ambiguity. ## should instead favour the deep level. #################################################################################### ## **** RESTART HERE **** ## cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ # makes tab-delim table, showing constraints (x), taxa (y), all states # also , if there are constraints with no floating taxa (or, very few), such as orders, # print these to seperate file. these will work with TNT # following, 1:tree constraints in fasttree format. 2=human-readable # 212 taxonomic monophylys set. method is probably a bit liberal, # for example a family that has only 2 species present, but these are sister species, # the family will be set as monophyletic. might want to modify, only if the family has # at least two subfamilies, and the # regular supermatrix:endop3geneD1, identical sequences removed for pplacement: endop3geneDreducd1 rm ftcons fasttree_constraints_tabdelim backbone_tree=/home/douglas/scripted_analyses/COIevol/corrections/mtgenomes/InsMito_sumtrees.procd # $process_backbone_tree_terminal_IDs = 0; perl ~/usr_scripts/read_deep_level_constraints.pl -node 6960 -seqfile insect8geneB -treefile $backbone_tree # -outgroup Acyrthosiphon_pisum # try non-comprehensive raxml constraint tree .. doesnt work as start tree. #fasttree_2.1.7 -nosupport -gtr -gamma -nt -intree RAxML_result.mtprot_constr.pruned.rax_constraint < endop3geneC > endop3geneC.ftresults2 #fasttree_2.1.7 -nosupport -gtr -gamma -nt -constraints endop3geneC.RAxML_result.mtprot_constr.pruned.ft < endop3geneC > endop3geneC.ftresults_extraorderconstr # ** the species dense tree search.... ftc, still trying to get the outgroup in the right place. # d: taxonomic mp's ; write a logfile, it seems pplacer will read ft logs: -log fasttree_logfile # memory usage is 72% for insects, running time constrained 6 hours. # default constraints weight 100;Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000; Search: Normal +NNI +SPR (2 rounds range 10) +ML-NN rm insect8geneB.ftd fasttree_2.1.7 -log fasttree_logfile -gtr -gamma -nt -constraints ftcons < insect8geneB > insect8geneB.ftd.support fasttree_2.1.7 -log fasttree_logfile -gtr -gamma -nt < insect8geneB > insect8geneB.ftd.unconstrained.support # constrained and unconstrained: perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile insect8geneB.ftd -output insect8geneB.ftd.taxlabs -node 6960 -newick perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile insect8geneB.ftd.unconstrained.support -output insect8geneB.ftd.unconstrained.support.taxlabs -node 6960 -newick java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file insect8geneB.ftd.taxlabs java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file insect8geneB.unconstrained.taxlabs # to aid veiwing this massivly complex and polyphyletic tree: perl ~/usr_scripts/taxonomic_prune_dense_tree.pl -treefile insect8geneB.ftd.unconstrained.support -treefileB insect8geneB.ftd.unconstrained.support.taxlabs -output test7 -node 6960 #Schistocera_alutacea Petalura_ingentissima Degeeriella_regalis Fannia_pusio Lucilia_eximia Goniozus_jacintae Nipponaetes_haeussleri # raxml reports there are multifurcations in the ft outtree, # so resolv these first by using as a constraint tree: # also, later you will need a raxml parameters file in the older-version format, so use older version of raxml here. # the the fasttree input file will not be rooted (it doesnt have this option). perl ~/usr_scripts/format_conversion.pl insect8geneB insect8geneB.phy fasta phylip # just did a ft tree, after that, no point resolving this few nodes carefully. -F = avoid final optimization # no branchlengths given under these settings. # -F rm R*endop3geneE.ft # big alignment output ft_constraint raxmlHPC-8.1.2 -F -D -e 1.0 -c 4 -s insect8geneB.phy -n insect8geneC -g insect8geneB.ftd.support -m GTRCAT -p 12345 # output is not rooted at the outgroup .... # very high memory demand at this stage, use newest version of raxml. old one struggles. # v8.1.2 requires about 50% memory of 7.2.8! # get parameters file later, only req if using pplacer. insect8geneC # DEC2015 NOTE, resolution of fasttree polytomies and final optimization , both with raxml, # now done later on, on cluster. # remember to reroot then ultrametricize perl ~/usr_scripts/newick/root_with_bioperl.pl RAxML_bestTree.insect8geneC Acerentomon_microrhinus perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_result.endop3geneD1.ft key_Oct2013_Insecta java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file RAxML_result.endop3geneD1.ft.fams RAxML_result.endop3geneD1.ft.rerooted_at.Acyrthosiphon_pisum # should not have skipped bl optimization .... well, should have , it takes way too much memory. #raxmlHPC-8.1.2 -f e -e 1.0 -s endop3geneD1.phy -n endop3geneD1.ft2 -m GTRCAT -c 4 -t RAxML_result.endop3geneD1.ft -p 12345 -o Acyrthosiphon_pisum # is taking too long, get branchlengths with fasttree; -nome -mllen -intree: optimize blengths for fixed topology #fasttree_2.1.7 -nosupport -gtr -gamma -nt -nome -mllen -intree RAxML_result.endop3geneD1.ft < endop3geneD1 > endop3geneD1.ft2 # problem with ft brnachlengths, some are zero, pplacer complains that placement will fail at these due to 0 LNL # damit, ft creates polytomies, even if you specfy constrained bifurcating tree. # instead, optimize in coi column only, since you need model for COI anyway (actually dont, pplacer uses too much memory), and will be quicker.. # a few need rm: perl ~/usr_scripts/rm_seqs_missing_from_taxonomyDB.pl -node 50557 -seqfile endopCOI.pynast2.reducd.accession_rm -output endopCOI.pynast2.reducd.accession_rm2 perl ~/usr_scripts/format_conversion.pl endopCOI.pynast2.reducd.accession_rm2 endopCOI.pynast2.reducd.accession_rm2.phy fasta phylip # 'tree evaluation' not possible with gtrcat, seems v7 only. rm R*endopD1.ft3 # GTRGAMMA crashes this time: #raxmlHPC-7.2.8-ALPHA -f e -e 1.0 -s endopCOI.pynast2.fas.accession_rm2.phy -n endopD1.ft3 -m GTRGAMMA -c 4 -t RAxML_result.endop3geneD1.ft -p 12345 -o Acyrthosiphon_pisum # and this one working. tempermental, try both. #raxmlHPC-8.1.2 -f e -e 1.0 -c 4 -s endopCOI.pynast2.fas.accession_rm2.phy -n endopD1.ft3 -m GTRCAT -t RAxML_result.endop3geneD1.ft -p 12345 -o Acyrthosiphon_pisum # split up the supermatrix used, so taxa are precisly the same. doesnt work since columns are rm during raxml reduction, parition filei is not updated # for using pplacer later, try making a better model raxmlHPC-8.1.2 -f e -e 0.1 -c 12 -s endopCOI.pynast2.fas.accession_rm2.phy -n endopD1.ft3b -m GTRCAT -t RAxML_result.endop3geneD1.ft -p 12345 -o Acyrthosiphon_pisum # print list of taxa with long branches. set length of branch at which taxa need removing. # option in this script to 'unroot', which pplacer prefers. # aslo consider processing branchlengths, check for presence of 0 length rm RAxML_result.endopD1.ft3b.BL1 RAxML_result.endopD1.ft3b.pruned_taxlist perl ~/usr_scripts/prune_long_branches.pl RAxML_result.endopD1.ft3b 0.8 RAxML_result.endopD1.ft3b.BL1 # update alignment, take out taxa that were pruned from the tree. #$remove_or_retain = 2 rm endopCOI.pynast2.fas.accession_rm2.m_rm perl ~/usr_scripts/remove_fasta_entries.pl endopCOI.pynast2.fas.accession_rm2 RAxML_result.endopD1.ft3b.pruned_taxlist # make tree ultrametric. # first tried NEOPTERA Acyrthosiphon_pisum Drosophila_melanogaster, later raxmlEPA complained about the newick string, # endop, makes no diff to that problem... # insect8geneC rm RAxML_result.endopD1.ft3.BL1.treePL treePL configfile_Jan16 treefile = RAxML_result.endopD1.ft3.BL1 smooth = 100 numsites = 650 mrca = ENDOP Apis_dorsata Drosophila_melanogaster min = ENDOP 10 max = ENDOP 10 outfile = RAxML_result.endopD1.ft3.BL1.treePL mrca = NEOPTERA Acyrthosiphon_pisum Drosophila_melanogaster min = NEOPTERA 10 max = NEOPTERA 10 # RAxML_bestTree.insect8geneC.rerooted_at.Acerentomon_microrhinus treefile = RAxML_bestTree.insect8geneC smooth = 100 numsites = 650 mrca = ENDOP Apis_dorsata Drosophila_melanogaster min = ENDOP 10 max = ENDOP 10 outfile = insect8geneC.treePL # treePL makes some branch to nowhere ... #((INTERNAL_NODE_38790:2.438137,Acyrthosiphon_pisum:12.438137):0.061855); # output is XXX.b perl /home/douglas/usr_scripts/process_treePL_outtree.pl RAxML_result.endopD1.ft3.BL1.treePL perl /home/douglas/usr_scripts/print_long_branches.pl RAxML_result.endopD1.ft3.BL1.treePL 100 outputt #$which_rank = 3(order+family); $labels_used = 1 (binomial) perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_result.endopD1.ft3.BL1.treePL key_Oct2013_Insecta java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file RAxML_result.endopD1.ft3.BL1.treePL.fams # Acerentomon_microrhinus # ***** SPECIES LEVEL REFERENCE TREE IS NOW COMPLETE ***** # need to modify branchlengths in line with backbone tree, # maybe part-ultrametricize the tree would look nice. # NOV2015, # make species dense tree just using the genomic constraints, # requires combing mtgenome data and species desnse, then mixed partition. # actually, only ~800 mtgenomes, at least many thousdand in the sp-dense, # and awkward to implement, dont bother using the mtgenome parition for this one, # rm backbone_constrained.BOOT* perl ~/usr_scripts/process_backbone_tree_for_setting_constraints.pl -newick -node 6960 -boot_trees insectaNUCL.all_boots.rand -supermatrix insect8geneB -output backbone_constrainedTEST perl ~/usr_scripts/format_conversion.pl insect8geneB insect8geneB.phy fasta phylip # skip all this, its too slow # for the 100 nuclear boots, make a sp dense constraint tree #rm ft_BOOT*.fasttree_format_constraint #for i in {1..100}; do #perl ~/usr_scripts/newick_backbone_to_fasttree_constraint_format.pl -treefile backbone_constrainedTEST.BOOT$i -references insect8geneB -node 6960 -output ft_BOOT$i #done # # bootstrap the sp dense alignment, it makes v big files ... #rm insect8geneB.phy.BS* #raxmlHPC-8.1.2 -f j -s insect8geneB.phy -n NULL -m GTRCAT -b 12345 -# 100 # #raxmlHPC-8.1.2 -D -e 1.0 -F -s insect8geneB.phy.BS$i -n insect8geneB_B$i -m GTRCAT -p 12345 -g backbone_constrainedTEST.BOOT$i & # # this could be done with tnt, although awkward to make the constraints input #-log ftBS_logfile # ~2.5 gb memory for one search # -gamma is not neccesary for a bootstrap, used for getting more accurate lnl's in toppology stats #rm RAxML*insect8geneB_B* # rm insect8gene*ft_sd_bs # fasttree outputs # rm insect8geneB.fas.BS* # # requires 4.2gb memory per tree, 4 hours #for i in {1..10}; do #perl ~/usr_scripts/format_conversion.pl insect8geneB.phy.BS$i insect8geneB.fas.BS$i phylip fasta #fasttree_2.1.7 -gtr -nt -nosupport -constraints ft_BOOT$i.fasttree_format_constraint < insect8geneB.fas.BS$i > insect8gene.BS$i.ft_sd_bs # ok, this is gonna take forever. so, just use the nuclear ome consensus tree and do a single fasttree seach cat /home/douglas/scripted_analyses/COIevol/corrections/results/RAxML_bootstrap.insectaNUCL.boot.RM* > all_nucl_ome_boots sumtrees.py --min-clade-freq=0.0 --log-frequency=10 --to-newick --replace --support-as-labels --burnin=0 --output=all_nucl_ome_boots.sumtrees all_nucl_ome_boots perl ~/usr_scripts/newick/process_raxmlEPA_outtree.pl all_nucl_ome_boots.sumtrees perl ~/usr_scripts/process_backbone_tree_for_setting_constraints.pl -newick -node 6960 -boot_trees all_nucl_ome_boots.sumtrees.procd -supermatrix insect8geneB -output insect8gene.nucl_constr perl ~/usr_scripts/newick_backbone_to_fasttree_constraint_format.pl -treefile insect8gene.nucl_constr.BOOT1 -references insect8geneB -node 6960 -output insect8gene.nucl_constr fasttree_2.1.7 -gtr -gamma -nt -constraints insect8gene.nucl_constr.fasttree_format_constraint < insect8geneB > insect8gene.nucl_ome_sumtree_constrained perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile insect8gene.nucl_ome_sumtree_constrained -output insect8gene.nucl_ome_sumtree_constrained.tax -node 6960 -newick # should have severla full trees, pruned (if neccesary) so perfectly overalpping, # then do robinson folds , tRI, comapre support etc. Carle, F. L. (1986). The cl (Ramulus + (Heteropteryx + (Phyllium + (Neohirasea + Sipy-loidea)))) and (Phyllium + ((Heteropteryx + Ramulus) + (Neohirasea + Sipyloidea))) in the previous and present examinations, respectivel (Sipyloidea,(Heteropteryx,(Ramulus, (Phyllium,Neohirasea)))) Whiting, M.F., Bradler, S., Maxwell, T., 2003. Loss and recovery of wings in stick insects. Nature 421, 264–267. # 4 nuclear ome trees, easy to look at and printed: /home/douglas/scripted_analyses/COIevol/corrections/results/nucl_ome_RM.tif # mtgenome constrained: /home/douglas/scripted_analyses/COIevol/corrections/mtgenomes/InsMito_sumtrees.procd #; unconstrained: /home/douglas/scripted_analyses/COIevol/corrections/mtgenomes/InsMito_UNconstr.sumtrees.procd # SPECIES DENSE, unconstrained: insect8geneB.ftd.unconstrained.support # nucl-mt constrained insect8geneB.ftd # consensus1000nucl constrained: insect8gene.nucl_ome_sumtree_constrained.tax java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file insect8gene.nucl_ome_sumtree_constrained.tax java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ # run some steps on cluster: # already has new verison of raxml installed # copy input files to cluster: scp insect8geneB.phy wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/insect8geneB.phy scp insect8geneB.ftd.support wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/insect8geneB.ftd.support scp insect8geneA.partitionfile wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/insect8geneA.partitionfile scp cluster_job_submit1 wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/cluster_job_submit1 #PBS -N DC_raxml_1 #PBS -l nodes=1:ppn=1 #PBS -q big #PBS -d /home/wei/DC/insectTreeDec2015 #PBS -j oe /home/wei/software/standard-RAxML-master/raxmlHPC-SSE3 -D -e 1.0 -c 4 -s insect8geneB.phy -n insect8geneC -g insect8geneB.ftd.support -m GTRCAT -p 12345 -q insect8geneA.partitionfile # login: ssh -l wei 10.0.0.181 # wei789 99247.admin qstat -f 99252.admin # 2nd go, run with partition file, will need coi and cyt specific parameters for later ... # copy results to local directory scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/R*insect8geneC . # just for fun (and suggested by editor), try and analyse a single matrix of all 3 partitions # all should have binomial labels nucl_matrix=/home/douglas/scripted_analyses/COIevol/corrections/transcriptomes/insectaNUCL.smatrix.RM1.epu mt_matrix=/home/douglas/scripted_analyses/COIevol/corrections/mtgenomes/InMT.smatrix barcode_matrix=/home/douglas/scripted_analyses/COIevol/corrections/species_dense/insect8geneB # $remove_accession = 0; $required_data = ""; perl ~/usr_scripts/concatenate_v2.pl $barcode_matrix $mt_matrix $nucl_matrix mv current_supermatrix2 big_matrix fasttree_2.1.7 -gtr -nt < big_matrix > big_matrix.out perl ~/scripted_analyses/COIevol/scripts/calcualte_missing_data.pl big_matrix # total missing_data_chars:4150803792;non_missing_total_length:64750642 4150803792+64750642=4215554434 4150803792/4215554434=0.98464 endopCOI.pynast2.fas.accession_rm 84074 chars perl ~/scripted_analyses/COIevol/scripts/calcualte_missing_data.pl $nucl_matrix perl ~/scripted_analyses/COIevol/scripts/calcualte_missing_data.pl $mt_matrix perl ~/scripted_analyses/COIevol/scripts/calcualte_missing_data.pl $barcode_matrix #rm R*endop3geneC.ftstarttree2 #raxmlHPC-8.1.2 -F -s endop3geneC.phy -n endop3geneC.ftstarttree2 -m GTRCAT -c 25 -t RAxML_result.endop3geneC.ftstarttree1 -p 12345 # ft does not have a outgroup option, so: #perl ~/usr_scripts/newick/root_with_bioperl.pl perl ~/usr_scripts/insert_family_name_into_newick.pl endop3geneC.ft.27MAR2015.rerooted_at.Acyrthosiphon_pisum key_Oct2013_Insecta java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file endop3geneC.ft.27MAR2015.rerooted_at.Acyrthosiphon_pisum.fams # the taxonomic names for each node were inferred. these ranged from the most inclusive (lowest rank shared by all decendent species) to the taxon one rank below that assigned to the parent node in the tree. in practice the majority of nodes had a single taxon name, although some had >1. the members of the species dense were assigned to nodes of this tree (through extension of polytomies), however the frequent duplication of taxon names through the tree meant this was an ambiguous. the omic tree serves as a backbone, constraining search in species dense data. new species were placed within these constraints via sharing of taxonomic labels with those assigned to internal nodes (as the encompess decendents in the omics tree) however the placement of new species within these contraints is frequently ambiguous. A shows unambiguos case, where the species XXX, a member of the family colletidaer is placed by extension of a polytomy in the node assigned that name (decended by the only two members of the same family found in the tree, monophyletic), However choice of two nodes for placment of Apis andrenafloris similarly, members of the tribe apini exceping apis (such as cxxx) can be placed at the base of the apis. a conservative approach would be not placing these, or collapsing the constraint cat endop3gene missing >endop3gene.hack perl ~/usr_scripts/format_conversion.pl endop3gene endop3gene.phy fasta phylip perl ~/usr_scripts/format_conversion.pl endop3gene.hack endop3gene.hack.phy fasta phylip # one or two days: raxmlHPC-8.1.2 -D -e 1.0 -s endop3gene.hack.phy -n rax345 -m GTRCAT -c 4 -g newick_string5 -p 12345 # or make a better tree .. about a week: raxmlHPC-8.1.2 -s invCOI.blastout.retreived.ID_filtered.mothur.phy -n rax3 -m GTRCAT -c 25 -g blashnb -p 12345 # sucessfully got topology, but crashed at bracnh length optimization, so repeat that: raxmlHPC-8.1.2 -f e -s invCOI.blastout.retreived.ID_filtered.mothur.phy -n 24Dec -m GTRCAT -c 25 -t RAxML_result_copy.rax3 #$which_rank = 1; # 1=family, 2=order, 3=order+family #$labels_used = 1; #1=binomial. 2=genus only rm RAxML_result.Hymenoptera.fams perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_result.Hymenoptera key_Oct2013_Insecta # new key file: /home/douglas/scripted_analyses/COIevol/analysis/key_Oct2013_Insecta java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* ################################################################################################################ # get report on taxonomic groups found in files fileA=/home/douglas/scripted_analyses/COIevol/transcriptomes/transcriptomes.fs fileA=/home/douglas/scripted_analyses/COIevol/mtgenomes/mtprotein fileA=/home/douglas/scripted_analyses/COIevol/genomics/insect_refseq_genera_proteins.b # this is the alignment used to make the sp level tree, but with pruned taxa taken out fileA=/home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2.m_rm perl /home/douglas/usr_scripts/analyse_database2.pl $fileA /home/douglas/scripted_analyses/COIevol/genomics/key_Oct2013_Arthropoda ~200 genera in the mtgenome matrix 18 ORDERS overlapping mtgenome and nuclear genome trans genome MT Archaeognatha X X Blattodea X X Coleoptera X X Dermaptera X X Diptera X X Embioptera X X Ephemeroptera X Grylloblattodea X Hemiptera X X Hymenoptera X X Isoptera X X Lepidoptera X X Mantodea X X Mecoptera X X Megaloptera X X Neuroptera X X Odonata X X Orthoptera X X Phasmatodea X Phthiraptera X Plecoptera X Psocoptera X Raphidioptera X Siphonaptera X X Strepsiptera X Thysanoptera X X Trichoptera X Zoraptera X mtprotein Archaeognatha Blattodea Coleoptera Dermaptera Diptera Ephemeroptera Hemiptera Hymenoptera Isoptera Lepidoptera Mantodea Mecoptera Megaloptera NA Neuroptera Odonata Orthoptera Phasmatodea Plecoptera Psocoptera Siphonaptera Thysanoptera ################################################################################################################ # ***** PHYLOGENETIC PLACEMENT OF QUERIES ***** # need to test against unconstrained tree. # not obvious to me how a stat threshold would be obtained for if differences between assignment method were significant, for example berger 2011 just seem to list distances to correct values, for each method. Weitschek 2014; does pairwise wilcox between prorposed method and each other, also good idea, they split it up for several taxonomic groups could do leave one out type analysis, this would quantify incorrect assignments, then get seqs filtered during species filtering also. benchark dataset was generated according to chesters et al, with the advantage that no chages are required to the reference tree, an alternative apprach being 'leave-one-out' (e.g. Austerlitz et al 2009). Frederic Austerlitz, Olivier David, Brigitte Schaeffer, Kevin Bleakley, Madalina Olteanu, Raphael Leblois, Michel Veuille, Catherine Laredo DNA barcode analysis: a comparison of phylogenetic and statistical classification methods BMC Bioinformatics. 2009; 10(Suppl 14): S10. # NOTES on potential query files. # # i) # can 'test' with named reference data, # heberts original dna barcode 'test' members, # while numbering only ~40 in sequence for the endopterygota, these are taxonomically diverse, each from a unique family. # /home/douglas/scripted_analyses/COIevol/files/hebert_endop_test_gi_numbers # there were 25 of these in the unfiltered COI db # some of these are species with lots of sequences, so the particular hebert seq would be filtered out. # there will be a small number of the 40, which are still in the coi barcode set. # could just use these as some kind of control. # ii) # also, get daily releases like last time. # and present 're-analyses' of insect otu studies that didnt provide names. # 1) bee data from MLCO paper. # 2) primate_diet_cytb.mafft;primate_diet_cytb.aligned_to_references # in the paper they made identifications to level of insect family only. # 3) porter2014, insect malaise 1000 COI seqs. # can inclue coi barcode and mtgenomic ... # relevent datasets found by looking for unclassified taxa (or environmental samples) in ncbi taxonomy website. ############################################################################################## # **** USED QUERY FILES **** # Porter et al. 2014. 1050 seqs, order level labels. singe malaise sample. Rapid and accurate taxonomic classification of insect (Class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a na?ve Bayesian classifier http://datadryad.org/resource/doi:10.5061/dryad.bc8pc a second sample was used in the paper, but unrelated marine one. file name malaise.fasta gibson 2014 **** turns out SAME DATASET AS PORTER!! 1066 arthropods (or 1046). assigned at various ranks at various rtates. manually process the fasta file. rm ' cytochrome oxidase subunit I-like (CO1) gene, partial sequence; mitochondrial' then remove spaces. rm semicolon Simultaneous assessment of the macrobiome and microbiome in a bulk sample of tropical 1066 individuals from a single malaise sample. sangar then further NGS of tissue from each individual. http://www.ncbi.nlm.nih.gov/popset?LinkName=pubmed_popset&from_uid=24808136 1052 sequences, ~310 bp. Yu2012 (673 seqs, 614bp). each seq labelled as from honghe, XSBN, KMG. could be used in colored labelling of tree. Biodiversity soup: metabarcoding of arthropods for rapid biodiversity assessment and biomonitoring Yu,D.W., Ji,Y., Emerson,B.C., Wang,X., Ye,C., Yang,C. and Ding,Z. (2012) Methods Ecol Evol Shokralla2014_BP, both plates, prob. sangar need processing:manually rm gaps from Shokralla dataset Next-generation DNA barcoding: using next-generation sequencing to enhance and accelerate DNA barcode capture from single specimens. Shokralla,S., Gibson,J.F., Nikbakht,H., Janzen,D.H., Hallwachs,W. and Hajibabaei,M. (02-19-2014) Mol Ecol Resour lepidoptetras, mt genoomic dna, a couple other datasets also 95 specimens reared for ngs study, 95 reared for dna barcoding as far as i can see, sangar sequence plate 1 and two are on dryad, but not ncbi, all the ngs data is on ncbi thus, sanger plate data would be good for method validation, and ngs for pretty re-analysis sequences labels seems have species names, so use for benchmarking FILE processing: fasta ids are just sequenctial numbers, so rename with a and b according to the two files, then combine into one. cd /home/douglas/scripted_analyses/COIevol/files/ cat Shokralla2014_454a Shokralla2014_454b > Shokralla2014_454 tried derepicating, but seems processed already. Stahlhut2013, Hymenoptera. 7478 seqs. no pretty trees in their paper. their complete dataset crashes. one problem is 1/3 of the data are duplicate sequence cannot dereplicate the unaligned file with raxml, so have to do that later. ZHOU2013 /home/douglas/scripted_analyses/genomeMOTU/zhao2013 # make a reduced file, full genomic scafseq file takes forever just to align, seems theree is a small number with identifications given, test on these, cd /home/douglas/scripted_analyses/genomeMOTU/zhao2013/ perl ~/scripted_analyses/genomeMOTU/scripts/process_zhao_file.pl K61dR2_k45.scafSeq OTU_and_scaf_table makees file: /home/douglas/scripted_analyses/genomeMOTU/zhao2013/K61dR2_k45.scafSeq.proc PICKETT2012 /home/douglas/scripted_analyses/COIevol/analysis/primate_diet_cytb # .mafft clustered metagenomic data at 2 alternative thresholds preliminary taxonomic assignments only made, assigning to family via blast top hit. # copy NEW QUERY datasets cp /home/douglas/scripted_analyses/COIevol/files/malaise.fasta porter2014 cp /home/douglas/scripted_analyses/COIevol/files/Gibson2014 Gibson2014 cp /home/douglas/scripted_analyses/COIevol/files/Yu2012 Yu2012 cp /home/douglas/scripted_analyses/COIevol/files/Shokralla2014_BP Shokralla2014_BP cp /home/douglas/scripted_analyses/COIevol/files/Stahlhut2013 Stahlhut2013 cp /home/douglas/scripted_analyses/COIevol/benchmark/invDailyQ.blastout.retreived invDailyQ cp /home/douglas/scripted_analyses/COIevol/files/Shokralla2014_454 Shokralla2014_454 # DEC2015 benchmarks:invDQ.Heteroneura invDQ.Apocrita invDQ.Diptera invDQ.Coleoptera invDQ.Panheteroptera # copy COI reference alignment. this has accessions removed from IDs ... # and only taxa included in the supermatrix (some filtered e.g. on use taxa with a COI) #cp /home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2 endopCOI.pynast2.fas.accession_rm2 # and taxa corresponding to nodes with long brnaches rm: cp /home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2.m_rm endopCOI.pynast2.fas.accession_rm2.m_rm # ft->quick raxml, has no multifurcations, but no brnahclengths #cp /home/douglas/scripted_analyses/COIevol/analysis/endop3geneD1.ft endop3geneD1.ft # not ideal, but ft->rax->ft #cp /home/douglas/scripted_analyses/COIevol/analysis/endop3geneD1.ft2 endop3geneD1.ft2 # ft->rax(resolve)->rax(CO1 only) cp /home/douglas/scripted_analyses/COIevol/analysis/RAxML_result.endopD1.ft3 RAxML_result.endopD1.ft3 # ft(species_level)->rax(resolve)->rax(CO1 only)->PL cp /home/douglas/scripted_analyses/COIevol/analysis/RAxML_result.endopD1.ft3.BL1.treePL.b RAxML_result.endopD1.ft3.BL1.treePL.b # pplacer wants raw tree: cp /home/douglas/scripted_analyses/COIevol/analysis/RAxML_result.endopE1.ft3b RAxML_result.endopE1.ft3b # DEC2015: cp /home/douglas/scripted_analyses/COIevol/corrections/species_dense/insect8geneB.COIonly insect8geneB.COIonly cp /home/douglas/scripted_analyses/COIevol/corrections/species_dense/RAxML_bestTree.insect8geneC RAxML_bestTree.insect8geneC # invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera queryfile=invDQ.Carabidae perl ~/usr_scripts/taxonomic_report.pl -seqfile $queryfile -output $queryfile.tax_report -node 6960 -fasta # ****** START ****** # another possibility, theres a paper i named ******, they use hmmer for alignment, look into that cd /home/douglas/scripted_analyses/COIevol/classification/ # file names: porter2014;Gibson2014;Yu2012,Shokralla2014_BP,Stahlhut2013,invDailyQ,Shokralla2014_454 # DEC2015: # reference COI alignment:insect8geneB.phy.gene0_insectCOI_pynast2_reducd.phy OR insect8geneB.COIonly # reference tree:RAxML_bestTree.insect8geneC # benchmarks:invDQ.Heteroneura invDQ.Apocrita invDQ.Diptera invDQ.Coleoptera invDQ.Panheteroptera # structure of those benchmarks is wierd, and some unweildy. made some new ones: # tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) queryfile=Shokralla2014_454 reference_alignment=insect8geneB.COIonly referencetree=RAxML_bestTree.insect8geneC # from raxml v7.2.8, during resolution of ft polytomies: # raxml_info_file=/home/douglas/scripted_analyses/COIevol/analysis/RAxML_info.endop3geneD1.ft # rax v7.2.8, using COI only, # dont need info file, unless using pplacer # raxml_info_file=/home/douglas/scripted_analyses/COIevol/analysis/RAxML_info.endopD1.ft3 # this old info file seems only one in right format: # raxml_info_file=/home/douglas/scripted_analyses/COIevol/analysis/RAxML_info.24Dec # ALIGNMENT OF QUERIES TO REFS # porter COI, all 310bp, gibson, curiously , the same. Yu2012 614bp # PORTER, GIBSON, INV_DAILIES # --pairwise_alignment_method = 'muscle','mafft','clustal','pair_hmm','blast','uclust' # muscle was ****, clustal slightly less ****, so, use the least ****. rm $queryfile.pynast1 $queryfile.pynast2 pynast --pairwise_alignment_method clustal -i $queryfile -t $reference_alignment --min_pct_id=80 --min_len=300 --fasta_out_fp=$queryfile.pynast1 perl ~/usr_scripts/format_conversion.pl $queryfile.pynast1 $queryfile.pynast2 fasta fasta rm $queryfile.pynast2.andrefs cat $reference_alignment $queryfile.pynast2 > $queryfile.pynast2.andrefs # YU, SHOKRALLA, STAHLHUT, Shokralla2014_454 # shokralla at threshold 0.7 was awful mothur "#align.seqs(candidate=$queryfile, template=$reference_alignment, search=kmer, align=needleman, threshold=0.9)" # try screen.seqs next ## change to fasta_missing_all format , changes non-standard gap char '.', used by mothur rm Stahlhut2013.mothur perl ~/usr_scripts/format_conversion.pl Stahlhut2013align Stahlhut2013.mothur fasta fasta_missing_all cat $reference_alignment $queryfile.mothur > $queryfile.pynast2.andrefs #### STAHLHUT crashes epa, split in two manually. ####cat $reference_alignment $queryfile.mothurB > $queryfile.pynast2.andrefsB # thats no good for printing tree results, so try dereplicating alignment. perl ~/usr_scripts/format_conversion.pl Shokralla2014_454align Shokralla2014_454.mothur fasta fasta_missing_all perl ~/usr_scripts/format_conversion.pl Shokralla2014_454.mothur Shokralla2014_454.mothur.phy fasta phylip raxmlHPC-8.1.2 -f c -s Shokralla2014_454.mothur.phy -m GTRCAT -n Shokralla2014_454derep perl ~/usr_scripts/format_conversion.pl Stahlhut2013.mothur.phy.reduced Stahlhut2013.mothur.derep phylip fasta cat $reference_alignment Stahlhut2013.mothur.derep > $queryfile.pynast2.andrefs # ZHOU2013 , PICKETT2012 # for these , will be aligning to a supermatrix, as opposed to COI only... cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ #$remove_accession = 1;$missing_data_char = "?";# $required_data = "Y??"; rm charsetfile2 current_supermatrix2 partitionfile2 partition_patterns perl /home/douglas/usr_scripts/concatenate_v2.pl insectCOI.pynast2.reducd insect16S.clo.fas insectCYTB.clo.fas mv current_supermatrix2 insect3geneA; mv partitionfile2 insect3geneA.partitionfile; mv charsetfile2 insect3geneA.charsetfile; cd /home/douglas/scripted_analyses/COIevol/classification/ cp /home/douglas/scripted_analyses/COIevol/files/primate_diet_cytb primate_diet_cytb # this was too big: #cp /home/douglas/scripted_analyses/genomeMOTU/zhao2013/K61dR2_k45.scafSeq Zhao2013 # this just has the idnitified scafolds in: cp /home/douglas/scripted_analyses/genomeMOTU/zhao2013/K61dR2_k45.scafSeq.proc Zhao2013 cp /home/douglas/scripted_analyses/COIevol/corrections/species_dense/insect3geneA insect3geneA # query names: primate_diet_cytb, Zhao2013 queryfile=Zhao2013 reference_alignment=insect3geneA rm $queryfile.pynast1 $queryfile.pynast2 pynast --pairwise_alignment_method clustal -i $queryfile -t $reference_alignment --min_pct_id=80 --min_len=300 --fasta_out_fp=$queryfile.pynast1 perl ~/usr_scripts/format_conversion.pl $queryfile.pynast1 $queryfile.pynast2 fasta fasta rm $queryfile.pynast2.andrefs # cat $reference_alignment.overlapping $queryfile.pynast2 > $queryfile.pynast2.andrefs cat $reference_alignment $queryfile.pynast2 > $queryfile.pynast2.andrefs # mothur alignment of cytb is so bad, there are some sequences longer than others. # pynast fails on zhou13 # mothur "#align.seqs(candidate=$queryfile, template=$reference_alignment, search=kmer, align=needleman, threshold=0.9)" ## try screen.seqs next ## change to fasta_missing_all format , changes non-standard gap char '.', used by mothur # rm Stahlhut2013.mothur # perl ~/usr_scripts/format_conversion.pl Stahlhut2013align Stahlhut2013.mothur fasta fasta_missing_all # actually, more complicated than that, the illumina seqs are contiguous genomic seqs, # cannot align them directly to a concatenated superamtrix, # try aligning seperatily to single genes, then concat both query and refs. # insect8geneB.COIonly, insect8geneB.16Sonly, insect8geneB.CYTBonly queryfile=primate_diet_cytb cp /home/douglas/scripted_analyses/COIevol/corrections/species_dense/insect8geneB.CYTBonly insect8geneB.CYTBonly reference_alignment=insect8geneB.CYTBonly rm $queryfile.$reference_alignment.pynast1 $queryfile.$reference_alignment.pynast2 pynast --pairwise_alignment_method clustal -i $queryfile -t $reference_alignment --min_pct_id=80 --min_len=300 --fasta_out_fp=$queryfile.$reference_alignment.pynast1 /home/douglas/software/sina/sina-1.2.11/sina -i $reference_alignment -o $reference_alignment.arb --prealigned --intype fasta /home/douglas/software/sina/sina-1.2.11/sina -i $queryfile -o $queryfile.$reference_alignment.sina --ptdb $reference_alignment.arb --intype fasta --outtype fasta perl ~/usr_scripts/format_conversion.pl $queryfile.$reference_alignment.sina $queryfile.$reference_alignment.sina2 fasta fasta_missing_all # ****, sina output is not even aligned. # finally, onto a winner, for cytb at least: perl ~/usr_scripts/change_tobycode.pl $queryfile clustalo-1.2.0 --profile1=$reference_alignment --in=$queryfile.recoded -outfile=$reference_alignment.$queryfile.CLO --seqtype=DNA --outfmt=fa --threads=4 --log=clustallogfile # $missing_data_char = "?";$remove_accession = 0;$required_data = ""; perl ~/usr_scripts/concatenate_v2.pl insect8geneB.COIonly $reference_alignment.$queryfile.CLO mv current_supermatrix2 $queryfile.pynast2.andrefs # ....... pynast crashes on these, and mothur freezes the computer. # dont have timne for this right now, might come back to this later # all coi query to ref alignemtns are fine, cytb alignment did not work, thus alignment to superamtrix is not appropriate, # instead will need to align to homolog only, then re-make superamtrix. same for zhou2013 except multigene queries. # actually, maybe save this multi-gene tree taxonomy for the next paper! # hopefully wont need this, using on reference taxa with a coi. #$include_nonoverlapping_taxa_in_fasta_file = 0; #perl ~/usr_scripts/parse_taxa_overlapping_newick_and_fasta.pl $referencetree $reference_alignment # branchlengths are lost when you do this (needed by pplacer). need to do another round of optimization! ...... cd /home/douglas/scripted_analyses/COIevol/classification/ ####################################################################################### # taxonomic placement start # PPLACER # you need to supply model information (RAxML_info...) for the references. # seems not to work with info file from raxmlHPC-8.1.2. euk16s study mostly used raxmlHPC-7.2.8-ALPHA # cd /home/douglas/scripted_analyses/COIevol/classification/pplacer/ # JAN16 reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly referencetree=/home/douglas/scripted_analyses/COIevol/classification/RAxML_bestTree.insect8geneC tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) queryfile=invDQ.Carabidae cp /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 $queryfile.pynast2 # ludicrous problem, pplacer gets confused unless file name has .fa: cp $queryfile.pynast2.andrefs $queryfile.pynast2.andrefs.fa $ processed treee mihg t not be appropriate with info file, so use the initial tree output. endopD1.ft3 # endop: /home/douglas/scripted_analyses/COIevol/analysis/RAxML_info.endopE1.ft3b # insecta: /home/douglas/scripted_analyses/COIevol/corrections/species_dense/RAxML_info.insect8geneC info_file=/home/douglas/scripted_analyses/COIevol/corrections/species_dense/RAxML_info.insect8geneC perl ~/usr_scripts/raxml_v8_infofile_to_pplacer_input.pl $info_file RAxML_info.insect8geneC.pplacer_format # instead, manually take out non coi partitions. Using 8 distinct models/data partitions with joint branch length optimization # this is the info file from bl optimization of coi on fixed tree. v8. note there are subsequent trees raxml_info_file=RAxML_info.insect8geneC queryfile=invDQ.Carabidae #the_reference_alignment=/home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2 the_reference_alignment=/home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2.m_rm # has pruned rm # JAN16: the_reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly cat $the_reference_alignment $queryfile.pynast2 > pplacer_input.fa #cat $the_reference_alignment $queryfile.pynast2 > pplacer_input.fa cat $the_reference_alignment test1q > pplacer_input.fa old_infofile=RAxML_info.endopE1.ft3b.pplacer_format # prev:$queryfile.pynast2.andrefs.fa # problem with this tree:RAxML_result.endopD1.ft3.BL1 ... seems due to extra parentheses at root, and now unrooted by prune script. # still error msg:-m GTR --model-freqs --gamma-alpha 1 --ml-tolerance 0.1 --start-pend 0.05 --max-pend 0.2 --no-pre-mask --always-refine pplacer --mmap-file pplacer-mmap-file --keep-at-most 1 -t $referencetree -s $old_infofile pplacer_input.fa guppy-v1.1 fat pplacer_input.jplace queryfile=porter2014 #sequences for 10 references copied and used as queries. endopD1.ft3b pplacer --mmap-file pplacer-mmap-file --keep-at-most 1 -t $referencetree -s $raxml_info_file $queryfile.pynast2.andrefs.fa # out of memory. try --mmap-file; or run using ioz cluster cat /home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.reducd.accession_rm2 test_queries > pplacer_input.fa pplacer --mmap-file pplacer-mmap-file --keep-at-most 1 -t RAxML_result.endopE1.ft3b -s RAxML_info.endopE1.ft3b.pplacer_format pplacer_input.fa for that guy cd /home/douglas/scripted_analyses/COIevol/classification/ rm reference_alignment reference_tree raxml_info cp /home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.reducd.accession_rm2 reference_alignment # manually take last 10 seqs and put into file test_queries cp RAxML_result.endopE1.ft3b reference_tree cp RAxML_info.endopE1.ft3b.pplacer_format raxml_info virtualenv taxit-env source taxit-env/bin/activate pip install taxtastic taxit create -l COI -P COI.refpkg --aln-fasta reference_alignment --tree-stats raxml_info --tree-file reference_tree cat reference_alignment test_queries > pplacer_input.fa pplacer --mmap-file pplacer-mmap-file --keep-at-most 1 -c COI.refpkg pplacer_input.fa #'Warning: encountered -inf for final likelihood.' ############################################################################################################### ############################################################################################################### ***** RAXML EPA ***** cd /home/douglas/scripted_analyses/COIevol/classification/raxml/ # remember porter and gibson are the same dataset! # set file names #reference_alignment=/home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2.m_rm #referencetree=/home/douglas/scripted_analyses/COIevol/classification/RAxML_result.endopD1.ft3.BL1.treePL.b # Stahlhut2013 is upper limit on laptop, about 4000 queries in dereplicated set. # which takes a full overnight, uses all computer resources plus most the swap memory, # output files are printed very slowly afterwards, but completes eventually... just wait. # DEC2015: # benchmarks:invDQ.Heteroneura invDQ.Apocrita invDQ.Diptera invDQ.Coleoptera invDQ.Panheteroptera # benchmarksB:tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) # porter2014; Yu2012, Stahlhut2013, later, not sure which one of these:Shokralla2014_BP, reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly referencetree=/home/douglas/scripted_analyses/COIevol/classification/RAxML_bestTree.insect8geneC queryfile=Yu2012 # tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) cp /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2.andrefs $queryfile.pynast2.andrefs perl ~/usr_scripts/format_conversion.pl $queryfile.pynast2.andrefs $queryfile.pynast2.andrefs.phy fasta phylip # "-f v": classify a bunch of environmental sequences into a reference tree # -G EPA heuristics, fraction of insertion branches to be evaluated using slow insertions # reasonable settings for heuristic EPA threshold range between 0.015625 (1/64) and 0.5 (1/2). # 1000 queries; -G=0.1 takes 5hours. -G=0.5 takes 16hours # at -G 0.08, quality of asignments are worse than blast and MP-EPA, even though MP-EPA takes fraction of the time # reference tree is modified during epa, changes root, and branchlengths! # -o Acyrthosiphon_pisum rm R*$queryfile.raxmlEPA.G-0.12 # ***** tree output alignment for i in ${!tax[*]}; do queryfile=${tax[$i]};echo "number:$i ggroup:$queryfile" raxmlHPC-8.1.2 -f v -G 0.24 -c 4 -t $referencetree -m GTRCAT -n $queryfile.raxmlEPA.G-0.24 -s $queryfile.pynast2.andrefs.phy done # ***** # this script simply i) takes of QUERY___, ii) removes raxml branch labels. # file will be too big to do manually. #$basic_process = 1 perl /home/douglas/usr_scripts/newick/process_raxmlEPA_outtree.pl RAxML_labelledTree.$queryfile.raxmlEPA.G-0.5 # seqsfile is queries only, remember, different alignment software was used for different queries. # porter2014.pynast2; Gibson2014.pynast2; Yu2012.mothur; Stahlhut2013.mothur.derep; Shokralla2014_BP.mothur; invDailyQ.pynast2; raxml_classification=RAxML_classification.$queryfile.raxmlEPA.G-0.5 perl ~/usr_scripts/bagpipe_phylo.pl -treefile RAxML_labelledTree.$queryfile.raxmlEPA.G-0.5.procd -seqfile /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 -raxml_classification $raxml_classification -support 0 -node 6960 # bug in bp_phylo, QDevia_prospera_943426511 and several others are correctly placed next to reference of same species, # but bp_phylo instead parses name Oxypoda_lentula , which is the species sister to these. # run on the cluster, copy fillse: for i in ${!tax[*]}; do queryfile=${tax[$i]};echo "number:$i ggroup:$queryfile" scp $queryfile.pynast2.andrefs.phy wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/$queryfile.pynast2.andrefs.phy done # porter2014; Yu2012, Stahlhut2013, primate_diet_cytb; later, not sure which one of these:Shokralla2014_BP, queryfile=primate_diet_cytb scp $queryfile.pynast2.andrefs.phy wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/$queryfile.pynast2.andrefs.phy scp /home/douglas/scripted_analyses/COIevol/classification/RAxML_bestTree.insect8geneC wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_bestTree.insect8geneC scp /home/douglas/scripted_analyses/COIevol/docs/job_submit4 wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/job_submit4 ssh -l wei 10.0.0.181 # wei789 qsub job_submit4 # 109986.admin porter :133841.admin; yu 133842.admin; stahl:133843.admin; primatecytb:134104.admin; primaet2 :134303.admin # all raxml EPA results copied to here: cd /home/douglas/scripted_analyses/COIevol/classification/results/ scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_labelledTree* . scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_orig* . scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_classification.* . scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_classificationLikelihoodWeights* . scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_entropy* . scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_info* . scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_*jplace . # RAXML EPA v2: AND you can classify using PARSIMONY, "-f y" # worked on test queries, keeps crashing on the six datasets rm RA*raxmlEPApars for i in ${!tax[*]}; do queryfile=${tax[$i]};echo "number:$i ggroup:$queryfile" # raxmlHPC-8.1.2 -f y -m GTRCAT -t $referencetree -n $queryfile.raxmlEPApars -s $queryfile.pynast2.andrefs.phy ## $basic_process = 1 # perl /home/douglas/usr_scripts/newick/process_raxmlEPA_outtree.pl RAxML_labelledTree.$queryfile.raxmlEPApars ## seqsfile is queries only perl ~/usr_scripts/bagpipe_phylo.pl -treefile RAxML_labelledTree.$queryfile.raxmlEPApars.procd -seqfile /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 -support 0 -node -node 6960 done queryfile=primate_diet_cytb perl ~/usr_scripts/format_conversion.pl $queryfile.pynast2.andrefs $queryfile.pynast2.andrefs.phy fasta phylip raxmlHPC-8.2.4 -f y -m GTRCAT -t $referencetree -n $queryfile.raxmlEPApars -s $queryfile.pynast2.andrefs.phy # placements for printing images, need to be done on processed ultrametric tree. tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) for i in ${!tax[*]}; do queryfile=${tax[$i]};echo "number:$i gene:$gene" cat /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 >> allqueryfiles.pynast2 done reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly cat $reference_alignment allqueryfiles.pynast2 > refs_plus_all_queries perl ~/usr_scripts/format_conversion.pl refs_plus_all_queries refs_plus_all_queries.phy fasta phylip try pkace on this tree: # reftree=/home/douglas/scripted_analyses/COIevol/corrections/species_dense/insect8geneC.rerooted.treePL # fails, # reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly referencetree=/home/douglas/scripted_analyses/COIevol/classification/RAxML_bestTree.insect8geneC rm R*all_queries.raxmlEPA.G-0.05 raxmlHPC-8.1.2 -f v -G 0.12 -c 4 -t $referencetree -m GTRCAT -n all_queries.raxmlEPA.G-0.12 -s refs_plus_all_queries.phy scp /home/douglas/scripted_analyses/COIevol/classification/raxml/refs_plus_all_queries.phy wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/refs_plus_all_queries.phy scp /home/douglas/scripted_analyses/COIevol/classification/raxml/submit5 wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/submit5 ssh -l wei 10.0.0.181 # wei789 qsub submit5 # 133653.admin scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/RAxML_labelledTree* . # need to filter poor assignments ... # maybe custom software to read raxml branch labels, filter, draw coloured tree in R. #$which_rank = 3 perl ~/usr_scripts/insert_family_name_into_newick.pl RAxML_labelledTree.porter2014.raxmlEPA.G-0.1.2 key_Oct2013_Insecta java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file /home/douglas/scripted_analyses/COIevol/analysis/RAxML_result.endopD1.ft3b.BL1 java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* # RAXML EPA v3: place querys in standard search, just fixing references. # cant use -f e, since need to swap queries, cant use -F since need brnach lengths #raxmlHPC-8.1.2 -F -D -e 1.0 -c 4 -g $referencetree -m GTRCAT -p 12345 -n $queryfile.raxml_fixed_refs -s $queryfile.pynast2.andrefs.phy -o Acyrthosiphon_pisum # thorough job doesnt take much longer, 4hrs at most. rm *raxmlFixedRefsThorough raxmlHPC-8.1.2 -c 12 -g $referencetree -m GTRCAT -p 12345 -n $queryfile.raxmlFixedRefsThorough -s $queryfile.pynast2.andrefs.phy -o Acyrthosiphon_pisum # if final optimization crashes, do that step again: raxmlHPC-8.1.2 -f e -c 12 -g RAxML_result.$queryfile.raxmlFixedRefsThorough -m GTRCAT -p 12345 -n $queryfile.raxmlFixedRefsThoroughB -s $queryfile.pynast2.andrefs.phy -o Acyrthosiphon_pisum perl ~/usr_scripts/bagpipe_phylo.pl -treefile RAxML_result.invDailyQ.raxmlFixedRefsThorough -seqfile /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 -support 0 -node -node 50557 ############################################################################################################### ############################################################################################################### ********* hack FASTTREE EPA ********* # not possible to give reference tree as start tree and expect ft to as query taxa, it refuses. cd /home/douglas/scripted_analyses/COIevol/classification/fasttree_epa/ # idea: references are fully constrained, so should be branch swaps for queries only... # queries; porter2014; Gibson2014; Yu2012, Shokralla2014_BP, Stahlhut2013, invDailyQ queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDailyQ the_reference_alignment=/home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2.m_rm # JAN16: reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly referencetree=/home/douglas/scripted_analyses/COIevol/classification/RAxML_bestTree.insect8geneC queryfile=invDQ.Obtectomera tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) for i in ${!tax[*]}; do queryfile=${tax[$i]};echo "number:$i gene:$gene" cat /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 >> allqueryfiles.pynast2 done rm ftEPA_input cat $reference_alignment allqueryfiles.pynast2 > ftEPA_input rm fasttree_format_constraint # fully constrained references uses 13gb and takes a week, ignoring constraints of 2or3 taxa uses about 8gb. #$process_outfile = 0; # -outgroup Acyrthosiphon_pisum perl ~/usr_scripts/fasttree_EPA.pl -node 6960 -seqfile ftEPA_input -treefile $referencetree # then run again with $process_outfile = 1, # reduce memory demand by permitting tip-level swaps, # setting $constrain_splits_above_this_many_terminals = 10; 15 hours, 4gb perl ~/usr_scripts/fasttree_EPA.pl -node 6960 -seqfile ftEPA_input -treefile $referencetree #fasttree_2.1.7 -log fasttreeEPA_logfile -nosupport -gtr -gamma -nt -constraints fasttree_format_constraint < ftEPA_input > $queryfile.ftEPAout # re-compiled with increased buffer size, 30,000 cd /home/douglas/software/fasttree/ rm FastTree gcc -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm chmod +x FastTree cd /home/douglas/scripted_analyses/COIevol/classification/fasttree_epa/ /home/douglas/software/fasttree/FastTree -log fasttreeEPA_logfile -nosupport -gtr -gamma -nt -constraints fasttree_format_constraint.reduced < ftEPA_input > allqueries.ftEPAout #output is save in query folder mv /home/douglas/scripted_analyses/COIevol/classification/invDailyQ.ftEPAout invDailyQ.ftEPAout # maybe do ft epa on cluster: # this is fasttree with increased buffer for constraints:/home/douglas/software/fasttree/FastTree.c scp /home/douglas/software/fasttree/FastTree.c wei@10.0.0.181:/home/wei/DC/software/fasttree/FastTree.c scp fasttree_format_constraint.reduced wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/fasttree_format_constraint.reduced scp ftEPA_input wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/ftEPA_input log on the cluster, cd to the folder then gcc -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm #PBS -N DC_raxml_1 #PBS -l nodes=1:ppn=1 #PBS -q big #PBS -d /home/wei/DC/insectTreeDec2015 #PBS -j oe /home/wei/DC/software/fasttree/FastTree -log ftEPAlog -nosupport -gtr -gamma -nt -constraints fasttree_format_constraint.reduced < ftEPA_input > ftEPAout # login: ssh -l wei 10.0.0.181 # wei789 qstat -f 133055.admin # copy results to local directory scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/R*insect8geneC . # screenout : scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/DC_raxml_1.o133055 . scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/ftEPAlog . scp wei@10.0.0.181:/home/wei/DC/insectTreeDec2015/ftEPAout . # get the names: # seqsfile is queries only perl ~/usr_scripts/bagpipe_phylo.pl -treefile ftEPAout -seqfile allqueryfiles.pynast2 -support 0 -node -node 6960 ############################################################################################################### ############################################################################################################### # **** BAGPIPE DISTANCE ****** cd /home/douglas/scripted_analyses/COIevol/classification/bagpipe_distance/ # $diet_group $diet_group_homologs $needle_command $threshold # raxmlHPC $queryfile.A.raxmlEPA.G-0.1 -s $queryfile.pynast2.andrefsA.phy -o Acyrthosiphon_pisum # set file names; porter2014; Gibson2014; Yu2012, Shokralla2014_BP, Stahlhut2013, invDailyQ # queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDailyQ # reference_alignment=/home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2.m_rm # DEC2015: # benchmarks:invDQ.Heteroneura invDQ.Apocrita invDQ.Diptera invDQ.Coleoptera invDQ.Panheteroptera # benchmarksB:tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Obtectomera bd_outfile=invDQ.Obtectomera.BDout # can use aligned file ($queryfile.pynast2) or unaligned ($queryfile) # use new script: # calculate distances between queries and reference, sequences are aligned already # is painfully slow on the 50,000 insect refs, although very low memory demand, can do all at same time. rm $bd_outfile perl ~/usr_scripts/bagpipe_distance.pl -dist_prealigned -query_file $queryfile.pynast2 -subject_file $reference_alignment -out_file $bd_outfile -threshold 0.85 # get taxonomies based on those distances rm $bd_outfile.tax perl ~/usr_scripts/bagpipe_distance.pl -tax_assign -dist_file $bd_outfile.pident2 -out_file $bd_outfile.tax -threshold 0.95 -node 6960 # output file has:1)query 2)shared_taxname 3)top_score 4)top_taxon 5)top_taxonomy ############################################################################################################### ############################################################################################################### # bagpipe regular # cd /home/douglas/scripted_analyses/COIevol/classification/bagpipe/ # check whether larger query groups produce prefereble results: cd /home/douglas/scripted_analyses/COIevol/classification/bagpipe/T85/ # for Bagpipe, average neibour linkage of carabidae queries at 90% blast similarity produced 53 clusters, # roughly corresponding to genera level groups. # for each query group up to ~ 200 homologs were retrieved from for COI where >90% similar, # queries and redferencea aligned then a bootstrapped raxml tree inferred under which taxonomies were assigned. # prevous pident for getting homoogs seems set too high (92), do with 85 simplified version of the pipeline was implemented in which steps adressing difficulties for length variable markers were ommited, for example homologo retreival usualy employs both b last and gap-corrected p-distances (counting internal gaps as single differences), a step onyl required for length variable markers. 85 percent is the default, higher values generating larger numbers of clusters and singltons, lower numbers becoming more difficult to align query groups vary greatly in number of homologs returned, known problem when using a standardized thresholds, in which some cases can return to many, slowing process, and others so few as to hinder normal construction of tree and taxonomic assignments a problem unique to subset assignment, and in the current instance this lead to faliure for assigning in XYZ species accuracy in original BP was 60% (albiet for length variable plant maker in which standard thresholds inappropriate) reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly # benchmarksB:tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae # invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) queryfile=invDQ.Obtectomera cp /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 $queryfile.pynast2 # make query groups according to pairwise similarlity # rm $queryfile.AAA # blastn-2.2.28+-64bit -task blastn -out $queryfile.AAA -evalue 1e-5 -subject /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 -query /home/douglas/scripted_analyses/COIevol/classification/$queryfile.pynast2 -max_target_seqs 100000 -outfmt '6 sseqid qseqid evalue pident length' rm $queryfile.AAA.*.CLUSTERS perl ~/usr_scripts/single_linkage_1.02.pl $queryfile.AAA 85 0.5 # clustering 85%, 90%, 95% gives no clusters: Carabidae 11, 54 (3.9), 103; Staphylinidae 50, 112 (1.24), 120; # Cucujiformia 211, 300 (0.42), 334; Chironomidae 31, 109 (2.5), 133; Muscoidea 3, 30 (9), 110; # Oestroidea 2, 31 (14.5), 78; Apoidea 9, 20 (1.2), 28; Obtectomera 16, 148 (8.25), 383; # (804-333)/333=1.41, over all test datasets increasing threshold from 85 to 90 # increased number of clusters+singltons by 1.4x. # only need do following once for each reference reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly rm referencs* cp $reference_alignment referencs perl ~/usr_scripts/alignment_processing/unalign.pl referencs makeblastdb-2.2.28+-64bit -in referencs.unaligned -dbtype nucl -parse_seqids my $file_of_diet_clusters = "invDQ.Carabidae.AAA.T90.CLUSTERS"; my $fasta_file = "invDQ.Carabidae.pynast2"; my $psbAtrnH_DB = "referencs.unaligned"; my $blast_or_usearch = 1; # 1=blast, 2=usearch_6.0, 3=usearch_4 my $blast_command = "blastn-2.2.28+-64bit"; my $blastextra = "-max_target_seqs 200"; # "-num_descriptions 100000 -num_alignments 100000 "; my $usearch_command = "usearch4.2.66_i86linux32"; # usearch4.2.66_i86linux32 usearch6.0.307_i86linux32 my $usearch_id = 0.70; # if you are using usearch. it uses percent identity instead of evalue in global searches. put the percent identity value here. my $blast_parser_command = "~/usr_scripts/bagpipe/parse_uclust_searchresults.pl";# ~/usr_scripts/parse_uclust_searchresults.pl my $pident_cutoff = 90; # opens file current_diet_group, so cant run multiple instances rm $queryfile.query_group.* perl ~/usr_scripts/bagpipe/blast_diet_cluster.pl $queryfile.AAA.T90.CLUSTERS $queryfile.pynast2 referencs.unaligned $queryfile for i in {0..300};do echo "dietgroup $i combine each diet group with homologs"; cat $queryfile.query_group.$i.homologs $queryfile.query_group.$i.fas > $queryfile.query_group.$i.all_seqs mafft --retree 1 --maxiterate 0 $queryfile.query_group.$i.all_seqs > $queryfile.query_group.$i.aligned done rm RAx*.$queryfile.query_group.*.raxmlBS for i in {0..300};do echo "dietgroup $i make matrices"; perl ~/usr_scripts/dougseq.pl $queryfile.query_group.$i.aligned $queryfile.query_group.$i.aligned.phy fasta phylip # mpirun -n 4 raxmlHPC-7.2.8-ALPHA-MPI -s $queryfile.query_group.$i.aligned.phy -n $queryfile.query_group.$i.raxmlBS -m GTRCAT -c 4 -f a -x 12345 -p 12345 -# 100 raxmlHPC-8.2.4 -D -e 1.0 -s $queryfile.query_group.$i.aligned.phy -n $queryfile.query_group.$i.raxmlBS -m GTRCAT -c 4 -p 12345 done rm RAxML_bestTree.$queryfile.query_group.*.raxmlBS.query_clades for i in {0..300};do echo "dietgroup $i make matrices"; perl ~/usr_scripts/bagpipe_phylo.pl -treefile RAxML_bestTree.$queryfile.query_group.$i.raxmlBS -seqfile $queryfile.query_group.$i.fas -support 0 -node 6960 done cat RAxML_bestTree.$queryfile.query_group.*.raxmlBS.query_clades > $queryfile.BPresults # rm *.raxmlBS.query_clades RAxML_* diet_group.* ############################################################################################################### ############################################################################################################### ****** CLAIDENT ******* cd /home/douglas/scripted_analyses/COIevol/claident/ ## make taxonomy database: # file gi_taxid_nucl is massive, takes a while to download. wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz gzip -d gi_taxid_nucl.dmp.gz tar -xzf taxdump.tar.gz chmod 644 taxdump.tar.gz rm taxdump.tar.gz # Then, you need to make GenBank ID list file (gilist.txt) perl ~/scripted_analyses/COIevol/scripts/get_gi_list_from_fasta_IDs.pl /home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas GIlist_endopCOI endopCOI.pynast2.gi # Finally, Claident Taxonomy DB can be made by the following command. # this takes at least an hour, also. # go up 1 dir, db is made in the folder you were just in. cd .. clmaketaxdb --gilist=/home/douglas/scripted_analyses/COIevol/claident/taxonomy/GIlist_endopCOI claident endopCOIclaident.taxdb # takes fasta file ids like:>gi|2218171, NO GAPS ALLOWED! # makes blast db in working directory: makeblastdb-2.2.28+-64bit -in endopCOI.pynast2.gi -dbtype nucl -parse_seqids # *** starting point for new query file, *** # run in claident folder! otherwise there will be confusion.. cd /home/douglas/scripted_analyses/COIevol/claident/ # porter2014; Gibson2014; Yu2012, Shokralla2014_BP, Stahlhut2013, invDailyQ queryfile=invDailyQ # run claident blast search for neighbours clidentseq blastn -strand both -word_size 8 -perc_identity 80 end --method=NNC --minalnlennn=240 --blastdb=endopCOI.pynast2.gi --numthreads=1 /home/douglas/scripted_analyses/COIevol/classification/$queryfile $queryfile.claident_blastout # claident_blastout has one line per query, load of gi numbers tab seperated on a line. like this: #QAcanthobrahmaea_europaea_819553920 345652284 613913941 #QAchaea_janata_808353681 575524899 320554689 # there are missing gi's for some reason, these need deleting from the clseq outfile or tax assign wont run # turns out there is only one missing!#GI:675359959 perl ~/usr_scripts/process_clidentseq_blastout.pl $queryfile.claident_blastout gi_taxid_nucl.dmp # running classigntax, you might get error message of gi's. # each time, add the error GI to the array in the script above. keep doing until it runs. # --maxpopposer= varies 0.001 to 0.99, although doesnt seem to affect species level idenits. --minnsupporter=1 or more. 1 results in more liberal sp assignment # --maxpopposer=0.001 # then: classigntax --minnsupporter=1 --taxdb=/home/douglas/scripted_analyses/COIevol/endopCOIclaident.taxdb $queryfile.claident_blastout $queryfile.claident_taxout ############################################################################################################### ############################################################################################################### ****** SAP ******* cd /home/douglas/scripted_analyses/COIevol/classification/sap/ # set file names; porter2014; Gibson2014; Yu2012, Shokralla2014_BP, Stahlhut2013, invDailyQ #queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDailyQ #reference_alignment=/home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas.accession_rm2.m_rm # DEC2015: # benchmarks:invDQ.Heteroneura invDQ.Apocrita invDQ.Diptera invDQ.Coleoptera invDQ.Panheteroptera # benchmarksB:tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) # minidentity > 0.90 gave to insufficient homologs in certain cases, leading to unassigned queries even though they were in the references. queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Apocrita reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly sap_project_folder=invDQ.Apocrita.sap_project # very specific format for taxonomies in fasta ids # >DC1 ; order: Hemiptera, superfamily: Aphidoidea, family: Aphididae, subfamily: Aphidinae, tribe: Macrosiphini, genus: Acyrthosiphon, species: Acyrthosiphon pisum ; Acyrthosiphon pisum #--assignment ConstrainedNJ or Barcoder. rm db_sap.fasta # $filter_level = 2;$process_backbone_tree_terminal_IDs = 0; $truncate_new_taxonomic_strings = 0; # $fasta_ID_format = 1; # @ranksarray = (" order", " suborder", " superfamily", " family", " subfamily"," tribe", " genus", " subgenus", " species group"); # only need do this once for reference db perl ~/usr_scripts/species_to_complete_taxonomies.pl -seqfile $reference_alignment -output ref.sap_format -node 6960 -fasta # too slow on diptera (and thus coleop), change settings to speed up sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile # why is sap not aligning and tree building for these wqhic are in references:QScaphinotus_marginatus_943421078 # beacuse the conspecifics and congeners are about 90% identical (probably a species misidentifcation) # although this means 95% is way too high. CURRENT: queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Carabidae sap_project_folder=invDQ.Carabidae.sap_project sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Staphylinidae sap_project_folder=invDQ.Staphylinidae.sap_project sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Cucujiformia sap_project_folder=invDQ.Cucujiformia.sap_project sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Chironomidae sap_project_folder=invDQ.Chironomidae.sap_project sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Muscoidea sap_project_folder=invDQ.Muscoidea.sap_project sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Oestroidea sap_project_folder=invDQ.Oestroidea.sap_project sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Apoidea sap_project_folder=invDQ.Apoidea.sap_project sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Obtectomera sap_project_folder=invDQ.Obtectomera.sap_project sap --maxblasthits 24 --alignmentlimit 24 --minidentity 0.85 --assignment ConstrainedNJ --database db_sap.fasta --project $sap_project_folder $queryfile ############################################################################################################### **** BLAST TOP HIT **** # staphys: # QAleochara_verna_943425590 vs Aleochara_verna # labeled same species but ~7% different # groups with Aleochara_brundini in the tree methods, which the query is slightly more similar to # QAleochara_verna_943425590 Aleochara_bipustulata 0.0 91.48 540 1 539 1 1 # QAleochara_verna_943425590 Aleochara_brundini 0.0 92.66 545 1 545 1 1 # QAleochara_verna_943425590 Aleochara_curtula 0.0 90.11 546 1 546 1 1 # QAleochara_verna_943425590 Aleochara_sekanai 0.0 90.11 546 1 546 1 1 # QAleochara_verna_943425590 Aleochara_verna 0.0 91.48 540 1 540 1 1 # QAleochara_verna_943425590 Hydrosmecta_gracilicornis 0.0 91.03 546 # so blast cutoff needs to be much lower! # other errors by tree-heuristic where blast is fine, # QDevia_prospera_943426511 somehow misses reference Devia_prospera, # instead Oxypoda_lentula with high significance , despite only ~90% similiarity # which seems to be a bug ... java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file /home/douglas/scripted_analyses/COIevol/classification/raxml/RAxML_labelledTree.invDQ.Staphylinidae.raxmlEPA.G-0.24.procd cd /home/douglas/scripted_analyses/COIevol/classification/blast/ # porter2014; Gibson2014; Yu2012, Shokralla2014_BP, Stahlhut2013, invDailyQ #refDB=endopCOI.pynast2.fas.accession_rm2.m_rm #queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDailyQ # DEC2015: # benchmarks:invDQ.Heteroneura invDQ.Apocrita invDQ.Diptera invDQ.Coleoptera invDQ.Panheteroptera # benchmarksB:tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) queryfile=/home/douglas/scripted_analyses/COIevol/classification/invDQ.Staphylinidae reference_alignment=/home/douglas/scripted_analyses/COIevol/classification/insect8geneB.COIonly results=invDQ.Heteroneura rm refDB* cp $reference_alignment refDB # makeblastdb doesnt like this fasta file, so use blast directly # heteroneura took 1.5 hrs rm $results.blastout blastn-2.2.28+-64bit -perc_identity 96 -task blastn -subject $reference_alignment -query $queryfile -out $results.blastout -word_size 12 -dust no -strand both -evalue 1e-10 -num_threads 1 -max_target_seqs 12 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' rm *.blastout2 # or loop tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) for i in ${!tax[*]}; do queryfile=${tax[$i]};echo "number:$i ggroup:$queryfile" blastn-2.2.28+-64bit -perc_identity 95 -task blastn -subject $reference_alignment -query /home/douglas/scripted_analyses/COIevol/classification/$queryfile -out $queryfile.blastout2 -word_size 12 -dust no -strand both -evalue 1e-10 -num_threads 1 -max_target_seqs 12 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' done ############################################################################################################### # is it possible to make a new perl script # to read ref tree, read ref seqs, read queries, and infer position on tree # get characters diagnostic for each node on the tree, then string search each query. ############################################################################################################### # get data for benchmarking from daily releases. cd /home/douglas/scripted_analyses/COIevol/benchmark/ # most the following is from /home/douglas/scripted_analyses/genomeMOTU/docs/genomeMOTU_NOTES, # used for tree: October 15 2015; NCBI-GenBank Flat File Release 210.0 # you need to adapt the regex for the dates you want, # for example used for tree oct 15, so you dont need dailys before oct15, so regex nc1\d+ to exclude rm nc*gz perl ~/scripted_analyses/genomeMOTU/scripts/download_dailys.pl # correct running requires these settings: # my $output_fasta_db_name = "invD"; # my @genbank_divisions = "nc0", $print_genome_taxa = 0, unhash:unless($in_hex == 1){$genome=1} # and unhash if(length($current_dna)<=150){$genome=1} # output is inv.daily rm invD # $limit_taxon =1;$limit_taxon_name = "Insecta"; # $output_ID_format = 2 (since these are very new, taxon IDs might not be in tax DB), perl ~/usr_scripts/create_fasta_database_from_genbank_flatfiles.pl cp /home/douglas/scripted_analyses/COIevol/corrections/species_dense/key_Oct2013_Insecta key_Oct2013_Insecta # settings, $id_format = 3 # rm invD.parse* # perl ~/usr_scripts/parse_taxon_from_fastafile.pl invD key_Oct2013_Insecta # process database, settings, 200-32,000; perl ~/usr_scripts/preprocess_fasta_database.pl invD coi_refs=/home/douglas/scripted_analyses/COIevol/corrections/species_dense/insectCOI.pynast2.reducd # /home/douglas/scripted_analyses/COIevol/analysis/endopCOI.pynast2.fas perl ~/usr_scripts/remove_overlapping_sequences.pl invD.ng.rr $coi_refs /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ # reduce memory later, get COI out now... rm invD.ng.rr.rm.* makeblastdb-2.2.28+-64bit -in invD.ng.rr.rm -dbtype nucl -parse_seqids -max_file_sz 60000000 rm invDQ.blastout queryfile=/home/douglas/scripted_analyses/COIevol/corrections/species_dense/H03InsProf.muscle.fas blastn-2.2.28+-64bit -task blastn -db invD.ng.rr.rm -query $queryfile -out invDQ.blastout -word_size 10 -perc_identity 80 -dust no -strand both -evalue 1e-10 -num_threads 1 -max_target_seqs 200000 -outfmt '6 qseqid sseqid evalue pident length sstart send qframe sframe' # $dbtype = "nucl" perl ~/usr_scripts/parse_hits.pl invD.ng.rr.rm invDQ.blastout 60 1 400 2 blastdbcmd-2.2.28+-64bit # second go at making these test datasets. # for computational practicality you dont want more than a couple of thousand in each test datset # second thing, at this stage none binomails should be removed, they complicate testing later, # finally distributino is very skewed, # those particular species with such dispropropotionate data, sample somthing reasonble (10-20) rm invDQ.blastout.retreived.binoms perl ~/usr_scripts/remove_non_binomials_from_fasta.pl invDQ.blastout.retreived rm invDQ.blastout.retreived.binoms.normalized perl ~/usr_scripts/alignment_processing/sample_x_sequences_per_species.pl invDQ.blastout.retreived.binoms invDQ.blastout.retreived.binoms.normalized perl /home/douglas/usr_scripts/taxonomic_report.pl -seqfile invDQ.blastout.retreived.binoms.normalized -output invDQ.blastout.retreived.binoms.normalized.tax_report -node 6960 -fasta # paste output into spreadsheet # no_rank:Panheteroptera:132 (hemiptera) / order:Coleoptera:3533 / order:Diptera:13556 # /no_rank:Apocrita:133 / parvorder:Heteroneura:1054 (leps) # family:Carabidae:312; family:Staphylinidae:380; infraorder:Cucujiformia:878 # dips:family:Chironomidae:718; superfamily:Muscoidea:782; superfamily:Oestroidea:441 # superfamily:Apoidea:73; no_rank:Obtectomera:582 # taxa have been selected, all reasonble size, get them from the filtered benchamrk database tax=(Carabidae Staphylinidae Cucujiformia Chironomidae Muscoidea Oestroidea Apoidea Obtectomera) for i in ${!tax[*]}; do gene=${tax[$i]};echo "number:$i gene:$gene" perl ~/usr_scripts/select_taxon_members_from_fasta.pl invDQ.blastout.retreived.binoms.normalized $gene 6960 done # previous format for query ids into the pipeline:QAedes_aegypti_806476823 CGTGCT # so, manually put >Q in each query file, and gi numbers can be retained. # rename, then copy to taxonomic assignment folder # OLD:invDQ.Heteroneura invDQ.Apocrita invDQ.Diptera invDQ.Coleoptera invDQ.Panheteroptera tax=(invDQ.Carabidae invDQ.Staphylinidae invDQ.Cucujiformia invDQ.Chironomidae invDQ.Muscoidea invDQ.Oestroidea invDQ.Apoidea invDQ.Obtectomera) # copy to classification folder cd /home/douglas/scripted_analyses/COIevol/benchmark/ ############################################################################################################### ############################################################################################################### # TREE DRAWING #cd /home/douglas/scripted_analyses/COIevol/classification/ cd /home/douglas/scripted_analyses/COIevol/corrections/species_dense/ cd /home/douglas/scripted_analyses/COIevol/classification/results/ # set file name # porter2014; Yu2012; Shokralla2014_BP; Stahlhut2013; all_queries; primate_diet_cytb epa_results=Stahlhut2013 # confusing, very likely placment, although completely wrong and ver long pendant length. # thus need to filter according to both LWR and pendent length # gi|342776645|gb|JN286106.1|_Tersilochinae_sp._BOLDAAG2854_voucher_BIOUG09PROBE-A0193_ 8249 -7267051.560659 1.000000 0.802002 3.565791 # also need to get likelihood ratios : perl ~/usr_scripts/parse_raxml_jplace.pl RAxML_portableTree.$epa_results.raxmlEPA.G-0.12.jplace 0.999998 # run bagpipe phylo to get taxonomic groups to which queries have been assigned # $basic_process = 1; perl /home/douglas/usr_scripts/newick/process_raxmlEPA_outtree.pl RAxML_labelledTree.$epa_results.raxmlEPA.G-0.12 # cp /home/douglas/scripted_analyses/COIevol/classification/raxml/allqueryfiles.pynast2 all_queries.pynast2 cp /home/douglas/scripted_analyses/COIevol/classification/Stahlhut2013.mothur Stahlhut2013.pynast2 cp /home/douglas/scripted_analyses/COIevol/classification/primate_diet_cytb.recoded primate_diet_cytb.pynast2 perl ~/usr_scripts/bagpipe_phylo.pl -treefile RAxML_labelledTree.$epa_results.raxmlEPA.G-0.12.procd -seqfile $epa_results.pynast2 -raxml_jplace_parsed RAxML_portableTree.$epa_results.raxmlEPA.G-0.12.jplace.processed -support 0 -node 6960 # yu: perl ~/usr_scripts/bagpipe_phylo.pl -treefile RAxML_labelledTree.Yu2012.raxmlEPA.G-0.12.procd -seqfile yu.fas -raxml_jplace_parsed RAxML_portableTree.Yu2012.raxmlEPA.G-0.12.jplace.processed -support 0 -node 6960 # note, rax epa gives its I\d+ node labels to tips also, bear in mind when parsing results. # switch bl with node lable, so it can be rerooted. # also replace tip IDs with raxml tip labels, make key file for this. # $basic_process = 0;$terminal_format = 2 rm RAxML_originalLabelledTree.$epa_results.2 RAxML_originalLabelledTree.$epa_results.3 perl ~/usr_scripts/newick/process_raxmlEPA_outtree.pl RAxML_originalLabelledTree.$epa_results # outgroup is Acyrthosiphon_pisum, usually I16 (porter, gibson, yu, shokralla), check previous keyfile for this (XXX.3) # input is EPA tree output with no queries, needs rerooting # perl ~/usr_scripts/newick/root_with_bioperl.pl RAxML_originalLabelledTree.$epa_results.2 I16 # jan16, Acerentomon_microrhinus perl ~/usr_scripts/newick/root_with_bioperl.pl insect8geneC.treePL Acerentomon_microrhinus # need to redefine node IDs for major clades, with each new tree... # get node labels from EPA reference tree output, but dont use the tree, # also read in ultrametric tree, that one will be used for plotting. # rlt= raxml labelled tree; kf = keyfile; rc=raxml classification; pt=plotting tree # OLD endop: # perl ~/usr_scripts/read_newick_and_plot_tree.pl -rlt RAxML_originalLabelledTree.$epa_results.2.rerooted_at.I16 \ # -kf RAxML_originalLabelledTree.$epa_results.3 \ # -rc RAxML_classification.$epa_results \ # -pt /home/douglas/scripted_analyses/COIevol/analysis/RAxML_result.endopD1.ft3.BL1.treePL.b \ # -of $epa_results.png -rf $epa_results.r_commands # strategy, ignore previous setup, rectangualr tree, instead plot the circulr sp level tree, # and use the tip plotting function to plot hits. # porter2014; Yu2012; Shokralla2014_BP; Stahlhut2013; all_queries epa_results=porter2014 # perl /home/douglas/scripted_analyses/COIevol/scripts/colors_for_tree_labels.pl # JAN16: # $high_definition = 0 + 1; $use_keyfile = 0;$dataset = "STAHLHUT";# BENCHMARK; YU; PORTER; STAHLHUT perl ~/usr_scripts/read_newick_and_plot_tree.pl -plain -rlt null -kf NULL -rc null \ -pt insect8geneC.rerooted.treePL -qc RAxML_labelledTree.$epa_results.raxmlEPA.G-0.12.procd.query_clades \ -of species_level_tree_$epa_results.png -rf species_level_tree_JAN16.r_commands perl ~/usr_scripts/newick/root_with_bioperl.pl java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file RAxML_originalLabelledTree.porter2014TEST.raxmlEPA.G-0.02 # OLD .... plot species level tree, no queries. # perl ~/usr_scripts/read_newick_and_plot_tree.pl -plain -rlt null -kf RAxML_originalLabelledTree.$epa_results.3 -rc null \ # -pt /home/douglas/scripted_analyses/COIevol/analysis/RAxML_result.endopD1.ft3.BL1.treePL.b \ # -of species_level_tree3.png -rf species_level_tree3.r_commands # *** JAN16: # insect8geneC.treePL insect8geneC.rerooted.treePL # $high_definition = 0 + 1; $use_keyfile = 0; perl ~/usr_scripts/read_newick_and_plot_tree.pl -plain -rlt null -kf NULL -rc null \ -pt insect8geneC.rerooted.treePL \ -of species_level_tree_JUL16.png -rf species_level_tree_JUL16.r_commands # before that even. why dont any of these seem oriented as the nuclear tree?? perl ~/usr_scripts/read_newick_and_plot_tree.pl /home/douglas/scripted_analyses/COIevol/analysis/RAxML_result.endop3geneD1.ft # regular referece tree: java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file RAxML_result.endopD1.ft3 perl ~/scripted_analyses/COIevol/scripts/draw_mtgenome_tree.pl ~/scripted_analyses/COIevol/analysis/RAxML_bestTree.mtprot_constr.6_tax_rm.pruned java -Xms64m -Xmx512m -jar /home/douglas/software/figtree/FigTree_v1.4.1/lib/figtree.jar $* # list of collapse nodes: 22, 17, 8, 322, 326, 329, 330, 334, 337, 381, 350, 349, 357, 368, 366, 46, 44, 40, 56, 104, 98, 57, 58, 80, 62, 71, 124, 127, 314, 144, 129, 160, 162, 182, 177, 179, 174, 170, 301, 282, 197, 198, 204, 272 ############################################################################################################### # compare results , taxonomic assignment vis blast vs bagpipe vs SAP etc... cd /home/douglas/scripted_analyses/COIevol/classification/ # needs to be done mroe carefully, # make sure only to compare taxa that can be inferred using each method, # for example if a lineage is not parsed from a taxonomy db, is is not possible to find it with some methods, # even if its in the reference tree / alingmnet. some changes to the bagpipe software might be required/ # also consider differences between levels of reference presence/absense for test quesries ... # for query species absent in references, distance methods should perform more poorly, # thus test datasets with more query species absent, tree based apporaches should outperform. # also make sure you do actually have query test datasets with significant number of absent species! # this was all realised in early runs, where blast appeared twice as good as anything else, # was an artifact of the way taxa were (or were not) parsed in the other more compex methods. # epa placement errors: # /home/douglas/scripted_analyses/COIevol/corrections/figures/epa_placement_error.png # sap: valuesS=(0 20 40 60 70 80 90 92 94 96 98 100) # raxml: valuesR=(0.04 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.30 0.40 0.60 1.00) # raxmp: # values=(0.00001 0.0001 0.001 0.005 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.2 0.3 0.4 0.5 1.0) # bp_dist: # values=(0.90 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00 1.00) # blast: valuesB=(95 95.5 96 96.5 97 97.5 98 98.5 99 99.5 100 100) # old bagpipe valuesOBP=(0.06 0.08 0.09 0.10 0.11 0.12 0.14 0.18 0.20 0.30 0.40 0.60) # valuesFT=(0.08 0.09 0.10 0.11 0.12 0.14 0.18 0.20 0.24 0.30 0.40 0.60 1.00) # 1:sap_probability 2:raxml_EPA_classification_score 3:bpdist_similarity 4:blast_similarity # 70 0.05 0.02 0.96 96 for i in ${!valuesR[*]}; do current=${valuesR[$i]}; current2=${valuesB[$i]};currentSAP=${valuesS[$i]}; currentOBP=${valuesOBP[$i]};currentFT=${valuesFT[$i]}; echo "number:$i current_value:$current" perl ~/scripted_analyses/COIevol/scripts/analyse_assignments.pl $currentSAP $current 0.02 0 $current2 $currentOBP $currentFT done # 1 2 3 4 5 6 7 # $sap $raxml_EPA $raxmp $bpdist $blast $OLDbagpipe fasttree # carab: perl ~/scripted_analyses/COIevol/scripts/analyse_assignments.pl 96 0.18 0.1 0.1 0.1 0.12 0.12 # Cucujiformia perl ~/scripted_analyses/COIevol/scripts/analyse_assignments.pl 80 0.30 0.1 0.1 0.1 0.18 0.24 # Carabidae out of 312 queries, 228 are of species found in refs # RAXML:0.18 acc:0.705128; ft:0.12 acc:0.8205128; SAP_PR:96 acc:0.6826923; oldBP85:0.12 acc:0.82051; oldBP90_val:0.10 acc:0.830128 # Staphylinidae out of 380 queries, 330 are of species found in refs # RAXML_score:0.30 acc:0.91315; ft_cutoff:0.20 acc:0.8921052; SAP_PR:100 acc:0.42631; oldBP85_val:0.20 acc:0.8736842; oldBP90_val:0.14 acc:0.86052631 # Cucujiformia out of 878 queries, 646 are of species found in refs # RAXML_score:0.40 acc:0.89749; ft_cutoff:0.24 acc:0.884965; AP_PR:80 acc:0.4612756; olbp85 acc:0.7551252; oldbp90acc:0.746013 # Chironomidae out of 718 queries, 628 are of species found in refs # RAXML_score:0.30 acc:0.8119; ft_cutoff:0.30 acc:0.80083; SAP_PR:100 acc:0.448; oldBP85_val:0.30 acc:0.763231; oldBP90_val:0.30 acc:0.76323 # Muscoidea out of 782 queries, 745 are of species found in refs # RAXML_score:0.16 acc:0.89641; ft_cutoff:0.18 acc:0.909207; SAP_PR:60 acc:0.517902; oldBP85_val:0.18 acc:0.8976982; oldBP90_val:0.10 acc:0.8989 # Oestroidea out of 441 queries, 357 are of species found in refs # RAXML_score:0.12 acc:0.739229; ft_cutoff:0.11 acc:0.820861; SAP_PR:80 acc:0.5532; oldBP85_val:0.14 acc:0.8004535; oldBP90_val:0.12 acc:0.795918 # Apoidea out of 73 queries, 62 are of species found in refs # RAXML_score:0.12 acc:0.767123; ft_cutoff:0.18 acc:0.821917; SAP_PR:60 acc:0.80821917; oldBP85_val:0.18 acc:0.76712; oldBP90_val:0.14 acc:0.6438356 # Obtectomera out of 582 queries, 461 are of species found in refs # RAXML_score:0.10 acc:0.70618; ft_cutoff:0.11 acc:0.76116; SAP_PR:70 acc:0.75429; oldBP85_val:0.14 acc:0.7560137; oldBP90_val:0.11 acc:0.6683848 1 2 3 4 5 6 7 8 9 10 11 12 taxon_name queries in_refs raxml r_score ft f_score sap a_score bp85 b8score bp90 b9score cd /home/douglas/scripted_analyses/COIevol/corrections/results/ R t1<-read.table("assignment_benchmark_results", header=T, row.names=1) # wilcox.test(t1[,10] , t1[,12] , paired=TRUE) queries<-t1[,1];fasttree<-t1[,6]; raxml<-t1[,4]; bp85<-t1[,10]; bp90<-t1[,12]; sap<-t1[,8]; wilcox.test(fasttree , raxml, paired=TRUE) wilcox.test(fasttree , bp85, paired=TRUE) wilcox.test(fasttree , bp90, paired=TRUE) wilcox.test(fasttree , sap, paired=TRUE) wilcox.test(raxml , bp85, paired=TRUE) wilcox.test(raxml , bp90, paired=TRUE) wilcox.test(raxml , sap, paired=TRUE) wilcox.test(bp85 , bp90, paired=TRUE) wilcox.test(bp85 , sap, paired=TRUE) # weighted: wilcox.test(fasttree*queries , raxml*queries, paired=TRUE) wilcox.test(fasttree*queries , bp85*queries, paired=TRUE) XXXXXXwilcox.test(fasttree*queries , bp90*queries, paired=TRUE) wilcox.test(fasttree*queries , sap*queries, paired=TRUE) wilcox.test(raxml*queries , bp85*queries, paired=TRUE) XXXXXXwilcox.test(raxml*queries , bp90*queries, paired=TRUE) wilcox.test(raxml*queries , sap*queries, paired=TRUE) XXXXXXXXXwilcox.test(bp85*queries , bp90*queries, paired=TRUE) wilcox.test(bp85*queries , sap*queries, paired=TRUE) 0.0083 # kruskal.test(list(raxml, fasttree , sap , bp85) , paired=TRUE) sttree * queries and bp85 * queries V = 35, p-value = 0.01563 ata: fasttree * queries and sap * queries V = 36, p-value = 0.007813 bp85 * queries and sap * queries V = 34, p-value = 0.02344 generally, species assignments were more accurate with methods conducted in the context of a more complete phylogeny, singnificant improvemnts were observed between assignment on the whole tree versus individually for each query, (p=0.007813, weighted Wilcoxon signed rank test). additionally, where conducted on the whole tree versus grouped queries (fasttree vs bagpipe, p-value = 0.01563, ) or with grouped queries versus individual (bagpipe versus sap, p=0.02344) were indicative (a strict Bonferroni correction cutoff being 0.0083). Carabidae sap:61 oldBP:55; Staphylinidae Cucujiformia sap:204 oldBP:210 check_file=/home/douglas/scripted_analyses/COIevol/classification/raxml/RAxML_labelledTree.invDQ.Apoidea.raxmlEPA.G-0.5.procd java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file $check_file run bp_phylo on each of 6 result trees, it will give list of taxa. parse that list and summarize taxa, in conjuction with printing tree. descibe features, with mention of those descriptions in original publication. read_mewick and plot tree is a very complex script, best not to ask it to do even more, so via bagpipe phylo here ### note bagpipe plhylo prints to sceen list of counts for eaxh taxa, although needs work, e.g. doesnt give species count # porter2014; Gibson2014; Yu2012, Shokralla2014_BP, Stahlhut2013, invDailyQ queries=invDailyQ # gonna be awkward to rewite to work on jplace files due to BPp using regualr trees without node labels, # for now, just do the other way # seqfile is queries only perl ~/usr_scripts/bagpipe_phylo.pl -treefile invDailyQ.ftEPAout -seqfile /home/douglas/scripted_analyses/COIevol/classification/invDailyQ.pynast2 -support 0 -node -node 50557 ############################################################################################################### # phylocom # get phylogenetic div from the 3 samples in YU2012, and one sample of porter # recalculate from tree of all 4 samples, and from seperate trees of queries. # might wait till i get song mountain data and write a new paper on this cd /home/douglas/scripted_analyses/COIevol/phylocom/ sample abundence spcode E 17 OTU108 E 6 OTU109 E 1 OTU110 E ############################################################################################################### supplement: cp /home/douglas/scripted_analyses/COIevol/classification/RAxML_bestTree.insect8geneC /home/douglas/scripted_analyses/COIevol/corrections/supplement/species_level_tree.nwk cp ~/scripted_analyses/COIevol/corrections/transcriptomes/insectaNUCL.smatrix8 /home/douglas/scripted_analyses/COIevol/corrections/supplement/transcriptomes_supermatrix cp ~/scripted_analyses/COIevol/corrections/mtgenomes/InMT.smatrix /home/douglas/scripted_analyses/COIevol/corrections/supplement/mitogenome_supermatrix cp /home/douglas/scripted_analyses/COIevol/corrections/species_dense/insect8geneB /home/douglas/scripted_analyses/COIevol/corrections/supplement/specieslevel_supermatrix cd /home/douglas/scripted_analyses/COIevol/corrections/supplement/ perl ~/usr_scripts/format_conversion.pl transcriptomes_supermatrix transcriptomes_supermatrix.nex fasta nexus_concise perl ~/usr_scripts/format_conversion.pl mitogenome_supermatrix mitogenome_supermatrix.nex fasta nexus_concise perl ~/usr_scripts/format_conversion.pl specieslevel_supermatrix specieslevel_supermatrix.nex fasta nexus_concise gzip specieslevel_supermatrix.nex gzip transcriptomes_supermatrix.nex gzip mitogenome_supermatrix.nex ############################################################################################################### UNUSED: # before alignment, INSERT A SINGLE HEMIPTERA A. PISUM OUTGROUP for each gene # put them into files invCOI.blastout.retreived.ID_filtered, cytb..., 28S... COI: >Hemiptera_Acyrthosiphon_pisum >Acyrthosiphon_pisum_12345678 ATTTGATCAGGTATAATTGGATCTTCACTTAGAATTCTAATTCGTTTAGAATTAAGTCAAATTAATTCTATTATTAACAATAATCAATTATATAATGTAATTGTTACAATTCATGCTTTTATTATAATTTTTTTTATAACTATACCAATTGTAATTGGTGGATTTGGAAATTGATTAATTCCTATAATAATAGGATGTCCTGATATATCATTTCCTCGCTTAAATAATATTAGATTTTGATTATTACCTCCTTCATTAATAATAATAATTTGCAGTTTCTTAATTAATAATGGAACAGGAACAGGATGAACTATTTATCCACCTTTATCAAATAATATTGCACATAATAACATTTCAGTTGATTTAACTATTTTTTCTTTACACCTAGCAGGAATTTCATCAATTTTAGGAGCAATTAATTTTATTTGTACAATTCTTAATATAATACCTAATAACATAAAATTAAATCAAATTCCACTTTTCCCTTGATCAATTTTAATTACAGCTATCTTATTAATTTTATCTTTACCAGTTTTAGCTGGTGCTATTACAATATTATTAACTGATCGAAACTTAAATACATCATTTTTTGATCCAGCAGGAGGAGGAGATCCTATTTTA 28S: what you have in the insect profile: >hemiptera_gb|EU414724.1|:1-391 Cercopis vulnerata 28S ribosomal RNA gene, partial sequence tagtagctgcgagcgaaaggggaagagcccagcaccgaatcccgcgggccatgcccgcagggaaatgtggtgttcaggaggacccgattatcccggcgcccgtcag cgcgttcaagtccatcttgaatggggccactgcccgtagagggtgccaggcccgtagtgaccgcaggcgacgccgggtggggtctctccttagagtcgggttgcttgagagtgca gccctaagtgggtggtaaactccatctaaggctaaatacgaccacgagaccgatagcgaacaagtaccgtgagggaaagttgaaaagaactttgaagagagagttcaagagtacgtgaaa ccgttcaggggtaagcggaaaagacccgaaagttcgaacggggagattca OR a short A pisum which is fragment that matches that above >gi|299883962:953-1186 Acyrthosiphon pisum mRNA, clone: ACI0AAF41YH12, full-insert cDNA sequence based on ESTs (5'-EST:ACI0AAF41YH12AAM1, 3'-EST ACI2AAF6YJ10BBM1) >Acyrthosiphon_pisum_299883962 TCCGTTTACAACCGAGCGGTTTCACGTTCTTATGAACTCTCTCTTCAAAGTTCTTTTCAACTTTCCCTCACGGTACTTGTAAACTATCGGTCTCGTGGCCGTATTTAGCCTTAGATGGAGTTTACCACCCGCTTCGGGCT GCACTCTCAAGCAACCCGACTCGAAGGCGCGGTCCGATCCCGGAACGCTGTGACGGCCTCTACTGGCCTGGCACCATCTGCGGGCCATGGCCCC CYTB, something i found by blasting one of the primate diet seqs against a pisum: >gb|FJ411411.1|:10700-11100 Acyrthosiphon pisum strain 5A mitochondrion, complete genome >Acyrthosiphon_pisum_98765432 GAGGTGCTACAGTTATTACAAATCTTCTTTCAGCAATTCCATATCTAGGAAATTCAATTGTTATTTGAATTTGAGGAGGATTTTCAATTAATAATGCTACATTAACACGATTTTTCTCAATTCATTTTATTTTACCTTTT ATTATTATCTTATTTACTTTTATTCATCTTTTTTTTTTACATTTAACAGGTTCTAATAACCCATTAGGTATTAATAGAAACTTTGATAAAATTACATTTTCACCTTATTTTTTAATTAAAGATTTAATTGGATTAATTAT ATTTATATGAATATTTTTTATTTTAACTTTAATTTTTCCTTATTTATTAAATGATCATAATAATTTTATTATAGCAAACTCAATAATTACACCTAATCATATTCAACCAGAATGATATTTT # heres mothurs entry in the alignment hall of shame: >gi|339506613|gb|JF961096.1| .............ATTACTAATCTTGTGTCAGCTGTTCCATATATAGGTAAATTTTTAGTAGAGTGACTATGAGGAGGTTTTTCCATTAGAAACTCCACTCTAAACCGATTCTTTTCCTTACATTTTTTAATACCATTTTTACTGGTTTTTATAGTTATTATTCATATTTTATCATTACATACAACAGGATCAAGAAAACCAATTAATAACAACCCTAACAGAGATAAAATCCCATTTCATCCCTTTTTCTCTTTAAAAGATGTACCACCTATCCTGATAATTATACTAGTAATACTATTAATAAGAAACATTAG >gi|339506611|gb|JF961095.1| GAGGAGCAACTGTAATTACTAATCTTGTGTCAGCTGTTCCATATATAGGTAAATTTTTAGTAGAGTGACTATGAGGAGGTTTTTCCATTAGAAACTCCACTCTAAACCGATTCTTTTCCCTACATTTTTTAATACCATTTTTACTGGTTTTTATAGTTATTATTCATATTTTATCATTACATACAACAGGATCAAGAAAACCAATTAATAACAACCCTAACAGAGATAAAATCCCATTTCATCCCTTTTTCTCTTTAAAAGATGTACCACCTATCCTGATAATTATACTAGTAATACTATTAATAAGAAACATTAGCCC >gi|339506609|gb|JF961094.1| GAGGAGCAACTGTAATTACTAATCTTGTGTCAGCTGTTCCATATATGGGTAAATTTTTAGTAGAGTGACTATGAGGAGGTTTTTCCATTAGAAACTCCACTCTAAACCGATTCTTTTCCTTACATTTTTTAATACCATTTTTACTGGTTTTTATAGTTATTATTCATATTTTATCATTACATACAACAGGATCAAGAAAACCAATTAATAACAACCCTAACAGAGATAAAATCCCATTTCATCCCTTTTTCTCTTTAAAAGATGTACCACCTATCCTGATAATTATACTAGTAATACTATTAATAAGAAACATTAGCCC >gi|339506607|gb|JF961093.1| .............ATTACTAATCTTGTGTCAGCTGTTCCATATATAGGTAAATTTTTAGTAGAGTGACTATGAGGAGGTTTTTCCATTAGAAACTCCACTCTAAACCGATTCTTTTCCTTACATTTTTTAATACCATTTTTACTGGTTTTTATAGTTATTATTCATATTTTATCATTACATACAACAGGATCAAGAAAACCAATTAATAACAACCCTAACAGAGATAAAATCCCATTTCATCCCTTTTTCTCTTTAAAAGATGTACCACCTATCCTGATAATTATACTAGTAATACTATTAATAAGAAACATTAG >gi|339506605|gb|JF961092.1| .............ATTACTAATCTTGTGTCAGCTGTTCCATATATAGGTAAATTTTTAGTAGAGTGACTATGAGGAGGTTTTTCCATTAGAAACTCCACTCTAAACCGATTCTTTTCCTTACATTTTTTAATACCATTTTTACTGGTTTTTATAGTTATTATTCATATTTTATCATTACATACAACAGGATCAAGAAAACCAATTAATAACAACCCTAACAGAGATAAAATCCCATTTCATCCCTTTTTCTCTTTAAAAGATGTACCACCTATCCTGATAATTATACTAGTAATACTATTAATAAGAAACATTAG mt_matrix=/home/douglas/scripted_analyses/COIevol/corrections/mtgenomes/InMT.smatrix perl /home/douglas/usr_scripts/taxonomic_report.pl -seqfile $mt_matrix -output mtmat.tax_report -node 6960 -fasta