The advent of next-generation sequencing technology has allowed for the collection of large portions of the genome for phylogenetic analysis. Hybrid enrichment and transcriptomics are two techniques that leverage next-generation sequencing and have shown much promise. However, methods for processing hybrid enrichment data are still limited. We developed a pipeline for anchored hybrid enrichment (AHE) read assembly, orthology determination, contamination screening, and data processing for sequences flanking the target “probe” region. We apply this approach to study the phylogeny of butterflies and moths (Lepidoptera), a megadiverse group of more than 157,000 described species with poorly understood deep-level phylogenetic relationships. We introduce a new, 855 locus anchored hybrid enrichment kit for Lepidoptera phylogenetics and compare resulting trees to those from transcriptomes. The enrichment kit was designed from existing genomes, transcriptomes and expressed sequence tag (EST) data and was used to capture sequence data from 54 species from 23 lepidopteran families. Phylogenies estimated from AHE data were largely congruent with trees generated from transcriptomes, with strong support for relationships at all but the deepest taxonomic levels. We combine AHE and transcriptomic data to generate a new Lepidoptera phylogeny, representing 76 exemplar species in 42 families. The tree provides robust support for many relationships, including those among the seven butterfly families. The addition of AHE data to an existing transcriptomic dataset lowers node support along the Lepidoptera backbone, but firmly places taxa with AHE data on the phylogeny. To examine the efficacy of AHE at different taxonomic levels, phylogenetic analyses were also conducted on a sister group representing a more recent divergence, the Saturniidae and Sphingidae. 
These analyses utilized sequences from the probe region and data flanking it, nearly doubling the size of the dataset; all resulting trees were well supported. We hope that our data processing pipeline, hybrid enrichment gene set, and approach of combining AHE data with transcriptomes will be useful for the broader systematics community.
README
README file listing the files and scripts contained in this Dryad package
Breinholt_et_al_Supplementary_Figure_S1
Supplementary Figure S1 from Breinholt et al. (2017)
Breinholt_et_al_Supplementary_Figure_S2
Supplementary Figure S2 from Breinholt et al. (2017)
Breinholt_et_al_Supplementary_Figure_S3
Supplementary Figure S3 from Breinholt et al. (2017)
Breinholt_et_al_Supplementary_Figure_S4
Supplementary Figure S4 from Breinholt et al. (2017)
Breinholt_et_al_Supplementary_Figure_S5
Supplementary Figure S5 from Breinholt et al. (2017)
Breinholt_et_al_Supplementary_File_1__S1-S11
Supplementary File 1: Microsoft Excel document containing Supplementary Tables S1-S11 from Breinholt et al. (2017)
Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx
Breinholt_et_al_Supplementary_File_2_Lep1
Specification file for the Lep1 probe set used to order probes from Agilent Technologies (http://www.agilent.com/)
Breinholt_et_al_Supplementary_File_3
Word document that expands on the discussion in Breinholt et al. (2017) and discusses lepidopteran relationships in more detail
Lep1_ref
Compressed file containing the data for each reference for each locus in the Lep1 kit; these references were also used in the IBA assembly.
JAVA_SourceCode
Compressed directory holding Java source code by A.R.L (alemmon@evotutor.org)
This directory contains a readme and instructions for using and compiling the Java code for IdentifySpacedKmers7, QuickScan5, and ShallowMapper4. It also contains the Lep1_ProbeDesign directory used with the Java programs to design the Lep1 probe set
(IdentifySpacedKmers7, IdentifySpacedKmers7_readme.txt, Lep1_ProbeDesign, LepRefFiles.txt, QuickScan5_readme.txt, QuickScan5.java, ShallowMapper4_readme.txt, ShallowMapper4.java)
ShallowMapper4: Java program by A.R.L used to identify intron boundaries in genes for five reference taxa by mapping raw genomic reads to the corresponding transcriptomic sequences
QuickScan5: Java program by A.R.L used to scan the additional 23 transcriptomes and ESTs by generating reference kmers from the 5-species alignments and using those kmers to map contig sequences from the transcriptomes to the candidate locus set
Breinholt_et_al_LOG_COMMANDS
Set of commands used to run the bioinformatic pipeline to generate data for Breinholt et al. (2017)
Scripts_README
Description of the python scripts and directions for running them.
IBA
python script to assemble AHE data locus by locus
IBA_trans
python script to assemble AHE data locus by locus using a fastq file from transcriptome data
extract_probe_region
python script to split alignment into head, probe, and tail regions based on the beginning and end of a reference sequence in the alignment
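The head/probe/tail split described above can be sketched as follows. This is a minimal illustration, not the extract_probe_region script itself: it assumes the alignment is held as a dictionary of equal-length aligned strings, and the function name split_by_reference and the gap characters are hypothetical.

```python
def split_by_reference(alignment, ref_name, gap_chars="-?"):
    """Split an alignment into head, probe, and tail regions.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    The probe region runs from the first to the last non-gap column of the
    reference sequence; everything before is head, everything after is tail.
    """
    ref = alignment[ref_name]
    cols = [i for i, ch in enumerate(ref) if ch not in gap_chars]
    if not cols:
        raise ValueError("reference sequence contains only gaps")
    start, end = cols[0], cols[-1] + 1
    head = {name: seq[:start] for name, seq in alignment.items()}
    probe = {name: seq[start:end] for name, seq in alignment.items()}
    tail = {name: seq[end:] for name, seq in alignment.items()}
    return head, probe, tail


# Toy alignment (invented data, for illustration only).
aln = {
    "ref":    "---ATGCCG---",
    "taxon1": "TTAATGCCGGAC",
    "taxon2": "-TAATGACGGA-",
}
head, probe, tail = split_by_reference(aln, "ref")
```

In this toy case the reference spans columns 4 to 9, so each sequence is cut into a 3-column head, a 6-column probe region, and a 3-column tail.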
s_hit_checker
python script to process the output of BLAST to find sequences that fit the single hit criteria
ortholog_filter
python script to process the output of BLAST to check whether the best hit on the genome is at the same location as the probe target from that genome.
split
python script to split a single line fasta file with many loci into locus specific fasta files
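A minimal sketch of this kind of split is shown below. It is not the split script from this package: the header layout (locus name first, followed by a "__" separator) and the function names are assumptions made for illustration, and the real header format should be checked in Scripts_README.

```python
from collections import defaultdict
from pathlib import Path


def group_by_locus(fasta_lines, sep="__"):
    """Group single-line fasta records by the locus named in each header.

    Assumes each record is exactly two lines (header, sequence) and that the
    header starts with the locus name followed by `sep` (hypothetical layout).
    """
    lines = [ln.strip() for ln in fasta_lines if ln.strip()]
    groups = defaultdict(list)
    for header, seq in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected a header line, got: {header!r}")
        locus = header[1:].split(sep, 1)[0]
        groups[locus].append((header, seq))
    return dict(groups)


def write_locus_files(groups, out_dir):
    """Write one fasta file per locus into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for locus, records in groups.items():
        with open(out / f"{locus}.fasta", "w") as handle:
            for header, seq in records:
                handle.write(f"{header}\n{seq}\n")


# Toy input (invented data, for illustration only).
example = [">L1__taxonA", "ATG", ">L2__taxonA", "CCC", ">L1__taxonB", "ATA"]
groups = group_by_locus(example)
```

Calling write_locus_files(groups, "loci") would then produce L1.fasta and L2.fasta in a directory named loci.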
alignment_DE_trim
python script to trim alignments by density and entropy
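The two column statistics named here can be illustrated with a short sketch. How alignment_DE_trim applies its thresholds is not described in this listing, so the rule below (keep a column when its density is at least min_density and its Shannon entropy, in bits, is at most max_entropy) is only one plausible reading. The function names and default values are assumptions.

```python
import math
from collections import Counter

GAPS = set("-?")


def column_stats(column):
    """Return (density, entropy) for one alignment column.

    density: fraction of sequences with a non-gap character in the column.
    entropy: Shannon entropy (bits) of the non-gap characters.
    """
    residues = [c.upper() for c in column if c not in GAPS]
    density = len(residues) / len(column)
    if not residues:
        return density, 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return density, entropy


def trim_alignment(seqs, min_density=0.6, max_entropy=1.5):
    """Keep only columns that pass both thresholds (assumed rule)."""
    keep = []
    for i, column in enumerate(zip(*seqs)):
        density, entropy = column_stats(column)
        if density >= min_density and entropy <= max_entropy:
            keep.append(i)
    return ["".join(seq[i] for i in keep) for seq in seqs]


# Toy alignment (invented data): column 4 is sparse, column 5 is highly variable.
toy = ["ATG-A", "ATGCC", "ATG-G", "A-G-T"]
trimmed = trim_alignment(toy)
```

In the toy alignment the fourth column has a density of 0.25 and the fifth has an entropy of 2 bits, so both are dropped under the assumed rule.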
flank_dropper
python script to remove poorly aligned sequences in the flanking head and tail regions
counting_monster
python script to count the loci per taxon and write the counts to a tab-separated matrix
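A tab-separated count matrix of this kind could be built as in the sketch below. The input structure (a dictionary from locus name to the taxon names found in that locus), the row and column orientation, and the function name count_matrix are all assumptions for illustration; the output layout of counting_monster may differ.

```python
from collections import Counter


def count_matrix(locus_to_taxa):
    """Build a tab-separated matrix with one row per taxon and one column per locus.

    Each cell holds the number of sequences the taxon has for that locus;
    a final column gives the row total.
    """
    loci = sorted(locus_to_taxa)
    taxa = sorted({t for names in locus_to_taxa.values() for t in names})
    counts = {locus: Counter(names) for locus, names in locus_to_taxa.items()}
    rows = ["\t".join(["taxon"] + loci + ["total"])]
    for taxon in taxa:
        per_locus = [counts[locus][taxon] for locus in loci]
        cells = [taxon] + [str(n) for n in per_locus] + [str(sum(per_locus))]
        rows.append("\t".join(cells))
    return "\n".join(rows)


# Toy input (invented data, for illustration only).
matrix = count_matrix({"L1": ["A", "B"], "L2": ["A"], "L3": ["B", "B"]})
```

A cell greater than 1 (taxon B at L3 in the toy input) flags a taxon with more than one sequence for a locus.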
removelist
python script to remove a list of sequences from a fasta file
getlist
python script to get a list of sequences from a fasta file
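Removing and getting listed sequences are the same filter with opposite sense, which the sketch below shows with a single function. This is not the code of removelist or getlist: the function names are hypothetical, and matching on the full header text is an assumption.

```python
def read_fasta(lines):
    """Parse fasta lines into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def filter_records(records, names, keep=True):
    """keep=True returns only the listed records; keep=False removes them."""
    wanted = set(names)
    return [(n, s) for n, s in records if (n in wanted) == keep]


# Toy input (invented data, for illustration only).
recs = read_fasta([">s1", "ATG", "CCC", ">s2", "GGG", ">s3", "TTT"])
got = filter_records(recs, ["s1", "s3"], keep=True)
removed = filter_records(recs, ["s1", "s3"], keep=False)
```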
contamination_filter
python script to identify contamination by processing the results of blasting the sequences from each locus against themselves using usearch
remove_duplicates
python script to identify and remove sequences for each taxon that had more than one sequence per locus
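One way to express this filter is sketched below. It assumes that every copy is dropped when a taxon has more than one sequence for a locus; the description above does not say whether remove_duplicates drops all copies or keeps one. The record layout and function name are hypothetical.

```python
from collections import Counter


def drop_duplicated(records):
    """Drop all sequences of any (locus, taxon) pair that occurs more than once.

    records: list of (locus, taxon, sequence) tuples.
    Returns (kept_records, duplicated_pairs).
    """
    counts = Counter((locus, taxon) for locus, taxon, _ in records)
    kept = [rec for rec in records if counts[(rec[0], rec[1])] == 1]
    duplicated = sorted(pair for pair, n in counts.items() if n > 1)
    return kept, duplicated


# Toy input (invented data): taxon A has two sequences for locus L1.
records = [
    ("L1", "A", "ATG"),
    ("L1", "A", "ATA"),
    ("L1", "B", "ATG"),
    ("L2", "A", "CCC"),
]
kept, duplicated = drop_duplicated(records)
```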
taxa_list
List of sample IDs used in the nexus files and the corresponding species names, in tab-delimited text
Breinholt_et_al_RAW_DATA.tar.gz
compressed file containing the raw Illumina (2X100) AHE data
Breinholtetal_RAW_DATA.tar.gz
final_soap_FG120036B
Assembly of Apatelodes pithala from GenBank SRA accession #SRR1794032, using multiple kmers (13,23,33,43,63) with SOAPdenovo-Trans v1.01. The assemblies from the different kmers were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.
final_soap_calo2
Assembly of Caloptilia triadicae from GenBank SRA accession #SRR1794032, using multiple kmers (13,23,33,43,63) with SOAPdenovo-Trans v1.01. The assemblies from the different kmers were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.
final_soap_GV120010B
Assembly of Urbanus proteus from GenBank SRA accession #SRR1794082, using multiple kmers (13,23,33,43,63) with SOAPdenovo-Trans v1.01. The assemblies from the different kmers were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.
Breinholt_et_al_acrossLep_full_assemblies_all_loci
Fasta-formatted sequence file containing the sequences that pass pipeline steps 1-6 for all loci and taxa in datasets 1-3. This file can be split into fasta files of individual loci using split.py.
Breinholt_et_al_shallow_full_assemblies_all_loci
Fasta-formatted sequence file containing the sequences that pass pipeline steps 1-6 for all loci and taxa in datasets 4-6. This file can be split into fasta files of individual loci using split.py.
Breinholt_et_al_allcodonpostion123_acrossLep
Nexus file containing codon positions 1, 2, and 3 for 557 loci and 75 taxa used to make datasets 1-3. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene; each gene starts with codon position 1. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
Breinholt_et_al_degen12_DS1
Dataset 1 (acrossLEP_AHE). Nexus file containing codon positions 1 and 2 for 557 loci and 23 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene; each gene starts with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
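Because each gene in these files starts at codon position 1, removing the third codon position amounts to dropping every third site of an in-frame gene alignment. The sketch below shows only that step (not the degen recoding), and the function name is hypothetical.

```python
def drop_third_positions(seq):
    """Return seq without every third site (0-based indices 2, 5, 8, ...).

    Assumes the sequence is in frame and starts at codon position 1.
    """
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


# Toy in-frame sequence (invented data): codons ATG, GCY, TTR.
codons = "ATGGCYTTR"
first_second = drop_third_positions(codons)
```

For the toy sequence, the codons ATG, GCY, and TTR reduce to AT, GC, and TT.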
Breinholt_et_al_aminoacid_DS1
Nexus file containing amino acid data for 557 loci and 23 taxa. See taxa_list.txt for the species name of each taxon. This is an amino acid nexus file with a CHARSET that defines each locus. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
Breinholt_et_al_degen12_DS2
Dataset 2 (acrossLEP_AHE+PARTtrans). Nexus file containing codon positions 1 and 2 for 557 loci and 75 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene; each gene starts with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
Breinholt_et_al_aminoacid_DS2
Nexus file containing amino acid data for 557 loci and 75 taxa. See taxa_list.txt for the species name of each taxon. This is an amino acid nexus file with a CHARSET that defines each locus. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
Breinholt_et_al_degen12_DS3
Dataset 3 (acrossLEP_AHE+ALLtrans). Nexus file consisting of both the AHE data and the transcriptomic data of Kawahara and Breinholt 2015. The file contains codon positions 1 and 2 for 2948 loci and 76 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene; each gene starts with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
Breinholt_et_al_DS4
Dataset 4 (shallow_probe+flanks). Nexus file containing 749 loci and 48 taxa. Alignments were trimmed with a density of 60% and an entropy of 1.5 using alignment_DE_trim.py, and flanking regions were processed with flank_dropper.py to remove head or tail sequences using 2 standard deviations for both the head and tail. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
Breinholt_et_al_DS5
Dataset 5 (shallow_probe). Nexus file containing 749 loci and 48 taxa. The Extract_probe_region.py script was used on Dataset 4 to isolate the data from the probe region. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
Breinholt_et_al_DS6
Dataset 6 (shallow_flanks). Nexus file containing 749 loci and 35 taxa. The Extract_probe_region.py script was used on Dataset 4 to isolate the data from the flanking regions. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.