The advent of next-generation sequencing technology has allowed for the collection of large portions of the genome for phylogenetic analysis. Hybrid enrichment and transcriptomics are two techniques that leverage next-generation sequencing and have shown much promise. However, methods for processing hybrid enrichment data are still limited. We developed a pipeline for anchored hybrid enrichment (AHE) read assembly, orthology determination, contamination screening, and data processing for sequences flanking the target “probe” region. We apply this approach to study the phylogeny of butterflies and moths (Lepidoptera), a megadiverse group of more than 157,000 described species with poorly understood deep-level phylogenetic relationships. We introduce a new, 855 locus anchored hybrid enrichment kit for Lepidoptera phylogenetics and compare resulting trees to those from transcriptomes. The enrichment kit was designed from existing genomes, transcriptomes and expressed sequence tag (EST) data and was used to capture sequence data from 54 species from 23 lepidopteran families. Phylogenies estimated from AHE data were largely congruent with trees generated from transcriptomes, with strong support for relationships at all but the deepest taxonomic levels. We combine AHE and transcriptomic data to generate a new Lepidoptera phylogeny, representing 76 exemplar species in 42 families. The tree provides robust support for many relationships, including those among the seven butterfly families. The addition of AHE data to an existing transcriptomic dataset lowers node support along the Lepidoptera backbone, but firmly places taxa with AHE data on the phylogeny. To examine the efficacy of AHE at different taxonomic levels, phylogenetic analyses were also conducted on a sister group representing a more recent divergence, the Saturniidae and Sphingidae. 
These analyses utilized sequences from the probe region, and the data flanking it nearly doubled the size of the dataset; all resulting trees were well supported. We hope that our data processing pipeline, hybrid enrichment gene set, and approach of combining AHE data with transcriptomes will be useful for the broader systematics community.

README

README file listing the files and scripts contained in this Dryad package

Breinholt_et_al_Supplementary_Figure_S1

Supplementary Figure S1 from Breinholt et al. (2017)

Breinholt_et_al_Supplementary_Figure_S2

Supplementary Figure S2 from Breinholt et al. (2017)

Breinholt_et_al_Supplementary_Figure_S3

Supplementary Figure S3 from Breinholt et al. (2017)

Breinholt_et_al_Supplementary_Figure_S4

Supplementary Figure S4 from Breinholt et al. (2017)

Breinholt_et_al_Supplementary_Figure_S5

Supplementary Figure S5 from Breinholt et al. (2017)

Breinholt_et_al_Supplementary_File_1__S1-S11

Supplementary File 1: Microsoft Excel document containing Supplementary Tables S1-S11 from Breinholt et al. (2017)

Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx

Breinholt_et_al_Supplementary_File_2_Lep1

Specification file for the Lep1 probe set used to order probes from Agilent Technologies (http://www.agilent.com/)

Breinholt_et_al_Supplementary_File_3

Word document that expands the discussion in Breinholt et al. (2017) and discusses lepidopteran relationships in more detail

Lep1_ref

Compressed file containing the reference data for each locus in the Lep1 kit; these references were also used in the IBA assembly.

JAVA_SourceCode

Compressed directory holding Java source code by A.R.L. (alemmon@evotutor.org). The directory contains readme files with instructions for compiling and using the Java code for IdentifySpacedKmers7, QuickScan5, and ShallowMapper4. It also contains the Lep1_ProbeDesign directory used with the Java programs to design the Lep1 probe set. Contents: IdentifySpacedKmers7, IdentifySpacedKmers7_readme.txt, Lep1_ProbeDesign, LepRefFiles.txt, QuickScan5_readme.txt, QuickScan5.java, ShallowMapper4_readme.txt, ShallowMapper4.java. ShallowMapper4: Java program by A.R.L. used to identify intron boundaries in genes for five reference taxa by mapping raw genomic reads to the corresponding transcriptomic sequences. QuickScan5: Java program by A.R.L. used to scan the additional 23 transcriptomes and ESTs by generating reference k-mers from the 5-species alignments and using those k-mers to map contig sequences from the transcriptomes to the candidate locus set.

Breinholt_et_al_LOG_COMMANDS

Set of commands used to run the bioinformatic pipeline that generated the data for Breinholt et al. (2017)

Scripts_README

Description of the python scripts and directions for running them.

IBA

python script to assemble AHE data locus by locus

IBA_trans

python script to assemble AHE data locus by locus using a fastq file from transcriptome data

extract_probe_region

python script to split alignment into head, probe, and tail regions based on the beginning and end of a reference sequence in the alignment
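
As a rough illustration of the head/probe/tail split described above, the sketch below finds the first and last non-gap columns of a reference sequence and slices every aligned sequence at those columns. It is not the code of extract_probe_region: the function name `split_by_reference`, the dict-based alignment, and the use of `-` as the gap character are all assumptions made for this example.

```python
def split_by_reference(alignment, ref_name, gap="-"):
    """Split aligned sequences into head, probe, and tail regions.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    ref_name:  name of the reference sequence that marks the probe region.
    """
    ref = alignment[ref_name]
    # Columns where the reference has a residue rather than a gap.
    occupied = [i for i, base in enumerate(ref) if base != gap]
    if not occupied:
        raise ValueError("reference sequence contains only gaps")
    start, end = occupied[0], occupied[-1] + 1

    head = {name: seq[:start] for name, seq in alignment.items()}
    probe = {name: seq[start:end] for name, seq in alignment.items()}
    tail = {name: seq[end:] for name, seq in alignment.items()}
    return head, probe, tail


# Hypothetical three-sequence alignment; "ref" spans columns 3-8.
example = {
    "ref":    "---ACGTAC---",
    "taxon1": "TTGACGTACGGA",
    "taxon2": "--GACGAAC-A-",
}
head, probe, tail = split_by_reference(example, "ref")
```

Here the columns before the reference begins form the head and the columns after it ends form the tail, so the three pieces concatenate back to the original alignment.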

s_hit_checker

python script to process the output of BLAST to find sequences that fit the single hit criteria

ortholog_filter

python script to process the output of BLAST to determine whether the best hit on the genome is at the same location as the probe target from that genome.

split

python script to split a single line fasta file with many loci into locus specific fasta files
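
The idea of splitting a combined single-line fasta file by locus can be sketched as below. This is only an illustration, not the split script itself: the header layout used here (locus name before the first underscore) is a hypothetical naming scheme, and `group_by_locus` and `locus_of` are names invented for the example.

```python
from collections import defaultdict


def group_by_locus(lines, locus_of=lambda header: header.split("_")[0]):
    """Group single-line FASTA records by locus.

    lines:    iterable alternating header lines (">...") and sequence lines.
    locus_of: function extracting the locus name from a header without ">";
              the default (text before the first "_") is a hypothetical
              naming scheme, not necessarily the one used in this package.
    """
    groups = defaultdict(list)
    lines = [line.strip() for line in lines if line.strip()]
    # In a single-line FASTA, headers and sequences strictly alternate.
    for header, seq in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected a FASTA header, got: {header!r}")
        groups[locus_of(header[1:])].append((header[1:], seq))
    return dict(groups)


# Hypothetical combined file with two loci and two taxa.
combined = [">L1_taxonA", "ACGT", ">L2_taxonA", "GGCC", ">L1_taxonB", "ACGA"]
by_locus = group_by_locus(combined)
```

Each entry of `by_locus` could then be written to its own fasta file, one file per locus.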

alignment_DE_trim

python script to trim alignments by density and entropy
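
To make the two trimming criteria concrete, here is a minimal sketch of column filtering by density and entropy. It assumes that density means the fraction of sequences with a non-gap character in a column, that entropy means the Shannon entropy (in bits) of the non-gap characters, and that a column is kept when it is dense enough and not too variable. These definitions, the function names, and the direction of each threshold are assumptions for illustration; alignment_DE_trim may define them differently. The default values mirror the density of 60% and entropy of 1.5 quoted for Dataset 4 in this package.

```python
import math
from collections import Counter


def column_density(column, gap="-"):
    """Fraction of sequences with a non-gap character in this column."""
    return sum(1 for c in column if c != gap) / len(column)


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of the non-gap characters in this column."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def trim_alignment(seqs, min_density=0.6, max_entropy=1.5):
    """Keep columns that are dense enough and not too variable (assumed rule)."""
    columns = list(zip(*seqs))
    keep = [
        i for i, col in enumerate(columns)
        if column_density(col) >= min_density
        and column_entropy(col) <= max_entropy
    ]
    return ["".join(seq[i] for i in keep) for seq in seqs]


# Hypothetical alignment: column 3 is mostly gaps, column 5 is fully variable.
aligned = [
    "AC-GT",
    "AC-GA",
    "ACTGC",
    "A--GG",
]
trimmed = trim_alignment(aligned)
```

In this toy alignment the third column is dropped for low density (0.25) and the fifth for high entropy (2 bits), leaving three columns.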

flank_dropper

python script to remove poorly aligned sequences in the flanking head and tail regions

counting_monster

python script to count the loci per taxon and write the counts to a tab-separated matrix
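
A taxon-by-locus count matrix of the kind described above might be built as in the following sketch. The input format (one `(taxon, locus)` pair per sequence), the column names, and the function name `presence_matrix` are assumptions for illustration and do not reflect the actual layout produced by counting_monster.

```python
from collections import defaultdict


def presence_matrix(records):
    """Build a taxon-by-locus count matrix as tab-separated text.

    records: iterable of (taxon, locus) pairs, one per sequence.
    """
    counts = defaultdict(lambda: defaultdict(int))
    loci = set()
    for taxon, locus in records:
        counts[taxon][locus] += 1
        loci.add(locus)

    loci = sorted(loci)
    lines = ["\t".join(["taxon"] + loci + ["total_loci"])]
    for taxon in sorted(counts):
        row = [counts[taxon].get(locus, 0) for locus in loci]
        # Number of loci with at least one sequence for this taxon.
        total = sum(1 for n in row if n > 0)
        lines.append("\t".join([taxon] + [str(n) for n in row] + [str(total)]))
    return "\n".join(lines)


# Hypothetical records: taxonB has two sequences for locus L1 and none for L2.
records = [("taxonA", "L1"), ("taxonA", "L2"), ("taxonB", "L1"), ("taxonB", "L1")]
matrix = presence_matrix(records)
```

A cell greater than 1 flags a taxon with multiple sequences at a locus, which is the situation the duplicate-removal step in this package addresses.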

removelist

python script to remove a list of sequences from a fasta file

getlist

python script to get a list of sequences from a fasta file
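
The two list-based operations above (removing or extracting named sequences) are mirror images of each other, as the sketch below shows. It is a generic illustration rather than the code of removelist or getlist; the function names and the assumption that the list holds full header names (without `>`) are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, names, keep=True):
    """Keep (keep=True) or remove (keep=False) the records named in `names`."""
    names = set(names)
    return [
        (header, seq)
        for header, seq in read_fasta(lines)
        if (header in names) == keep
    ]


# Hypothetical input: seq2 spans two lines and is joined on reading.
fasta = [">seq1", "ACGT", ">seq2", "GGCC", "TTAA", ">seq3", "AAAA"]
kept = filter_fasta(fasta, ["seq1", "seq3"])
removed = filter_fasta(fasta, ["seq1", "seq3"], keep=False)
```

With `keep=True` the function behaves like a "get list" operation, and with `keep=False` like a "remove list" operation.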

contamination_filter

python script to identify contamination by processing the BLAST-style results of searching the sequences from each locus against themselves with usearch

remove_duplicates

python script to identify and remove sequences for each taxon that had more than one sequence per locus

taxa_list

List of sample IDs used in the nexus files and the corresponding species names, in tab-delimited text

Breinholt_et_al_RAW_DATA.tar.gz

compressed file containing the raw Illumina (2X100) AHE data

Breinholtetal_RAW_DATA.tar.gz

final_soap_FG120036B

Assembly of Apatelodes pithala from GenBank SRA accession #SRR1794032, using multiple k-mers (13, 23, 33, 43, 63) with SOAPdenovo-Trans v1.01. The different k-mer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.

final_soap_calo2

Assembly of Caloptilia triadicae from GenBank SRA accession #SRR1794032, using multiple k-mers (13, 23, 33, 43, 63) with SOAPdenovo-Trans v1.01. The different k-mer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.

final_soap_GV120010B

Assembly of Urbanus proteus from GenBank SRA accession #SRR1794082, using multiple k-mers (13, 23, 33, 43, 63) with SOAPdenovo-Trans v1.01. The different k-mer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.

Breinholt_et_al_acrossLep_full_assemblies_all_loci

Fasta-formatted sequence file containing the sequences that pass pipeline steps 1-6 for all loci and taxa in datasets 1-3. This file can be split into fasta files of individual loci using split.py.

Breinholt_et_al_shallow_full_assemblies_all_loci

Fasta-formatted sequence file containing the sequences that pass pipeline steps 1-6 for all loci and taxa in datasets 4-6. This file can be split into fasta files of individual loci using split.py.

Breinholt_et_al_allcodonpostion123_acrossLep

Nexus file containing codon positions 1, 2, and 3 for 557 loci and 75 taxa, used to make datasets 1-3. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET defining each gene; each gene starts with codon position 1. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_degen12_DS1

Dataset 1 (acrossLEP_AHE). Nexus file containing codon positions 1 and 2 for 557 loci and 23 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET defining each gene; each gene starts with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_aminoacid_DS1

Nexus file containing amino acid data for 557 loci and 23 taxa. See taxa_list.txt for the species name of each taxon. This is an amino acid nexus file with a CHARSET defining each locus. Locus names correspond to locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_degen12_DS2

Dataset 2 (acrossLEP_AHE+PARTtrans). Nexus file containing codon positions 1 and 2 for 557 loci and 75 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET defining each gene; each gene starts with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_aminoacid_DS2

Nexus file containing amino acid data for 557 loci and 75 taxa. See taxa_list.txt for the species name of each taxon. This is an amino acid nexus file with a CHARSET defining each locus. Locus names correspond to locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_degen12_DS3

Dataset 3 (acrossLEP_AHE+ALLtrans). Nexus file combining the AHE data with the transcriptomic data of Kawahara and Breinholt 2015. The file contains codon positions 1 and 2 for 2948 loci and 76 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET defining each gene; each gene starts with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to locus numbers in the Lep1 enrichment kit included in this Dryad package. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_DS4

Dataset 4 (shallow_probe+flanks). Nexus file containing 749 loci and 48 taxa. Alignments were trimmed with a density of 60% and an entropy of 1.5 using alignment_DE_trim.py, and flanking regions were processed with flank_dropper.py to remove head or tail sequences, using 2 standard deviations for both the head and the tail. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET defining each gene. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_DS5

Dataset 5 (shallow_probe). Nexus file containing 749 loci and 48 taxa. The Extract_probe_region.py script was used on Dataset 4 to isolate the data from the probe region. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET defining each gene. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_DS6

Dataset 6 (shallow_flanks). Nexus file containing 749 loci and 35 taxa. The Extract_probe_region.py script was used on Dataset 4 to isolate the data from the flanking regions. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET defining each gene. For more details, see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Data from: Resolving relationships among the megadiverse butterflies and moths with a novel pipeline for Anchored Phylogenomics
