Targeted enrichment of conserved genomic regions is a popular method for collecting large amounts of sequence data from non-model taxa for phylogenetic, phylogeographic and population genetic studies. For example, two available bait sets each allow enrichment of thousands of orthologous loci from >20 000 species (Faircloth et al. Systematic Biology, 61, 717–726, 2012; Molecular Ecology Resources, 15, 489–501, 2015). Unfortunately, few open-source workflows are available to identify conserved genomic elements shared among divergent taxa and to design enrichment baits targeting these regions. Those that do exist require extensive bioinformatics expertise and significant amounts of time to use. These shortcomings limit the application of targeted enrichment methods to additional organismal groups. Here, I describe a universal workflow for identifying conserved genomic regions in available genomic data and for designing targeted enrichment baits to collect data from these conserved regions. These methods require less expertise and less time, and they make better use of commonly available information to identify conserved loci and design baits to capture them. I apply this computational approach to the understudied arthropod groups Arachnida, Coleoptera, Diptera, Hemiptera and Lepidoptera to identify thousands of conserved loci in each group and design target enrichment baits to capture these loci. I then use in silico analyses to demonstrate that targeted enrichment of the conserved loci can be used to reconstruct the accepted relationships among genome sequences from the focal arthropod orders. The software workflow I created allowed me to identify thousands of conserved loci in five diverse arthropod groups and design sequence capture baits to target them. This suite of capture bait designs should enable collection of phylogenomic data from >900 000 arthropod species.
Although the examples in this manuscript focus on understudied arthropod groups, the approach I describe is applicable to all organismal groups having some form of pre-existing genomic information (e.g. other invertebrates, plants, fungi and microbes). Finally, the documentation, design steps, software code and bait sets developed here are available under an open-source license for restriction-free testing, use, and additional modification by any research group.
Arachnida Bait Design and Testing Files
The ZIP archive contains the files used to identify, design, and test probes targeting conserved loci in arachnids. The *-design-steps.md file is a description of the design steps followed for the group. The "BAM" directory contains mappings of real/simulated reads to the base genome sequence for the group. The "BED" directory contains BAM files converted to BED, as well as intermediate BED files created during BED processing. The "BED" directory also contains the database of putatively conserved loci in the exemplar+base genome taxa, and the temporary probe design file. The "*-probes" directory contains the database of temporary probe mappings to exemplar taxa, several intermediate files, and the principal probe design file. The "in-silico-test" directory contains all data from in-silico testing of the principal probe design file.
arachnida.zip
Arachnida-UCE-1.1K-v1 Bait Design
A target enrichment probe set designed from 1,120 UCE loci identified among arachnids. This is a ZIP archive of a FASTA file that is ready for submission of the probe set for synthesis.
Arachnida-UCE-1.1K-v1.fasta.zip
Coleoptera Bait Design and Testing Files
The ZIP archive contains the files used to identify, design, and test probes targeting conserved loci in Coleoptera. The *-design-steps.md file is a description of the design steps followed for the group. The "BAM" directory contains mappings of real/simulated reads to the base genome sequence for the group. The "BED" directory contains BAM files converted to BED, as well as intermediate BED files created during BED processing. The "BED" directory also contains the database of putatively conserved loci in the exemplar+base genome taxa, and the temporary probe design file. The "*-probes" directory contains the database of temporary probe mappings to exemplar taxa, several intermediate files, and the principal probe design file. The "in-silico-test" directory contains all data from in-silico testing of the principal probe design file.
coleoptera.zip
Coleoptera-UCE-1.1K-v1 Bait Design
A target enrichment probe set designed from 1,172 UCE loci identified among Coleoptera. This is a ZIP archive of a FASTA file that is ready for submission of the probe set for synthesis.
Coleoptera-UCE-1.1K-v1.fasta.zip
Diptera Bait Design and Testing Files
The ZIP archive contains the files used to identify, design, and test probes targeting conserved loci in Diptera. The *-design-steps.md file is a description of the design steps followed for the group. The "BAM" directory contains mappings of real/simulated reads to the base genome sequence for the group. The "BED" directory contains BAM files converted to BED, as well as intermediate BED files created during BED processing. The "BED" directory also contains the database of putatively conserved loci in the exemplar+base genome taxa, and the temporary probe design file. The "*-probes" directory contains the database of temporary probe mappings to exemplar taxa, several intermediate files, and the principal probe design file. The "in-silico-test" directory contains all data from in-silico testing of the principal probe design file.
diptera.zip
Diptera-UCE-2.7K-v1 Bait Design
A target enrichment probe set designed from 2,711 UCE loci identified among Diptera. This is a ZIP archive of a FASTA file that is ready for submission of the probe set for synthesis.
Diptera-UCE-2.7K-v1.fasta.zip
Hemiptera Bait Design and Testing Files
The ZIP archive contains the files used to identify, design, and test probes targeting conserved loci in Hemiptera. The *-design-steps.md file is a description of the design steps followed for the group. The "BAM" directory contains mappings of real/simulated reads to the base genome sequence for the group. The "BED" directory contains BAM files converted to BED, as well as intermediate BED files created during BED processing. The "BED" directory also contains the database of putatively conserved loci in the exemplar+base genome taxa, and the temporary probe design file. The "*-probes" directory contains the database of temporary probe mappings to exemplar taxa, several intermediate files, and the principal probe design file. The "in-silico-test" directory contains all data from in-silico testing of the principal probe design file.
hemiptera.zip
Hemiptera-UCE-2.7K-v1 Bait Design
A target enrichment probe set designed from 2,731 UCE loci identified among Hemiptera. This is a ZIP archive of a FASTA file that is ready for submission of the probe set for synthesis.
Hemiptera-UCE-2.7K-v1.fasta.zip
Lepidoptera Bait Design and Testing Files
The ZIP archive contains the files used to identify, design, and test probes targeting conserved loci in Lepidoptera. The *-design-steps.md file is a description of the design steps followed for the group. The "BAM" directory contains mappings of real/simulated reads to the base genome sequence for the group. The "BED" directory contains BAM files converted to BED, as well as intermediate BED files created during BED processing. The "BED" directory also contains the database of putatively conserved loci in the exemplar+base genome taxa, and the temporary probe design file. The "*-probes" directory contains the database of temporary probe mappings to exemplar taxa, several intermediate files, and the principal probe design file. The "in-silico-test" directory contains all data from in-silico testing of the principal probe design file.
lepidoptera.zip
Lepidoptera-UCE-1.3K-v1 Bait Design
A target enrichment probe set designed from 1,381 UCE loci identified among Lepidoptera. This is a ZIP archive of a FASTA file that is ready for submission of the probe set for synthesis.
Lepidoptera-UCE-1.3K-v1.fasta.zip
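Each bait design above is distributed as a ZIP archive of a FASTA file, so a quick sanity check before submitting a design for synthesis is to count the bait sequences and look at their lengths. The following is a minimal sketch using only the Python standard library. The function name `summarize_bait_archive` is hypothetical, and the sketch assumes that the archive holds one plain-text FASTA file whose name ends in `.fasta` or `.fa`; it makes no assumption about how the FASTA headers are formatted.

```python
import io
import zipfile


def summarize_bait_archive(zip_path):
    """Count sequences and collect distinct sequence lengths.

    Hypothetical helper: reads the first FASTA member found in the ZIP
    archive at `zip_path` and returns (number_of_sequences, sorted list
    of distinct sequence lengths).
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the FASTA member ends in .fasta or .fa.
        # Skip metadata entries that some archiving tools add.
        fasta_names = [
            name
            for name in archive.namelist()
            if name.lower().endswith((".fasta", ".fa"))
            and not name.startswith("__MACOSX")
        ]
        if not fasta_names:
            raise ValueError("no FASTA file found in archive")

        lengths = []
        current = None  # length of the record being read, if any
        with archive.open(fasta_names[0]) as handle:
            for line in io.TextIOWrapper(handle, encoding="utf-8"):
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    # A header line closes the previous record.
                    if current is not None:
                        lengths.append(current)
                    current = 0
                elif current is not None:
                    # Sequence may be wrapped over several lines.
                    current += len(line)
        if current is not None:
            lengths.append(current)

    return len(lengths), sorted(set(lengths))
```

For example, `summarize_bait_archive("Arachnida-UCE-1.1K-v1.fasta.zip")` would return the number of bait sequences in that archive together with the distinct bait lengths. The count reported is the number of baits, which need not equal the number of UCE loci listed in the description, because a design may target each locus with more than one bait.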