New DNA sequencing technologies are allowing researchers to explore the genomes of the millions of natural history specimens collected prior to the molecular era. Yet, we know little about how well specific next-generation sequencing (NGS) techniques work with the degraded DNA typically extracted from museum specimens. Here, we use one type of NGS approach, sequence capture of ultraconserved elements (UCEs), to collect data from bird museum specimens as old as 120 years. We targeted 5060 UCE loci in 27 western scrub-jays (Aphelocoma californica) representing three evolutionary lineages that could be species, and we collected an average of 3749 UCE loci containing 4460 single nucleotide polymorphisms (SNPs). Despite older specimens producing fewer and shorter loci in general, we collected thousands of markers from even the oldest specimens. More sequencing reads per individual helped to boost the number of UCE loci we recovered from older specimens, but more sequencing was not as successful at increasing the length of loci. We detected contamination in some samples and determined that contamination was more prevalent in older samples that were subject to less sequencing. For the phylogeny generated from concatenated UCE loci, contamination led to incorrect placement of some individuals. In contrast, a species tree constructed from SNPs called within UCE loci correctly placed individuals into three monophyletic groups, perhaps because of the stricter analytical procedures used for SNP calling. This study and other recent studies on the genomics of museum specimens have profound implications for natural history collections, where millions of older specimens should now be considered genomic resources.
trinity-assemblies-for-dryad.tar.gz
Raw reads assembled de novo into contigs using trinity (trinityrnaseq_r2013-02-25).
toepad-taxa.conf
The taxon-names configuration file for creating UCE data sets with get_match_counts.py from phyluce (https://github.com/faircloth-lab/phyluce).
toepad-taxa-COMPLETE.tar.gz
Untrimmed and trimmed Nexus files, PHYLIP concatenated alignment, and RAxML input/output for the 100% complete data matrix.
toepad-taxa-COMPLETE-RAxML_bipartitions.FINAL.tre
Newick-formatted tre file for the 100% complete concatenated matrix.
toepad-taxa-INCOMPLETE.tar.gz
Untrimmed and trimmed Nexus files, PHYLIP concatenated alignment, and RAxML input/output for the incomplete data matrix. Includes a directory of coverage computations for all UCE loci as well as for loci in the 75% complete data matrix.
toepad-taxa-INCOMPLETE-RAxML_bipartitions.FINAL
Newick-formatted tre file for the 75% complete concatenated matrix.
wesj-103135-reference-sequence.tar.gz
Quasi-reference sequence and sequence indexes for LACM-103135. Used for SNP calling.
wesj-toepads-realigned.bam.placeholder
Merged BAM file that we realigned around indels using GATK. Contains alignment information for all individuals across which we called SNPs using GATK.
wesj-toepads-realigned.bam
wesj-toepads-rawSNPS-Q30.vcf.gz
Raw SNP calls output by GATK (i.e., not filtered using VariantFiltration).
wesj-toepads-inDels-Q30.vcf.gz
InDel calls used for VariantFiltration in GATK.
wesj-toepads-Q30-QD2-LOW-STRICT.vcf.gz
Filtered raw SNP data (i.e., these have been put through VariantFiltration).
Q30-QD2-MISS_0.75.recode.vcf.gz
Filtered SNPs that have undergone VariantFiltration and also selection for presence in 75% of individuals' genotypes. SNPs in this file have further been filtered for the presence of informative sites.
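The 75% presence criterion can be illustrated with a short sketch. This is not the pipeline used to produce the file above; it is a minimal, standard-library illustration that assumes an uncompressed, tab-delimited VCF whose FORMAT column includes a GT key. The function names call_rate and filter_by_call_rate are hypothetical.

```python
def call_rate(vcf_line: str) -> float:
    """Return the fraction of samples with a fully called genotype on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # position of GT within FORMAT
    samples = fields[9:]
    called = 0
    for sample in samples:
        parts = sample.split(":")
        gt = parts[gt_index] if gt_index < len(parts) else "."
        alleles = gt.replace("|", "/").split("/")
        # A genotype counts as called only if no allele is the missing symbol "."
        if all(allele != "." for allele in alleles):
            called += 1
    return called / len(samples) if samples else 0.0


def filter_by_call_rate(lines, min_rate=0.75):
    """Yield header lines unchanged, plus data lines whose call rate meets min_rate."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#") or call_rate(line) >= min_rate:
            yield line
```

With four samples, a site where three genotypes are called has a call rate of 0.75 and is kept; a site with only two called genotypes is dropped.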
Q30-QD2-MISS_0.75.recode.snapp.nexus.gz
Filtered SNPs that have undergone VariantFiltration and also selection for presence in 75% of individuals' genotypes. SNPs in this file have further been filtered for the presence of informative sites. SNPs in this file have been re-formatted for SNAPP, and this file contains data for individuals USNM-144749 and MLZ-59834, which we removed from the data set prior to analysis.
Structure_infile_SNPS_jays.txt.gz
Filtered SNPs that have undergone VariantFiltration and also selection for presence in 75% of individuals' genotypes. SNPs in this file have further been filtered for the presence of informative sites. Finally, SNPs in this file have been re-formatted for STRUCTURE.
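As a rough illustration of what re-formatting for STRUCTURE can involve, the sketch below converts diploid VCF-style genotype strings into the two-rows-per-individual layout that STRUCTURE commonly reads, with -9 as the missing-data code. The exact layout of Structure_infile_SNPS_jays.txt is not described here, so the layout, the allele coding, and the function name genotypes_to_structure_rows are all assumptions.

```python
MISSING = "-9"  # assumed missing-data code; STRUCTURE lets the user configure this


def genotypes_to_structure_rows(sample_id, genotypes):
    """Convert genotype strings such as '0/1' into two space-delimited rows, one per allele copy."""
    row_a, row_b = [sample_id], [sample_id]
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        # Treat anything that is not a fully called diploid genotype as missing
        if len(alleles) != 2 or "." in alleles:
            alleles = [MISSING, MISSING]
        row_a.append(alleles[0])
        row_b.append(alleles[1])
    return " ".join(row_a), " ".join(row_b)
```

For example, calling the function with the label "LACM-103135" and the genotypes "0/1", "./.", and "1/1" yields one row per allele copy, with the second locus coded as missing in both rows.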
SNAPP_analysis.tar.gz
Input files prepared for SNAPP and output files generated by SNAPP. The WESJ_SNAPP_TIPS_ed.xml input file created for SNAPP *does not* contain data for individuals USNM-144749 and MLZ-59834, which we removed from the data set prior to analysis.
Structure_Results_SNPs_jays.tar.gz
Output files generated by STRUCTURE.
toepad-taxa.ProcessingSteps.DRYAD.txt
Processing steps we used for analysis of these data.
SRA-locations.txt
Locations of raw sequence read files in the NCBI SRA.