A snakemake toolkit for the batch assembly, annotation, and phylogenetic analysis of mitochondrial genomes and ribosomal genes from genome skims of museum collections
Data files
Oct 16, 2024 version files 329.72 KB
- gene2phylo_config_solariellid.yaml (599 B)
- README.md (1.72 KB)
- skim2mito_config_solariellid.yaml (1.11 KB)
- skim2mito_gene.fasta (124.17 KB)
- skim2mito_samples_solariellid.csv (4.86 KB)
- skim2mito_seed.fasta (175.39 KB)
- skim2rrna_config_solariellid.yaml (937 B)
- skim2rrna_rRNA.fasta (15.21 KB)
- skim2rrna_samples_solariellid.csv (5.72 KB)
Abstract
Low-coverage “genome-skims” are often used to assemble organelle genomes and ribosomal gene sequences for cost-effective phylogenetic and barcoding studies. Natural history collections hold invaluable biological information, yet poor preservation, which results in degraded DNA, often hinders PCR-based analyses. However, it is possible to prepare libraries from the short fragments typical of degraded DNA and sequence them to produce genome-skims from museum collections.
Here we introduce a snakemake toolkit comprising three pipelines, skim2mito, skim2rrna and gene2phylo, designed to unlock the genomic potential of historical museum specimens using genome skimming. Specifically, skim2mito and skim2rrna perform the batch assembly, annotation and phylogenetic analysis of mitochondrial genomes and nuclear ribosomal genes, respectively, from low-coverage genome skims. The third pipeline, gene2phylo, takes a set of gene alignments and performs phylogenetic analysis of individual genes, partitioned analysis of concatenated alignments and a phylogenetic analysis based on gene trees.
We benchmark our pipelines with simulated data, followed by testing with a novel genome skimming dataset from both recent and historical solariellid gastropod samples. We show that the toolkit can recover mitochondrial and ribosomal genes from poorly preserved museum specimens of the gastropod family Solariellidae, and the phylogenetic analysis is consistent with our current understanding of taxonomic relationships.
The generation of bioinformatic pipelines that facilitate processing large quantities of sequence data from the vast repository of specimens held in natural history museum collections will greatly aid species discovery and exploration of biodiversity over time, ultimately aiding conservation efforts in the face of a changing planet.
README: A snakemake toolkit for the batch assembly, annotation, and phylogenetic analysis of mitochondrial genomes and ribosomal genes from genome skims of museum collections
https://doi.org/10.5061/dryad.h70rxwdt2
Description of the data and file structure
This dataset includes the configuration files and reference data used for the analyses of genome skimming data from Solariellidae museum collections.
Files and variables
File: skim2mito_config_solariellid.yaml
Description: Config file used for the skim2mito pipeline.
File: skim2mito_gene.fasta
Description: Reference gene fasta used for the GetOrganelle step of the pipeline.
File: skim2mito_samples_solariellid.csv
Description: Samples file used for the skim2mito pipeline.
Variables
- ID: Sample ID
- forward: Path to forward reads
- reverse: Path to reverse reads
- taxid: NCBI taxon ID
- seed: Path to GetOrganelle seed reference
- gene: Path to GetOrganelle gene reference
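The column layout above can be checked before a run. The following is a minimal sketch, not part of the pipelines themselves: it assumes the samples file is a comma-separated file with a header row using exactly the six variable names listed, and it adds two checks that are assumptions of this example rather than documented requirements (sample IDs are unique, and `taxid` is a plain integer). The function name `check_samples` is hypothetical.

```python
import csv
import io

# Column names as listed under "Variables" for the samples files.
REQUIRED_COLUMNS = ["ID", "forward", "reverse", "taxid", "seed", "gene"]


def check_samples(handle):
    """Return a list of problems found in a samples CSV (empty if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems

    seen = set()
    for line_no, row in enumerate(reader, start=2):
        sample = (row["ID"] or "").strip()
        if not sample:
            problems.append(f"line {line_no}: empty ID")
        elif sample in seen:
            # Assumption of this sketch: each sample ID appears once.
            problems.append(f"line {line_no}: duplicate ID {sample}")
        seen.add(sample)

        # Assumption of this sketch: NCBI taxon IDs are plain integers.
        if not (row["taxid"] or "").strip().isdigit():
            problems.append(f"line {line_no}: taxid is not an integer")

        # Every path column should be filled in.
        for column in ("forward", "reverse", "seed", "gene"):
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty {column}")
    return problems


if __name__ == "__main__":
    # Hypothetical example row; the paths and taxid are placeholders.
    example = (
        "ID,forward,reverse,taxid,seed,gene\n"
        "sample_1,reads/s1_R1.fq.gz,reads/s1_R2.fq.gz,12345,"
        "skim2mito_seed.fasta,skim2mito_gene.fasta\n"
    )
    print(check_samples(io.StringIO(example)))
```

To check a file on disk, open it with `open(path, newline="")` and pass the handle to `check_samples`.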
File: skim2mito_seed.fasta
Description: Reference seed fasta used for the GetOrganelle step of the pipeline.
File: skim2rrna_config_solariellid.yaml
Description: Config file used for the skim2rrna pipeline.
File: skim2rrna_samples_solariellid.csv
Description: Samples file used for the skim2rrna pipeline.
Variables
- ID: Sample ID
- forward: Path to forward reads
- reverse: Path to reverse reads
- taxid: NCBI taxon ID
- seed: Path to GetOrganelle seed reference
- gene: Path to GetOrganelle gene reference
File: skim2rrna_rRNA.fasta
Description: Reference gene and seed fasta used for the GetOrganelle step of the pipeline.