Dryad

Files used to develop and test SLAG, Seeded Local Assembly of Genes

Cite this dataset

Crane, Charles et al. (2021). Files used to develop and test SLAG, Seeded Local Assembly of Genes [Dataset]. Dryad. https://doi.org/10.5061/dryad.0p2ngf22s

Abstract

This dataset consists of configuration files and results files involved in developing and testing SLAG, an iterative Perl pipeline to construct local assemblies seeded on query sequences.  SLAG is intended for situations where a full genome assembly is economically or technically infeasible, and thus it emphasizes methods that can deal with shallow read depth.  SLAG and its user-runnable test suite have been deposited at https://github.com/cfcrane/SLAG.  This Dryad repository contains files that were used in preparing a manuscript, "SLAG: A Program for Seeded Local Assembly of Genes in Complex Genomes", which has been submitted to Molecular Ecology Resources.  The repository is divided into sections for scripts, simulated reads, real reads, and benchmarking against two functionally similar programs, aTRAM2 and SRAssembler.

Methods

SLAG was initially developed in 2012 under Mac OS X and further developed under RHEL 5.6 and CentOS 7 Linux on the Coates, Halstead, and Brown supercomputing clusters at Purdue University.  Benchmarking was performed exclusively on one queue (not always the same node) of the Brown cluster.  Each benchmarking job ran alone, with exclusive access to the memory and 24 cores of its node.  Non-benchmarking runs shared resources with other jobs.  These trials measured contig count, contig length, and percentage match to the reference genome from which the reads had been derived.

Usage notes

      This repository is a gzipped "tarball" that can be downloaded, uncompressed, and expanded to reveal a directory tree rooted at todryad, with subdirectories simulations, realdata, benchmarking, and scripts.  The simulations used reads extracted from random sites in three homoeologous regions of chromosomes 1A, 1B, and 1D in the International Wheat Genome Sequencing Consortium version 2.0 assembly of wheat variety 'Chinese Spring'; see the deflines in simulations/configs/cutoutregionforreadsimulations.fasta for coordinates.  The simulated reads fell into three groups by length: short (2x150 bp paired ends with error frequencies similar to Illumina), middle (400-600 bp with low error frequencies), and long (3500-14000 bp with ca. 10% error frequencies, similar to Oxford Nanopore reads of several years ago).  These groups are denoted SR, MR, and LR in the configuration files and in the resulting subset.contigs files.  The subset.contigs files contain the local assemblies that match the seeding sequence; scripts collected contig count, contig length, and percent nucleotide match to the parent genome assembly as found by blastn alignment.
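
      The contig count and contig length figures mentioned above can be recomputed along the following lines. This is only a minimal Python sketch, not one of the scripts shipped in the repository; it assumes that a subset.contigs file is FASTA-formatted (a ">" defline followed by sequence lines), which this description does not state explicitly. The function names are hypothetical, and the percent match found by blastn is not covered.

```python
def read_fasta_lengths(lines):
    """Yield (defline, sequence_length) pairs from an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if name is not None:
                yield name, length
            name, length = line[1:], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def contig_summary(lines):
    """Return contig count, total length, and longest contig length in bp."""
    lengths = [n for _, n in read_fasta_lengths(lines)]
    return {
        "count": len(lengths),
        "total_bp": sum(lengths),
        "longest_bp": max(lengths, default=0),
    }


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass an open file handle
    # instead, e.g. contig_summary(open(path_to_subset_contigs)).
    example = [">contig1", "ACGTACGT", "ACG", ">contig2", "GGCC"]
    print(contig_summary(example))
```

      Because the functions accept any iterable of lines, the same code works on an open file handle or on a list of strings.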

      The real reads were quality-filtered from 18 GenBank SRA accessions: ERR3288286 through ERR3288295, which are PacBio reads from maize B73, and ERR3288215 through ERR328818, which are paired-end Illumina reads from maize B73.  The resulting local assemblies were aligned to the published B73 genome (GenBank accession GCF_000005005.2).
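
      For readers who want to fetch the PacBio reads, the range ERR3288286 through ERR3288295 can be expanded into individual accession numbers. The Python sketch below is illustrative only; the helper name expand_accessions is hypothetical, and it assumes that the accessions in a range are numbered consecutively with a shared prefix and equal width.

```python
def expand_accessions(first, last):
    """Expand an inclusive accession range such as ERR3288286..ERR3288295."""
    prefix = first.rstrip("0123456789")
    # Both ends must share the prefix and have the same number of digits.
    if last.rstrip("0123456789") != prefix or len(first) != len(last):
        raise ValueError(f"mismatched accessions: {first!r}, {last!r}")
    width = len(first) - len(prefix)
    start, end = int(first[len(prefix):]), int(last[len(prefix):])
    if end < start:
        raise ValueError(f"range runs backwards: {first!r} > {last!r}")
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


if __name__ == "__main__":
    for accession in expand_accessions("ERR3288286", "ERR3288295"):
        print(accession)
```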

      The benchmarking directory includes local assemblies produced with the simulated wheat reads and the real maize Illumina reads against a panel of 10 maize enzymes, each represented by 2-7 accessions from GenBank nr: cellulose synthase (NP_001104955.2, NP_001104956.2, NP_001104959.2, NP_001105236.2, NP_001105574.1, NP_001105672.1, NP_001292792.1), ferredoxin (NP_001104851.1, NP_001136908.1, NP_001150750.1, NP_001336742.1, XP_020394593.1, XP_020405634.1), hexokinase (NP_001123599.1, XP_008672065.1, XP_008674565.1, XP_008675068.1), histone deacetylase (NP_001104901.1, NP_001105402.2, XP_008673398.1, XP_008677775.1, XP_020396306.1), isocitrate dehydrogenase (AQK53344.1, AQK89292.1, AQK97039.1, AQK88693.1, NP_001295424.1, ONM16007.1, ONM58401.1), peptidylprolyl isomerase (AQK62104.1, AQK70996.1, AQL06400.1, ONM03151.1, ONM04876.1, ONM54033.1), phosphoglucoisomerase (NP_001105368.1, XP_008651420.1), phosphoglucomutase (NP_001105405.1, NP_001105703.1, XP_008675355.1, XP_020395615.1), sucrose synthase (XP_008645119.1, XP_008679107.1, XP_020399433.1, XP_023156234.1), and transaminase (NP_001149818.2, NP_001278682.1, XP_008645517.1, XP_008668890.1, XP_008672129.1).

      Runtimes and memory usage were reported by the Slurm utility sacct and saved in the file benchmarking/runtimes/benchmarkrunrecords.txt, which covers the interval from 31 January 2021 through 10 June 2021.  Benchmarking jobs were systematically named with the structure AABBCCCDDD.sh, where AA is one of ra (aTRAM2), rs (SLAG), or rr (SRAssembler); BB is Ta for wheat or Zm for maize; CCC is three letters for the enzyme activity (cel = cellulose synthase, fer = ferredoxin, hex = hexokinase, his = histone deacetylase, iso = isocitrate dehydrogenase, pep = peptidylprolyl isomerase, pgi = phosphoglucoisomerase, pgm = phosphoglucomutase, suc = sucrose synthase, tra = transaminase); and DDD is three letters for the assembler, such as spa for SPAdes or cap for cap3.  Some jobs ran more than once before completing successfully, so a job name can appear on more than one line.  Each line in the file has the sacct fields User, JobID, Jobname%40, partition, state, time, start, end, elapsed, MaxRss, MaxVMSize, MaxPages, TotalCPU, nnodes, ncpus, and nodelist.
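
      The AABBCCCDDD.sh naming scheme can be decoded mechanically when working through the run records. The following Python sketch is one possible way to do so and is not part of the deposited scripts. It assumes the letter case given above (lowercase program, enzyme, and assembler codes; capitalized species code). The function name decode_job_name and the example job name rsTacelspa.sh are hypothetical, built from the scheme rather than copied from the records file. Only the two assembler codes named above are translated; any other three-character code is returned unchanged.

```python
import re

PROGRAMS = {"ra": "aTRAM2", "rs": "SLAG", "rr": "SRAssembler"}
SPECIES = {"Ta": "wheat", "Zm": "maize"}
ENZYMES = {
    "cel": "cellulose synthase",
    "fer": "ferredoxin",
    "hex": "hexokinase",
    "his": "histone deacetylase",
    "iso": "isocitrate dehydrogenase",
    "pep": "peptidylprolyl isomerase",
    "pgi": "phosphoglucoisomerase",
    "pgm": "phosphoglucomutase",
    "suc": "sucrose synthase",
    "tra": "transaminase",
}
# Only these two assembler codes are named in the description.
ASSEMBLERS = {"spa": "SPAdes", "cap": "cap3"}

# AA (program) + BB (species) + CCC (enzyme) + DDD (assembler), optional ".sh".
JOB_NAME = re.compile(r"([a-z]{2})([A-Z][a-z])([a-z]{3})([a-z0-9]{3})(?:\.sh)?")


def decode_job_name(name):
    """Split a benchmarking job name into its four labeled components."""
    match = JOB_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not an AABBCCCDDD job name: {name!r}")
    program, species, enzyme, assembler = match.groups()
    try:
        return {
            "program": PROGRAMS[program],
            "species": SPECIES[species],
            "enzyme": ENZYMES[enzyme],
            # Unlisted assembler codes pass through as-is.
            "assembler": ASSEMBLERS.get(assembler, assembler),
        }
    except KeyError as err:
        raise ValueError(f"unknown code {err.args[0]!r} in {name!r}") from None


if __name__ == "__main__":
    print(decode_job_name("rsTacelspa.sh"))
```

      Raising an error on unknown program, species, or enzyme codes, while passing unknown assembler codes through, reflects the fact that the first three code lists above are complete and the assembler list is given only by example.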

      Subdirectory scripts also includes a much older (pre-2015) set of scripts based on an early version of SLAG, which was called localassembly0215.pl.  These scripts were used to investigate local assembly of multiple alleles of a gene across an invariant central region, which ultimately gave rise to Figure 12 in "SLAG: A Program for Seeded Local Assembly of Genes in Complex Genomes".

Funding

Agricultural Research Service, Award: CRIS: 5020-22000-017-00D

Agricultural Research Service, Award: CRIS: 5020-22000-022-00D