Data from: RAD sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference

Mastretta-Yanes, Alicia, University of East Anglia

Arrigo, Nils, University of Lausanne

Alvarez, Nadir, University of Lausanne

Jorgensen, Tove H., Aarhus University

Piñero, Daniel, National Autonomous University of Mexico

Emerson, Brent C., University of East Anglia

Published Jun 10, 2014 on Dryad. https://doi.org/10.5061/dryad.g52m3

Cite this dataset

Mastretta-Yanes, Alicia et al. (2014). Data from: RAD sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference [Dataset]. Dryad. https://doi.org/10.5061/dryad.g52m3

Abstract

Restriction site-associated DNA sequencing (RADseq) provides researchers with the ability to record genetic polymorphism across thousands of loci for non-model organisms, potentially revolutionising the field of molecular ecology. However, as with other genotyping methods, RADseq is prone to a number of sources of error that may have consequential effects for population genetic inferences, and these have received only limited attention in terms of the estimation and reporting of genotyping error rates. Here we use individual sample replicates, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome. We then use sample replicates to (1) optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci; and (2) quantify error rates for loci, alleles and SNPs. As an empirical example, we use a double digest RAD dataset of a non-model plant species, Berberis alpina, collected from high altitude mountains in Mexico.

Usage notes

0.Demultiplexing and dropbase

Contains the custom Perl scripts of the pipeline used to demultiplex raw reads and drop a base position that was causing a lane effect.

Demultiplexing_and_dropbase.zip
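
The archived pipeline performs the base-dropping step with custom Perl scripts. As a rough illustration of what such a step involves, the Python sketch below removes one base position, together with its quality score, from every read of a FASTQ file. It is not the authors' implementation: the function names and the position used in the example are hypothetical, and reads too short to contain the position are simply left unchanged, which is an assumption made for this sketch.

```python
def drop_base(record, position):
    """Remove the base and its quality score at a 0-based position.

    `record` is a 4-tuple (header, sequence, separator, quality)
    holding one FASTQ record. The position is a hypothetical
    parameter; the real pipeline's value is not stated here.
    """
    header, seq, sep, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if not 0 <= position < len(seq):
        # Assumption: reads that do not reach the position stay as they are.
        return record
    return (header,
            seq[:position] + seq[position + 1:],
            sep,
            qual[:position] + qual[position + 1:])


def drop_base_from_fastq(lines, position):
    """Yield 4-tuples for each complete FASTQ record in `lines`."""
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        yield drop_base(tuple(stripped[i:i + 4]), position)


if __name__ == "__main__":
    example = ["@read1", "ACGTAC", "+", "ABCDEF"]
    for rec in drop_base_from_fastq(example, 2):
        print("\n".join(rec))
```

Keeping the sequence and quality strings in step matters here: dropping the base without its quality character would leave a malformed record that downstream tools may reject.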

1.Running Stacks

Contains the scripts and part of the output data from running *Stacks* on the demultiplexed, lane-effect-corrected data (see section **Demultiplexing and dropbase/0.1DropBase** of this repository), which are available at the Sequence Read Archive (SRA), accession SRP035472. The scripts subsequently produce SNP (*.SNP) and coverage (*.COV) tsv matrices and run the *populations* program of Stacks with the selected loci of the downstream analyses. The Stacks parameter values correspond to the experiments defined as *1) Exploratory analysis of Stacks assembly key parameters and SNP calling model using replicates* and *2) Effect of using different parameters on the output information content and on the detection of genetic structuring*.

1stacks.zip

2.R analyses of Stacks outputs

Contains the R scripts, input and output data, and metainformation used to perform the analyses described in the *General processing of Stacks outputs* and *Error rates* sections of the manuscript.

2R.zip
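
The error-rate analyses rest on the idea stated in the abstract: two replicates of the same sample should yield identical genotypes, so any disagreement can be counted as error. The Python sketch below illustrates that idea for a single replicate pair. It is only an assumption-laden example, not the R code in 2R.zip: the function name, the dictionary input format and the exact denominators are hypothetical choices made here, and the definitions used in the manuscript may differ.

```python
def replicate_error_rates(rep_a, rep_b, missing=None):
    """Compare two replicate genotype dicts of the form {locus: genotype}.

    Returns (locus_error_rate, snp_error_rate), defined for this sketch as:
      locus_error_rate: loci genotyped in only one replicate,
                        divided by loci genotyped in at least one
      snp_error_rate:   mismatching genotypes,
                        divided by loci genotyped in both replicates
    """
    called_a = {k for k, v in rep_a.items() if v != missing}
    called_b = {k for k, v in rep_b.items() if v != missing}
    either = called_a | called_b
    both = called_a & called_b
    locus_err = len(either - both) / len(either) if either else float("nan")
    mismatches = sum(rep_a[k] != rep_b[k] for k in both)
    snp_err = mismatches / len(both) if both else float("nan")
    return locus_err, snp_err


if __name__ == "__main__":
    # Hypothetical genotypes for one sample and its replicate.
    sample = {"loc1": "A/A", "loc2": "A/G", "loc3": None, "loc4": "C/C"}
    replicate = {"loc1": "A/A", "loc2": "G/G", "loc3": "T/T", "loc4": "C/C"}
    locus_err, snp_err = replicate_error_rates(sample, replicate)
    print(f"locus error rate: {locus_err:.3f}, SNP error rate: {snp_err:.3f}")
```

In this toy pair, one of four loci is missing from one replicate and one of the three shared loci disagrees, so the two rates come out as 0.25 and about 0.333.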

Data (SRA SRP035472)

The post-demultiplexing, quality-filtered data (i.e. the output of "Demultiplexing and dropbase" of this repository) are available at the Sequence Read Archive (SRA), accession SRP035472. Those files were used to run Stacks. The resulting coverage and SNP matrices, produced by running the Stacks script *export_sql.pl*, are provided along with the subsequent files used to perform the analyses in R; see the contents of the [2R](./2R) directory. For the rest of the pipeline, the scripts used and, when available, an output summary are presented inside each directory.

Sampling localities

Contains the geographic information of the sampling sites. See tab "B.alpina B.moranensis" for Berberis alpina and B. moranensis populations and tab "Berberis outgroups" for B. trifolia and B. pallida.

Sampling_localities_Berberis.xls

Figures

Contains the R markdown files that were used to generate the figures from the main text (Figures.Rmd) and the figures from the supporting information (SupportingInfomation_*). These files require the data and scripts from the *2.R analyses of Stacks outputs* section of this repository.

Lab protocol and sequencing report

Contains: i) a summary of the ddRAD labwork, description of final libraries and sequencing output, ii) modified ddRAD sequencing protocol and iii) sequencing quality control reports for each lane.

LabProtocol_and_sequencing_report.zip

Location

Cerro Zamorano

Ajusco

Sierra Madre Occidental

La Malinche

Cerro Tlaloc

Nevado de Toluca

Faja Volcanica Transmexicana

Iztaccihuatl

Trans Mexican Volcanic Belt

Mexico

Cerro San Andres

Cofre de Perote

Transmexican Volcanic Belt