Data from: A new resource for the development of SSR markers: millions of loci from a thousand plant transcriptomes

Hodel, Richard G.; Gitzendanner, Matthew A.; Germain-Aubrey, Charlotte C.; Liu, Xiaoxian; Crowl, Andrew A.; Sun, Miao; Landis, Jacob B.; Segovia-Salcedo, Maria Claudia; Douglas, Norman A.; Chen, Shichao; Soltis, Douglas E.; Soltis, Pamela S.; Hodel, Richard G. J.

Published May 05, 2017 on Dryad. https://doi.org/10.5061/dryad.rb7h0

Data files

May 05, 2017 version files (2.46 GB)

G_R_0764-0510-HitTable.csv (61.39 MB)
G_R_0803-0253-HitTable.csv (58.30 MB)
GCF_000004515.3_V1.1_genomic.gff (251.86 MB)
glycine_max_454_raw_all.fasta (114.40 MB)
LocusInfo.zip (352.83 MB)
README.txt (5.65 KB)
Scaffolds.zip (1.62 GB)

Abstract

Premise of the study: The One Thousand Plant Transcriptomes Project (1KP, 1000+ assembled plant transcriptomes) provides an enormous resource for developing microsatellite loci across the plant tree of life. We developed loci from these transcriptomes and tested their utility. Methods and Results: Using software packages and custom scripts, we identified microsatellite loci in 1KP transcriptomes. We assessed the potential for cross-amplification and whether loci were biased toward exons, as compared to markers derived from genomic DNA. We characterized over 5.7 million simple sequence repeat (SSR) loci from 1334 plant transcriptomes. Eighteen percent of loci substantially overlapped with open reading frames (ORFs), and electronic PCR revealed that over half the loci would amplify successfully in conspecific taxa. Transcriptomic SSRs were approximately three times more likely to map to translated regions than genomic SSRs. Conclusions: We believe microsatellites still have a place in the genomic age—they remain effective and cost-efficient markers. The loci presented here are a valuable resource for researchers.

README

ReadMe file describing the data package

LocusInfo

The LocusInfo.zip file is a zip archive containing a directory with the microsatellite loci for each species. The LocusInfo directory contains the microsatellite locus information for the 1096 species for which loci were developed. The loci were developed from the 1KP project (http://onekp.com) transcriptomes. For some species, there were multiple transcriptomes, either from multiple collections or multiple tissues; these have been combined into a single file for each species.

Scaffolds

The Scaffolds.zip file is a zip archive containing a directory with the scaffolds corresponding to the microsatellite loci. The Scaffolds directory contains the sequences of the scaffolds on which microsatellites were identified. The files in the directory are bz2-compressed. Once decompressed, a file is a fasta file with all of the scaffolds corresponding to the loci in the LocusInfo directory.
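As an illustration of working with the scaffold files, the following is a minimal Python sketch that reads a bz2-compressed fasta file into memory without decompressing it on disk. It is not part of the data package; the function names and the example path are hypothetical, and it assumes each file is ordinary fasta text compressed with bz2.

```python
import bz2
from typing import Dict, List, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Parse fasta text into a {record_id: sequence} dictionary."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def read_bz2_fasta(path: str) -> Dict[str, str]:
    """Open a bz2-compressed fasta file in text mode and parse it."""
    with bz2.open(path, "rt") as handle:
        return read_fasta(handle)


# Hypothetical usage; the real file names inside Scaffolds.zip may differ:
# scaffolds = read_bz2_fasta("Scaffolds/some_species.fasta.bz2")
```

Holding every scaffold in a dictionary is convenient for looking up the scaffold that carries a given locus, but for very large files a streaming approach may be preferable.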

GCF_000004515.3_V1.1_genomic

The script "BLAST_to_Coding_SSR.R" (https://github.com/soltislab/transcriptome_microsats/blob/master/BLAST_to_Coding_SSR.R) takes a .gff file (the annotated Glycine max genome from NCBI) and a BLAST report of SSR loci searched against the Glycine max genome, and prepares two files for use in a subsequent script (Coding_SSR.py -- https://github.com/soltislab/transcriptome_microsats/blob/master/Coding_SSR.py), which determines which loci are in translated regions of the genome (i.e., regions that are annotated as "CDS"). One output file contains the SSR loci identified from the BLAST search, with some unnecessary columns and duplicates removed; the other contains the regions of the Glycine max genome that are annotated as "CDS". GCF_000004515.3_V1.1_genomic.gff is the .gff file needed for the script.
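To make the idea of checking loci against "CDS" annotations concrete, here is a minimal Python sketch. It is not the authors' BLAST_to_Coding_SSR.R or Coding_SSR.py; the function names are hypothetical, and it assumes only the standard tab-delimited GFF layout (sequence ID in column 1, feature type in column 3, start and end in columns 4 and 5) and that the coordinates of a BLAST hit on the genome are already known.

```python
from typing import Iterable, List, Tuple

# (sequence ID, start, end), with 1-based inclusive coordinates as in GFF.
Interval = Tuple[str, int, int]


def cds_intervals(gff_lines: Iterable[str]) -> List[Interval]:
    """Collect the coordinates of every feature annotated as CDS."""
    intervals: List[Interval] = []
    for line in gff_lines:
        # Skip comment/directive lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF feature line
        seqid, _source, feature, start, end = fields[:5]
        if feature == "CDS":
            intervals.append((seqid, int(start), int(end)))
    return intervals


def hit_in_cds(seqid: str, start: int, end: int,
               intervals: List[Interval]) -> bool:
    """Return True if the hit overlaps any CDS interval on the same sequence."""
    # Minus-strand BLAST hits may be reported with start > end, so normalize.
    lo, hi = sorted((start, end))
    return any(
        s == seqid and lo <= c_end and c_start <= hi
        for s, c_start, c_end in intervals
    )
```

The linear scan in hit_in_cds keeps the sketch short. With a genome-scale annotation and many hits, sorting the intervals per sequence or using an interval tree would be a more practical choice.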

G_R_0764-0510-HitTable

G_R_0803-0253-HitTable

glycine_max_454_raw_all

A fasta file is necessary to run pal_finder. The file included here is a fasta file of raw 454 genomic reads from Glycine max (NCBI Trace Archive, TI 1732557604-1733276192; Swaminathan et al., 2007).