Premise of the study: The One Thousand Plant Transcriptomes Project (1KP, 1000+ assembled plant transcriptomes) provides an enormous resource for developing microsatellite loci across the plant tree of life. We developed loci from these transcriptomes and tested their utility. Methods and Results: Using software packages and custom scripts, we identified microsatellite loci in 1KP transcriptomes. We assessed the potential for cross-amplification and whether loci were biased toward exons, as compared to markers derived from genomic DNA. We characterized over 5.7 million simple sequence repeat (SSR) loci from 1334 plant transcriptomes. Eighteen percent of loci substantially overlapped with open reading frames (ORFs), and electronic PCR revealed that over half the loci would amplify successfully in conspecific taxa. Transcriptomic SSRs were approximately three times more likely to map to translated regions than genomic SSRs. Conclusions: We believe microsatellites still have a place in the genomic age—they remain effective and cost-efficient markers. The loci presented here are a valuable resource for researchers.
README
ReadMe file describing the data package
LocusInfo
The LocusInfo.zip file is a zip archive containing a directory with the microsatellite loci for each species.
The LocusInfo directory contains the microsatellite locus information for the 1096 species for which loci were developed. The loci were developed from the 1KP project (http://onekp.com) transcriptomes. For some species there were multiple transcriptomes, from either multiple collections or multiple tissues; these have been combined into a single file for each species.
Scaffolds
The Scaffolds.zip file is a zip archive containing a directory with the scaffolds corresponding to the microsatellite loci.
The Scaffolds directory contains the sequences of the scaffolds on which microsatellites were identified. The files are bz2-compressed. Each decompressed file is a fasta file with all of the scaffolds corresponding to the loci in the LocusInfo directory.
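Because the scaffold files are bz2-compressed fasta files, they can be read without decompressing them to disk first. The following is a minimal Python sketch, not part of the data package or the authors' scripts; the function name read_fasta_bz2 and the file name in the usage comment are hypothetical.

```python
import bz2


def read_fasta_bz2(path):
    """Yield (header, sequence) pairs from a bz2-compressed fasta file."""
    header, chunks = None, []
    # "rt" opens the compressed file in text mode, decompressing on the fly.
    with bz2.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Usage (hypothetical file name; substitute a file from the Scaffolds directory):
# for name, seq in read_fasta_bz2("Scaffolds/some_species.fasta.bz2"):
#     print(name, len(seq))
```

Reading record by record keeps memory use low, which may matter for species with many scaffolds.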
GCF_000004515.3_V1.1_genomic
The script "BLAST_to_Coding_SSR.R" (https://github.com/soltislab/transcriptome_microsats/blob/master/BLAST_to_Coding_SSR.R) uses a .gff file (the annotated Glycine max genome from NCBI) and a BLAST report for SSR loci searched against the Glycine max genome to prepare two files. These files are used by a subsequent script (Coding_SSR.py; https://github.com/soltislab/transcriptome_microsats/blob/master/Coding_SSR.py) to determine which loci are in translated regions of the genome (i.e., regions that are annotated as "CDS").
The script outputs two files: one contains the SSR loci identified from the BLAST search, with some unnecessary columns and duplicates removed, and the other contains the regions of the Glycine max genome that are annotated as "CDS".
This is the .gff file needed to run the script, the first of the three input files.
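To illustrate what "regions that are annotated as CDS" means in a .gff file, the sketch below collects CDS intervals from GFF lines. It is an illustrative Python example, not the authors' BLAST_to_Coding_SSR.R script; the function name cds_intervals is hypothetical, and the only assumption about the input is the standard GFF layout of nine tab-separated columns with 1-based, inclusive coordinates.

```python
def cds_intervals(gff_lines):
    """Return (seqid, start, end) for every feature whose type is CDS."""
    intervals = []
    for line in gff_lines:
        # Skip comment/directive lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed feature line
        # Columns 1, 3, 4, 5 are seqid, feature type, start, end.
        seqid, _source, feature_type, start, end = fields[:5]
        if feature_type == "CDS":
            intervals.append((seqid, int(start), int(end)))
    return intervals


# Usage (the file name is an assumption based on the entry name above):
# with open("GCF_000004515.3_V1.1_genomic.gff") as handle:
#     cds = cds_intervals(handle)
```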
G_R_0764-0510-HitTable
The script "BLAST_to_Coding_SSR.R" (https://github.com/soltislab/transcriptome_microsats/blob/master/BLAST_to_Coding_SSR.R) uses a .gff file (the annotated Glycine max genome from NCBI) and a BLAST report for SSR loci searched against the Glycine max genome to prepare two files. These files are used by a subsequent script (Coding_SSR.py; https://github.com/soltislab/transcriptome_microsats/blob/master/Coding_SSR.py) to determine which loci are in translated regions of the genome (i.e., regions that are annotated as "CDS").
The script outputs two files: one contains the SSR loci identified from the BLAST search, with some unnecessary columns and duplicates removed, and the other contains the regions of the Glycine max genome that are annotated as "CDS".
This is the second of three files needed to run the script.
G_R_0803-0253-HitTable
The script "BLAST_to_Coding_SSR.R" (https://github.com/soltislab/transcriptome_microsats/blob/master/BLAST_to_Coding_SSR.R) uses a .gff file (the annotated Glycine max genome from NCBI) and a BLAST report for SSR loci searched against the Glycine max genome to prepare two files. These files are used by a subsequent script (Coding_SSR.py; https://github.com/soltislab/transcriptome_microsats/blob/master/Coding_SSR.py) to determine which loci are in translated regions of the genome (i.e., regions that are annotated as "CDS").
The script outputs two files: one contains the SSR loci identified from the BLAST search, with some unnecessary columns and duplicates removed, and the other contains the regions of the Glycine max genome that are annotated as "CDS".
This is the third of three files needed to run the script.
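The comparison described above, deciding whether a BLAST hit for an SSR locus falls in a CDS region, comes down to an interval-overlap check. The Python sketch below is illustrative only and is not the authors' Coding_SSR.py. It assumes the hit tables use the standard 12-column tab-separated BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score) and that subject identifiers match the sequence identifiers in the .gff file. The function names are hypothetical.

```python
def hit_intervals(hit_lines):
    """Return sorted, de-duplicated (query, subject, start, end) tuples."""
    hits = set()
    for line in hit_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column hit line
        query, subject = fields[0], fields[1]
        s_start, s_end = int(fields[8]), int(fields[9])
        # Minus-strand hits report start > end, so normalise the order.
        hits.add((query, subject, min(s_start, s_end), max(s_start, s_end)))
    return sorted(hits)


def overlaps_cds(hit, cds):
    """True if the hit shares at least one base with any CDS interval.

    cds is a list of (seqid, start, end) tuples with inclusive coordinates.
    """
    _query, subject, start, end = hit
    return any(
        seqid == subject and start <= c_end and c_start <= end
        for seqid, c_start, c_end in cds
    )


# Usage (hypothetical file name and CDS list):
# with open("G_R_0764-0510-HitTable.txt") as handle:
#     hits = hit_intervals(handle)
# coding = [h for h in hits if overlaps_cds(h, cds)]
```

The linear scan over all CDS intervals is kept for clarity; for a whole-genome annotation, grouping intervals by sequence and sorting them would be considerably faster.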
glycine_max_454_raw_all
pal_finder requires a fasta file as input. The file included here is a fasta file of raw 454 genomic reads from Glycine max (NCBI Trace Archive, TI 1732557604-1733276192; Swaminathan et al., 2007).