Data from: high repeat content in the genomes of sparrows: the importance of genome assembly completeness for transposable element discovery
Data files
Dec 13, 2023 version files 24.47 MB
- README.md
- supplemental_files_code.zip
Abstract
Transposable elements (TE) play critical roles in shaping genome evolution. However, the highly repetitive sequence content of TEs is a major source of assembly gaps. This makes it difficult to decipher the impact of these elements on the dynamics of genome evolution. The increased capacity of long-read sequencing technologies to span highly repetitive regions of the genome should provide novel insights into patterns of TE diversity. Here we report the generation of highly contiguous reference genomes using PacBio long read and Omni-C technologies for three species of sparrows in the family Passerellidae. To assess the influence of sequencing technology on TE annotation, we compared these assemblies to three chromosome-level sparrow assemblies recently generated by the Vertebrate Genomes Project and nine other sparrow species generated using a variety of short- and long-read technologies. All long-read based assemblies were longer in length (range: 1.12-1.41 Gb) than short-read assemblies (0.91-1.08 Gb). Assembly length was strongly correlated with the amount of repeat content, with longer genomes showing much higher levels of repeat content than typically reported for the avian order Passeriformes. Repeat content for the Bell's sparrow (31.2% of genome) was the highest level reported to date for a songbird genome assembly and was more in line with woodpecker (order Piciformes) genomes. CR1 LINE elements retained from an expansion that occurred 25-30 million years ago were the most abundant TEs in the song sparrow genome. Although the other five sparrow species also exhibit evidence for a spike in CR1 LINE activity at 25-30 million years ago, LTR elements stemming from more recent expansions were the most abundant elements in these species. LTRs were uniquely abundant in the Bell's sparrow genome deriving from two recent peaks of activity. 
Higher levels of repeat content (79.2-93.7%) were found on the W chromosome relative to the Z (20.7-26.5%) or autosomes (16.1-30.9%). These patterns support a dynamic model of transposable element expansion and contraction underpinning the seemingly constrained and small sized genomes of birds. Our work highlights how the resolution of difficult-to-assemble regions of the genome with new sequencing technologies promises to transform our understanding of avian genome evolution.
README: Data from: high repeat content in the genomes of sparrows: the importance of genome assembly completeness for transposable element discovery
https://doi.org/10.5061/dryad.cjsxksncs
Supplementary datasets and code for the manuscript: High repeat content in the genomes of sparrows: the importance of genome assembly completeness for transposable element discovery.
Description of the data and file structure
Five folders containing data and code for the manuscript.
(1) GenomeSizeVariation
* GenomeAssemblySize.csv: Genome assembly length versus c-value estimates of genome size.
column names
Species: Scientific name of Passerellidae sparrow
corrCvalue: c-value estimate of genome length, converted using 1 pg = 0.978 Gb.
assembly: genome size estimate based on assembly length.
* GenomeSize_passerellidae.csv: Genome size dataset for members of the sparrow family Passerellidae.
column names
Species: Scientific name of Passerellidae sparrow
C_value: Genome size estimate in picograms (pg)
Adjusted_genome_size: c-value estimate of genome length in Gigabases (Gb) adjusted using 1 pg = 0.978 Gb.
CellType: cell type used for c-value measurements
method: method used to generate c-value estimates
source: publication from which data were obtained
* TE_evolution_plot.py: Python code to make histogram of c-value genome size estimates for Passerellidae sparrows.
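The c-value conversion used for the corrCvalue and Adjusted_genome_size columns can be sketched in Python as follows. This is a minimal illustration and not the code in TE_evolution_plot.py: the function name and the example input are hypothetical, and only the factor 1 pg = 0.978 Gb comes from the column descriptions above.

```python
PG_TO_GB = 0.978  # conversion factor stated in this README: 1 pg = 0.978 Gb


def cvalue_to_gb(c_value_pg: float) -> float:
    """Convert a c-value in picograms (pg) to genome length in gigabases (Gb)."""
    if c_value_pg < 0:
        raise ValueError("c-value must be non-negative")
    return c_value_pg * PG_TO_GB


if __name__ == "__main__":
    # Hypothetical input chosen only to illustrate the arithmetic.
    print(round(cvalue_to_gb(1.5), 3))
```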
(2) PhylogenyConstruction
* MCMCtrees.sparrows.ctl: parameter file for MCMCtree analysis
* Sparrow_UCE_Tree.txt: Newick format file for input to MCMCtree with calibrated node
* RAxML_input_parameters_CIPRES.pdf: Input parameters for running RAxML on the CIPRES computing cluster
* sparrow-cleaned-95p.phylip.txt: phylip formatted UCEs for all sparrows used to construct phylogenies.
(3) RepeatAnnotation
* Passerellidae.final_TElibrary.fa: Final library of consensus transposable element sequences, generated from de novo annotation in RepeatModeler2, manual curation, and merger with other avian repeats.
* RepeatContent_Sparrows.csv: summary data of RepeatMasker results for each sparrow species.
column names (NA signifies missing data)
Species: Species name using 4 letter alpha codes. (white-crowned sparrow: WCSP; California towhee: CALT;
swamp sparrow: SWSP; Nelson's sparrow: NESP; saltmarsh sparrow short-read: SALS_SR; saltmarsh sparrow long-read: SALS_VG;
Savannah sparrow: SAVS; Bell's sparrow: BESP; song sparrow: SOSP; white-throated sparrow: WTSP; dark-eyed junco: DEJU;
chipping sparrow: CHSP; grasshopper sparrow: GRSP)
LINE: percent LINE elements present in genome
SINE: percent SINE elements present in genome
LTR: percent LTR elements present in genome
DNA: percent DNA elements present in genome
RC: percent RC (rolling circle) elements present in genome
Unclassified: percent Unclassified elements present in genome
Total: Percent of entire genome spanned by transposable elements
Genome_length_Gb: Length of genome assembly in Gigabases (Gb)
contig_N50_Mb: The contig N50 of the assembly in Megabases (Mb)
ContigLoHigh: Whether the contig N50 is above or below the 1 Mb threshold
corrCvalue: c-value estimate of genome length, converted using 1 pg = 0.978 Gb.
Missing_length: Amount of missing data based on c-value minus assembly length
Per_missing: percent of missing DNA from genome assembly.
* Repeat_assemblyPlot.R: R code for generating scatter plots and regression analyses of genome assembly length versus repeat content or DNA missingness and % repeat content.
* SOSP_genome_plot.R: R code for generating barplots comparing total number of TE elements and amount of genome spanned by TEs across different genome assemblies for song sparrow and saltmarsh sparrow.
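The Missing_length and Per_missing columns in RepeatContent_Sparrows.csv follow from simple arithmetic on corrCvalue and Genome_length_Gb. The Python sketch below shows one way to reproduce them. It assumes the percentage is taken relative to the c-value estimate, which the column description does not state explicitly; the function name and example values are hypothetical.

```python
from __future__ import annotations


def missing_dna(corr_cvalue_gb: float, assembly_gb: float) -> tuple[float, float]:
    """Return (missing length in Gb, percent of genome missing).

    Missing length is the c-value estimate minus the assembly length.
    Assumption: the percentage uses the c-value estimate as denominator.
    """
    if corr_cvalue_gb <= 0:
        raise ValueError("c-value estimate must be positive")
    missing = corr_cvalue_gb - assembly_gb
    percent = 100.0 * missing / corr_cvalue_gb
    return missing, percent


if __name__ == "__main__":
    # Hypothetical values, not taken from the dataset.
    length_gb, percent = missing_dna(corr_cvalue_gb=1.25, assembly_gb=1.0)
    print(f"missing: {length_gb:.2f} Gb ({percent:.1f}%)")
```

A negative result would indicate an assembly longer than the c-value estimate, so the sketch deliberately does not clip at zero.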
(4) RepeatMasker_results
Folder with .tbl output files from RepeatMasker for each species and sex chromosomes for some species.
(5) TE_landscapes
Divergence landscapes for each species and R code for generating TE landscape figures. These landscapes show the distribution of divergence from the consensus sequence for each TE family. CSV files exist for each sparrow species and chromosome and serve as input to the R code for generating the TE landscape figures. Each CSV file has estimates of percent divergence from consensus for each TE class in 1% bins from 0-70% divergence.
We summed across different TE families within each of the four classes to get, for example, the total length of LTR sequence that is x% divergent from its LTR consensus sequence.
Column names for CSV files (sex chromosome files do not have the Myrs column)
Div: The percent divergence from the consensus TE sequence. Estimated as Kimura-2-parameter distance with CpG sites excluded.
Myrs: Estimate of divergence time from the consensus in millions of years (Myrs).
length: length of genome spanned by TEs that are within each divergence from consensus bin
element: element class, includes LINE, DNA, LTR, and SINE elements
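As a rough illustration of how these landscape files can be read, the Python sketch below totals the length column per element class using only the standard library. The sample rows are made up and follow the sex chromosome layout (no Myrs column); the function name is hypothetical, and the R scripts in this folder remain the reference for how the figures were actually generated.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the column layout described above; values are invented.
SAMPLE_CSV = """Div,length,element
0,1000,LTR
1,2500,LTR
0,400,LINE
1,900,LINE
"""


def total_length_by_class(handle):
    """Sum the 'length' column across divergence bins for each element class."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        totals[row["element"]] += float(row["length"])
    return dict(totals)


if __name__ == "__main__":
    totals = total_length_by_class(io.StringIO(SAMPLE_CSV))
    for element, length in sorted(totals.items()):
        print(element, length)
```

To run it on a real file, pass an open file handle in place of the `io.StringIO` object.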
Sharing/Access information
Please contact Phred Benham (phbenham@gmail.com) if there are any questions or issues about the code and data shared here.