Data from: Mating system variation and gene expression in the male reproductive tract of Peromyscus mice
Data files
Jun 05, 2024 version files (1.10 GB):
- P_boylii_transcriptome_Voss_Nachman_2024.fa (398.80 MB)
- P_californicus_transcriptome_Voss_Nachman_2024.fa (351.46 MB)
- P_maniculatus_transcriptome_Voss_Nachman_2024.fa (353.54 MB)
- README.md (11.30 KB)
Oct 03, 2024 version files (1.10 GB):
- Corrected_Supp_Table1_Voss_Nachman_2024.xlsx (10.77 KB)
- P_boylii_transcriptome_Voss_Nachman_2024.fa (398.80 MB)
- P_californicus_transcriptome_Voss_Nachman_2024.fa (351.46 MB)
- P_maniculatus_transcriptome_Voss_Nachman_2024.fa (353.54 MB)
- README.md (11.86 KB)
Abstract
Genes involved in reproduction often evolve rapidly at the sequence level due to postcopulatory sexual selection (PCSS) driven by male-male competition and male-female sexual conflict, but the impact of PCSS on gene expression has been under-explored. Further, though multiple tissues contribute to male reproductive success, most studies have focused on the testes. To explore the influence of mating system variation on reproductive tract gene expression in natural populations, we captured adult males from monogamous Peromyscus californicus and polygynandrous P. boylii and P. maniculatus. We generated RNAseq libraries, quantified gene expression in the testis, seminal vesicle, epididymis, and liver, and identified 3,627 mating system-associated differentially expressed genes (MS-DEGs), where expression shifted in the same direction in P. maniculatus and P. boylii relative to P. californicus. Gene expression variation was most strongly associated with mating behavior in the seminal vesicles, where 89% of differentially expressed genes were MS-DEGs, including key seminal fluid proteins Svs2 and Pate4. We also used published rodent genomes to test for positive and relaxed selection on Peromyscus-expressed genes. Though we did not observe more overlap than expected by chance between MS-DEGs and positively selected genes, 203 MS-DEGs showed evidence of positive selection. Fourteen reproductive genes were under tree-wide positive selection but convergent relaxed selection in P. californicus and Microtus ochrogaster, a distantly related monogamous species. Changes in transcript abundance and gene sequence evolution in association with mating behavior suggest that male mice may respond to sexual selection intensity by altering aspects of sperm motility, sperm-egg binding, and copulatory plug formation.
https://doi.org/10.5061/dryad.b5mkkwhmw
Description of the data and file structure
This dataset includes three de novo assembled transcriptomes for Peromyscus maniculatus (species abbreviation: PEMA), Peromyscus boylii (PEBO), and Peromyscus californicus (PECA).
**File names:**
Peromyscus boylii: P_boylii_transcriptome_Voss_Nachman_2024.fa
Peromyscus californicus: P_californicus_transcriptome_Voss_Nachman_2024.fa
Peromyscus maniculatus: P_maniculatus_transcriptome_Voss_Nachman_2024.fa
Corrected_Supp_Table1_Voss_Nachman_2024.xlsx: list of samples included in study along with their Museum of Vertebrate Zoology (MVZ) catalog IDs, which can be searched for metadata and tissue loan and specimen availability on the MVZ Arctos Museum Database.
Experiment and file overview:
This dataset contains transcriptomes that were individually assembled from RNAseq data for three species of Peromyscus mice. The goal of this study was to compare reproductive gene expression among male mice from species that vary in mating behavior and strength of postcopulatory sexual selection. We focused on four tissues: testis, seminal vesicle, epididymis, and liver. For each species, RNAseq gene expression data from one sample for each tissue was used to identify transcripts and assemble the resulting transcriptome.
Each file is in fasta format and contains the sequences of all transcripts that were annotated to a known open reading frame from the Peromyscus maniculatus reference genome (Lassance and Hoekstra 2020). After transcriptome assembly, we used transcriptomes as a reference point from which to begin differential gene expression analyses. Raw read data for all samples and tissues where gene expression was quantified is available on NCBI under BioProject PRJNA1068126.
There is also an Excel file that contains sampling localities, species identification, and museum catalog accession numbers. All specimens are being accessioned into the Museum of Vertebrate Zoology collection at the University of California, Berkeley.
Peromyscus maniculatus: polygynandrous (loosely affiliative multimale-multifemale)
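The transcriptome files are plain FASTA, so they can be inspected without specialized tools. The following is a minimal sketch, not part of the authors' pipeline: the function names `read_fasta` and `summarize` and the toy records are illustrative assumptions.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap across several lines
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Return (number of records, total bases, longest record length)."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    return len(lengths), sum(lengths), max(lengths, default=0)


# Toy input with made-up sequences, standing in for one of the .fa files.
toy = io.StringIO(">tx1 some description\nACGT\nAC\n>tx2\nGGG\n")
print(summarize(toy))

# For a real file, pass an open handle instead, for example:
# with open("P_boylii_transcriptome_Voss_Nachman_2024.fa") as fh:
#     print(summarize(fh))
```

The reader streams one record at a time, so it should cope with the roughly 350 to 400 MB files without loading them fully into memory.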
Peromyscus boylii: polygynandrous
Peromyscus californicus: monogamous
Study questions: Are different patterns of gene expression associated with different mating systems? Which male reproductive tissues show the greatest changes in gene expression in different mating systems? Which specific genes are expressed at very high levels in polygynandrous species and at low levels in monogamous species?
Specimen collection information
Gene expression data was generated from tissues sampled from wild-caught mice from Hastings Natural History Reserve (Carmel Valley, CA) and the Field Station for the Study of Behavior, Ecology, and Reproduction (Berkeley, CA). RNAseq data was generated from the testis, liver, epididymis, and seminal vesicle using the KAPA HyperPrep RNAseq library preparation kit and sequenced on an Illumina NovaSeq S4 at the Vincent J. Coates Genome Sequencing Laboratory at UC Berkeley. Specimens were accessioned into the mammal collection at the Museum of Vertebrate Zoology, and sample metadata is available on the Arctos museum database.
Transcriptome Assembly Methods
Transcriptomes were assembled from filtered, cleaned RNAseq data using the Trinity assembler (Haas et al., 2013). Transcripts with more than 95% similarity were collapsed (CD-HIT: Li and Godzik 2006; Fu et al., 2012), and chimeric and misassembled sequences were removed (rnaQUAST: Bushmanova et al., 2016) before transcripts were annotated using Trinotate (Bryant et al., 2017). Completeness was assessed with BUSCO (Euarchontoglires odb10 score: 87.8%; Simão et al., 2015). Transcriptomes were used to quantify transcript abundance across tissues and species in downstream differential expression analyses.
All software versions used in this pipeline are given below.
Transcript Naming Conventions
Transcripts are named according to a combination of annotated gene identifiers and Trinity-generated sequence IDs. The sequence ID begins with the Ensembl P. maniculatus gene code, followed by “_i#” to differentiate between unique isoforms of each annotated gene. The Trinity de novo assembler identifiers come second in the sequence ID and are preserved to retain the initial transcriptome assembly file structure, which includes more complete gene and isoform information for each species. Lastly, additional annotations from the P. californicus and M. musculus reference CDS are appended. These genomes were used to support and assess the accuracy of the P. maniculatus annotations.
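As an illustration of the naming convention, the sketch below pulls the leading gene code and isoform number out of a sequence ID and groups isoforms by gene. It assumes only what is stated above, that the ID starts with a gene code followed by `_i` and a number. The example IDs, the delimiters after the isoform number, and the function names are hypothetical, so check them against the actual FASTA headers before relying on this.

```python
import re
from collections import defaultdict

# Assumption: the ID begins with "<gene code>_i<number>"; everything after
# that (Trinity ID, extra annotations) is ignored here.
LEADING_ID = re.compile(r"^(?P<gene>\S+?)_i(?P<isoform>\d+)")


def parse_leading_id(seq_id):
    """Return (gene_code, isoform_number), or None if the ID does not match."""
    match = LEADING_ID.match(seq_id)
    if match is None:
        return None
    return match.group("gene"), int(match.group("isoform"))


def group_isoforms(seq_ids):
    """Map each gene code to the sorted isoform numbers seen for it."""
    genes = defaultdict(list)
    for seq_id in seq_ids:
        parsed = parse_leading_id(seq_id)
        if parsed is not None:
            gene, isoform = parsed
            genes[gene].append(isoform)
    return {gene: sorted(isoforms) for gene, isoforms in genes.items()}


# Hypothetical IDs for illustration only; real headers may differ.
example_ids = [
    "GENEA_i1_TRINITY_DN1_c0_g1_i3",
    "GENEA_i2_TRINITY_DN1_c0_g1_i4",
    "GENEB_i1_TRINITY_DN7_c0_g2_i1",
]
print(group_isoforms(example_ids))
```

The pattern stops at the first `_i#`, so the isoform suffix inside the Trinity identifier is not mistaken for the annotated isoform number. This would fail if a gene code itself contained `_i` followed by a digit.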
All reference genome identifiers are listed below.
Differential Gene Expression Analyses
After the transcriptomes were assembled, they were used as reference sequences to quantify gene expression from five individuals per species for each of four tissues: testis, liver, epididymis, and seminal vesicle. We used Salmon (v.1.10.0, Patro et al., 2017) to map cleaned fastq RNAseq reads to the reference transcriptome and DESeq2 (v.1.40.1, Love et al., 2014) to identify differential expression between species for each tissue. Significantly differentially expressed genes with reproductive gene ontology terms that shifted in the same direction in both polygynandrous species relative to the monogamous species are provided in the supplementary materials for the manuscript; complete differential expression results are available on request.
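The "same direction in both polygynandrous species" criterion can be sketched as a simple filter over two pairwise contrasts (P. maniculatus vs. P. californicus and P. boylii vs. P. californicus). This is an illustration under stated assumptions, not the authors' code: the function name, the input layout, and the 0.05 adjusted p-value cutoff are assumptions, since the threshold is not given here.

```python
def classify_ms_deg(lfc_pema, padj_pema, lfc_pebo, padj_pebo, alpha=0.05):
    """Classify one gene from two contrasts against P. californicus.

    lfc_*  : log2 fold change (positive = higher in the polygynandrous species)
    padj_* : adjusted p-value, or None where no value was reported
    alpha  : significance cutoff (assumed value, not taken from the source)

    Returns "up" or "down" when both contrasts are significant and agree in
    direction, otherwise None.
    """
    if padj_pema is None or padj_pebo is None:
        return None
    if padj_pema >= alpha or padj_pebo >= alpha:
        return None
    if lfc_pema > 0 and lfc_pebo > 0:
        return "up"
    if lfc_pema < 0 and lfc_pebo < 0:
        return "down"
    return None  # significant in both, but shifted in opposite directions


# Made-up values for three hypothetical genes.
toy_results = {
    "geneA": (2.1, 0.001, 1.4, 0.010),   # concordant increase
    "geneB": (-1.8, 0.002, 0.9, 0.030),  # discordant direction
    "geneC": (1.2, 0.200, 1.5, 0.001),   # one contrast not significant
}
for gene, values in toy_results.items():
    print(gene, classify_ms_deg(*values))
```

Requiring significance in both contrasts is the stricter of the plausible readings. A gene that is significant in only one polygynandrous species is treated here as not mating system-associated.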
All raw RNAseq data (n = 58 samples: 15 testis, 15 liver, 14 seminal vesicle, and 14 epididymis) generated for this study is available on the NCBI Sequence Read Archive under BioProject PRJNA1068126. Metadata associated with each individual is available in the supplementary materials for the manuscript.
Sharing/Access information
Data was derived from the following sources:
- Raw RNAseq read data that was used to create transcriptomes is available at NCBI Sequence Read Archive BioProject PRJNA1068126.
- Transcriptomes were annotated using:
  - Peromyscus maniculatus reference genome CDS available at NCBI GenBank: HU_Pman_2.1.3, GCA_003704035 (Lassance and Hoekstra 2020)
  - Peromyscus californicus insignis reference genome CDS available at NCBI GenBank: ASM782708v3, GCA_007827085.3 (Trainor et al., 2022)
  - Mus musculus reference genome CDS available at NCBI GenBank: GRCm39, GCA_000001635.9
- Specimen metadata is available in the supplementary material for the associated manuscript (Voss and Nachman 2024, Molecular Ecology) and online in the Museum of Vertebrate Zoology (University of California, Berkeley) Mammal Collection catalog on the Arctos Database at https://arctos.database.museum/home.cfm.
Code/Software
- Raw Illumina RNAseq reads were quality assessed with FastQC v.0.11.9 (Andrews 2010) and filtered and trimmed with fastp v.0.23.2 (Chen et al., 2018).
- Transcriptomes were assembled using the de novo pipeline with Trinity v.2.15.1 (Haas et al., 2013). Then, we used CD-HIT v.4.8.1 (Li and Godzik 2006; Fu et al., 2012) to cluster sequences with greater than 95% sequence similarity and rnaQUAST v.2.2.2 (Bushmanova et al., 2016) to remove chimeric or misassembled sequences.
- We assessed transcriptome completeness using BUSCO v.5.4.5 (Simão et al., 2015) with the Euarchontoglires odb10 dataset.
- TransDecoder v.5.7.0 was used to predict open reading frames, followed by Trinotate v.3.2.2 (Bryant et al., 2017) with the Ensembl protein database for P. maniculatus (HU_Pman_2.1.3) to provide primary annotation.
- All analyses were performed using bash scripting on the University of California, Berkeley HPC cluster (Savio). Scripts for transcriptome assembly and downstream differential gene expression analyses are available from the authors upon request.
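Because the authors' scripts are available only on request, the following sketch shows how the trimming, assembly, clustering, and completeness steps listed above might be invoked. It only builds and prints command lines and runs nothing. All file names, thread and memory settings, and the function name are assumptions. The flags shown are standard options of each tool, but the authors' actual parameters are not given here.

```python
import shlex


def assembly_commands(r1, r2, trinity_fasta, prefix, threads=8, memory="50G"):
    """Build illustrative command lines for one paired-end library.

    r1, r2        : raw paired-end FASTQ files (hypothetical paths)
    trinity_fasta : path to the FASTA written by Trinity (location varies by
                    version, so it is passed in rather than guessed)
    prefix        : label used to name intermediate outputs
    """
    trimmed_r1 = f"{prefix}_R1.trimmed.fastq.gz"
    trimmed_r2 = f"{prefix}_R2.trimmed.fastq.gz"
    clustered = f"{prefix}_cdhit95.fa"
    return [
        # Filter and trim reads with fastp.
        ["fastp", "-i", r1, "-I", r2, "-o", trimmed_r1, "-O", trimmed_r2],
        # De novo assembly with Trinity.
        ["Trinity", "--seqType", "fq", "--left", trimmed_r1,
         "--right", trimmed_r2, "--CPU", str(threads),
         "--max_memory", memory, "--output", f"{prefix}_trinity"],
        # Cluster transcripts at a 95% identity threshold.
        ["cd-hit-est", "-i", trinity_fasta, "-o", clustered, "-c", "0.95"],
        # Assess completeness against the Euarchontoglires lineage dataset.
        ["busco", "-i", clustered, "-l", "euarchontoglires_odb10",
         "-m", "transcriptome", "-o", f"{prefix}_busco"],
    ]


# Hypothetical inputs for one species.
commands = assembly_commands(
    "PEBO_R1.fastq.gz", "PEBO_R2.fastq.gz", "PEBO_Trinity.fasta", "PEBO"
)
for command in commands:
    print(shlex.join(command))
```

The rnaQUAST, TransDecoder, and Trinotate steps are omitted from this sketch. Reads from several tissues would also need to be combined before assembly, which is not shown.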
References
Bryant, D. M., Johnson, K., DiTommaso, T., Tickle, T., Couger, M. B., Payzin-Dogru, D., Lee, T. J., Leigh, N. D., Kuo, T.-H., Davis, F. G., Bateman, J., Bryant, S., Guzikowski, A. R., Tsai, S. L., Coyne, S., Ye, W. W., Freeman, R. M., Jr, Peshkin, L., Tabin, C. J., … Whited, J. L. (2017). A Tissue-Mapped Axolotl De Novo Transcriptome Enables Identification of Limb Regeneration Factors. Cell Reports, 18(3), 762–776. https://doi.org/10.1016/j.celrep.2016.12.063
Bushmanova, E., Antipov, D., Lapidus, A., Suvorov, V., & Prjibelski, A. D. (2016). rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics, 32(14), 2210–2212.
Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560
[dataset] Trainor, B. C., Fisher, H., & Li, J. (2022). Whole genome sequencing of Peromyscus californicus; NCBI GenBank; ASM782708v3; https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_007827085.1/
Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150–3152. https://doi.org/10.1093/bioinformatics/bts565
[dataset] Genome Reference Consortium. (2020). Mus musculus genome assembly; NCBI GenBank; GRCm39; https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.27/
Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., Couger, M. B., Eccles, D., Li, B., Lieber, M., MacManes, M. D., Ott, M., Orvis, J., Pochet, N., Strozzi, F., Weeks, N., Westerman, R., William, T., Dewey, C. N., … Regev, A. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols, 8(8), 1494–1512. https://doi.org/10.1038/nprot.2013.084
[dataset] Lassance, J.-M., & Hoekstra, H. E. (2018). Improved assembly of the deer mouse Peromyscus maniculatus genome; NCBI GenBank; HU_Pman_2.1.3; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_003704035.3/
Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659. https://doi.org/10.1093/bioinformatics/btl158
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417–419. https://doi.org/10.1038/nmeth.4197
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19), 3210–3212. https://doi.org/10.1093/bioinformatics/btv351