Protein-coding sequences (CDS) of the Mus musculus genome
Data files
May 01, 2026 version files (874.55 KB)
- CDS_Mus_singles_names.xlsx (873.53 KB)
- README.md (1.02 KB)
Abstract
This dataset contains 22,759 non-redundant protein-coding sequences (CDS) from the Mus musculus C57BL/6 reference genome (RefSeq assembly accession GCF_000001635.27, GRCm39). Each CDS is annotated with a gene symbol, a GenBank or UniProt protein accession, and the sequence length in base pairs. The file is a single spreadsheet in which each row represents a unique gene and its corresponding CDS. This curated CDS reference enables consistent cross-species comparisons at the transcript level and was used to identify orthologous expression patterns and differential regulation in response to viral infection. The dataset is reusable for any study requiring canonical CDS references from M. musculus C57BL/6, particularly in comparative transcriptomics.
Title:
Reference coding sequences from Mus musculus (C57BL/6) genome assembly
Organism:
Mus musculus (C57BL/6 strain)
Reference Genome:
RefSeq assembly accession: GCF_000001635.27 (GRCm39)
File:
CDS_Mus_singles_names.xlsx
This file contains a curated list of 22,759 non-redundant protein-coding sequences (CDS) from the Mus musculus C57BL/6 genome assembly. Each row represents a unique gene and gives its gene symbol, the corresponding GenBank or UniProt protein accession, and the coding sequence length in base pairs.
Contents:
CDS_Mus_singles_names.xlsx
Variables (Columns):
- gene: Gene symbol (string)
- reference: GenBank or UniProt protein accession identifier (string)
- size: Length of the coding sequence in base pairs (integer)
Units:
- size: base pairs (bp)
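The column layout above can be checked programmatically before the file is reused. The following is a minimal sketch, not part of the original dataset. It assumes the spreadsheet's header row uses exactly the names gene, reference, and size, and that pandas (with openpyxl) is installed for the loading step. The helper names load_rows and validate_rows are hypothetical.

```python
from collections import Counter
from numbers import Integral

# Assumption: the header row uses exactly these names, as listed under "Variables".
EXPECTED_COLUMNS = ("gene", "reference", "size")


def load_rows(path="CDS_Mus_singles_names.xlsx"):
    """Read the first sheet into a list of dicts (needs pandas and openpyxl)."""
    import pandas as pd  # imported lazily so validate_rows works without pandas

    frame = pd.read_excel(path)
    return frame.to_dict(orient="records")


def validate_rows(rows):
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for index, row in enumerate(rows):
        missing = [name for name in EXPECTED_COLUMNS if name not in row]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
            continue
        for name in ("gene", "reference"):
            value = row[name]
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {index}: {name} is empty or not a string")
        size = row["size"]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(size, bool) or not isinstance(size, Integral) or size <= 0:
            problems.append(f"row {index}: size must be a positive integer, got {size!r}")
    # The README states that each row represents a unique gene.
    counts = Counter(row["gene"] for row in rows if isinstance(row.get("gene"), str))
    for gene, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"gene {gene!r} appears {count} times")
    return problems
```

Calling validate_rows(load_rows()) in the directory containing the spreadsheet would then list any rows that deviate from the documented layout. The row count could also be compared against the 22,759 entries stated above.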
Recommended Citation:
If using this dataset, please cite:
Genome Reference Consortium Mouse Build 39 (GRCm39), RefSeq assembly accession GCF_000001635.27.
