Sars-Cov-2 and Mers sequences from human host with no unknown characters

Formentin, Marco 1; Favretti, Marco 1; Chignola, Roberto 2

Published Jan 02, 2024 on Dryad. https://doi.org/10.5061/dryad.9s4mw6mp2

Data files

Jan 02, 2024 version files 25.61 MB

Mers_dataset.txt (6.60 KB)
README.md (2.60 KB)
Sars_Cov_2_dataset_restricted.txt (149.45 KB)
Sars_Cov_2_dataset.txt (25.45 MB)

Abstract

The datasets are organized as follows: first column, number of bases in a given sequence; second, third, fourth and fifth columns, number of bases of type A, C, G and T, respectively, in the same sequence.

1) Sars-Cov-2 dataset: This dataset contains the base counts for the complete genome sequences from a human host, with no unknown characters. In the NCBI database, there are about 950,000 sequences with these characteristics.

2) Restricted Sars-Cov-2 dataset: This dataset contains the base counts for the complete sequences from a human host, with no unknown characters, that are 29903 bases long, that is, the same length as the reference sequence NC_045512.2. We obtained about 5600 sequences with these features from the NCBI database.

3) Mers dataset: This dataset contains the base counts for about 200 complete genome sequences from a human host, with no unknown characters.

We uploaded three datasets organized as follows: first column, number of bases in a given sequence; second, third, fourth and fifth columns, number of bases of type A, C, G and T, respectively, in the same sequence.
1) Sars-Cov-2 dataset: This dataset contains the base counts for the complete genome sequences from a human host, with no unknown characters. In the NCBI database, there are about 950,000 sequences with these characteristics.
2) Restricted Sars-Cov-2 dataset: This dataset contains the base counts for the complete sequences from a human host, with no unknown characters, that are 29903 bases long, that is, the same length as the reference sequence NC_045512.2. We obtained about 5800 sequences with these features from the NCBI database.
3) Mers dataset: This dataset contains the base counts for about 250 complete genome sequences from a human host, with no unknown characters.

Description of the data

Each dataset is organized as follows: first column, number of bases in a given sequence; second, third, fourth and fifth columns, number of bases of type A, C, G and T, respectively, in the same sequence. Each row reports the counts for one sequence, and rows appear in the same order as the sequences in the raw data (see links below).
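
The five-column layout described above can be loaded with a few lines of C++. The following is a minimal sketch, not part of the deposited code: it assumes the columns are separated by whitespace (the description does not state the delimiter), and the names BaseCounts, read_counts, is_consistent and gc_fraction are hypothetical helpers introduced only for illustration.

```cpp
#include <array>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One row of a dataset file: sequence length, then counts of A, C, G and T.
struct BaseCounts {
    long total = 0;
    std::array<long, 4> acgt{};  // order: A, C, G, T
};

// Parse rows from a stream. Assumes whitespace-separated integer columns;
// blank lines and rows with fewer than five numbers are skipped.
std::vector<BaseCounts> read_counts(std::istream& in) {
    std::vector<BaseCounts> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        BaseCounts r;
        if (fields >> r.total >> r.acgt[0] >> r.acgt[1] >> r.acgt[2] >> r.acgt[3]) {
            rows.push_back(r);
        }
    }
    return rows;
}

// With no unknown characters, the four counts should add up to the length.
bool is_consistent(const BaseCounts& r) {
    return r.acgt[0] + r.acgt[1] + r.acgt[2] + r.acgt[3] == r.total;
}

// Fraction of C and G bases in one sequence (0 for an empty row).
double gc_fraction(const BaseCounts& r) {
    return r.total > 0
        ? static_cast<double>(r.acgt[1] + r.acgt[2]) / static_cast<double>(r.total)
        : 0.0;
}
```

Passing a std::ifstream opened on one of the dataset files to read_counts would return one BaseCounts entry per sequence; is_consistent gives a quick sanity check that the first column equals the sum of the other four.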

Sharing information

Raw data are freely available at the following address:

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202%20(SARS-CoV-2),%20taxid:2697049&SeqType_s=Nucleotide

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Middle%20East%20respiratory%20syndrome-related%20coronavirus,%20taxid:1335626

Code

Readfasta: a C++ code for sequence base count. We provide a C++ program that reads a dataset of nucleic acid sequences in FASTA format and returns the number of bases in each sequence. The output file seqcount.txt contains a table organized as follows: first column, number of bases in a given sequence; second, third, fourth and fifth columns, number of bases of type A, C, G and T, respectively, in the same sequence. Each row reports the counts for one sequence, and rows appear in the same order as the sequences in the input dataset.