Datasets for manuscript: The nature, representation and measure of genetic information

Citation

Thorvaldsen, Steinar (2022), Datasets for manuscript: The nature, representation and measure of genetic information, Dryad, Dataset, https://doi.org/10.5061/dryad.h44j0zpn6

Abstract

Current studies in genetics very often refer to notions from information science. The concept of genetic information is still disputed because it attributes semantic traits to what seem to be regular biochemical entities. Some researchers maintain that the use of information in biology is just metaphorical and maybe even misleading. In this paper, we offer an analysis of the nature and characteristics of the use of information in proteins, protein families, and their sequences. It is argued that the foundation of the metaphorical view is relatively weak given the current findings in bioinformatics, and it is shown that the present understanding of genetics fits well into the context of the modern philosophy of information. Here, we propose an extension of Floridi’s conceptual model of information to better include genetic information. In addition, we discuss how to understand the qualitative aspects of genetic information and how to measure its quantitative aspects, and we present a joint statistical model that includes qualitative genetics, where the nominal genetic function is represented jointly with its metric self-information. The functional information of protein families in the Cath and Pfam databases is analysed. The paper concludes that scientific work may place information firmly as one of the fundamental components of molecular biology.

The protein alignment files from the Cath and Pfam databases are in FASTA format.

Methods

Protein alignments were retrieved from the Cath and Pfam databases, and the files were converted to FASTA format.

Alignments were obtained from http://www.cathdb.info/browse/ (version 4.3) and from https://pfam.xfam.org/browse (version 34.0).

Usage Notes

The dataset consists of standard FASTA files, which can be read and analysed with the Matlab toolbox DeltaProt.

Ref:
Thorvaldsen, S., Flå, T. and Willassen, N.P. (2010). DeltaProt: A software toolbox for comparative genomics. BMC Bioinformatics 11, 573. https://doi.org/10.1186/1471-2105-11-573
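
For readers who do not use Matlab, the FASTA files can also be read with a few lines of standard-library Python. The following is a minimal sketch and not part of the dataset or of DeltaProt: the function names read_fasta and is_aligned are hypothetical, the toy sequences are made up for illustration, and the check that all sequences have equal length assumes that each file holds a single gapped alignment.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_aligned(records):
    """Return True if all sequences have the same length (gaps included)."""
    return len({len(seq) for _, seq in records}) <= 1


# Made-up toy alignment, used only to illustrate the format.
example = StringIO(
    ">seq1 hypothetical\nMKV-LA\nGG\n"
    ">seq2 hypothetical\nMRVALAGG\n"
)
records = list(read_fasta(example))
print(records)
print(is_aligned(records))
```

To read one of the dataset files, open it in text mode and pass the file object to read_fasta in place of the StringIO example; the file name depends on which alignment was downloaded.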