Dryad

Algorithms for determining transposable genes in a genome

Abstract

Transposons are nucleotide sequences in DNA that can change their positions within a genome. Many transposons are shorter than a typical gene. Even when we restrict attention to nucleotide sequences that form complete genes, we still find genes that change their relative locations in a genome. Thus, for different individuals of the same species, the order of genes may differ. A practical problem is to determine such transposable genes from given gene sequences. Through an intuitive rule, we transform the biological problem of determining transposable genes into the rigorous mathematical problem of determining the longest common subsequence. Depending on whether the gene sequence is linear (each sequence has a fixed head and tail) or circular (any gene can be chosen as the head, and the gene preceding it is then the tail), and on whether genes have multiple copies, we classify the problem of determining transposable genes into four scenarios: (1) linear sequences without duplicated genes; (2) circular sequences without duplicated genes; (3) linear sequences with duplicated genes; (4) circular sequences with duplicated genes. With the help of graph theory, we design fast algorithms for each scenario. In particular, we study the situation where the longest common subsequence is not unique.
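To make the reduction to the longest common subsequence concrete, here is a minimal sketch for scenario (1) only, linear sequences without duplicated genes. It is not the code shipped in this dataset. It assumes that genes absent from one longest common subsequence are the candidates for transposable genes, and it relies on the standard fact that, for duplicate-free sequences, the longest common subsequence can be found as a longest increasing subsequence. The function names `lcs_of_permutations` and `candidate_transposable_genes` and the gene labels are hypothetical. The sketch returns a single longest common subsequence and does not address the non-unique case studied here.

```python
from bisect import bisect_left


def lcs_of_permutations(seq_a, seq_b):
    """Return one longest common subsequence of two duplicate-free gene orders."""
    pos_in_a = {gene: i for i, gene in enumerate(seq_a)}
    # Walk seq_b, keeping only shared genes, each tagged with its position in seq_a.
    shared = [(pos_in_a[g], g) for g in seq_b if g in pos_in_a]

    tails = []      # tails[k]: smallest end position of an increasing run of length k + 1
    tails_idx = []  # index into `shared` of the element stored in tails[k]
    prev = [-1] * len(shared)  # back-pointers for reconstructing the run

    for i, (p, _) in enumerate(shared):
        k = bisect_left(tails, p)
        if k == len(tails):
            tails.append(p)
            tails_idx.append(i)
        else:
            tails[k] = p
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1

    # Follow back-pointers from the end of the longest run.
    result = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(shared[i][1])
        i = prev[i]
    return result[::-1]


def candidate_transposable_genes(seq_a, seq_b):
    """Genes of seq_a that fall outside the computed common subsequence (assumed rule)."""
    kept = set(lcs_of_permutations(seq_a, seq_b))
    return [g for g in seq_a if g not in kept]


# Hypothetical gene labels, for illustration only.
order_1 = ["A", "B", "C", "D", "E"]
order_2 = ["A", "D", "B", "C", "E"]
print(lcs_of_permutations(order_1, order_2))
print(candidate_transposable_genes(order_1, order_2))
```

The binary search keeps the running time at O(n log n) for n genes, compared with the O(n^2) dynamic program needed for general sequences. Circular sequences and duplicated genes need additional handling that this sketch does not attempt.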

This dataset contains the code files for the corresponding algorithms. It also includes gene sequence data for certain Escherichia coli strains (from NCBI), which are used to test those algorithms.