Algorithms for determining transposable genes in a genome

Cite this dataset

Wang, Yue (2022). Algorithms for determining transposable genes in a genome [Dataset]. Dryad. https://doi.org/10.5061/dryad.9zw3r22j3

Abstract

Transposons are nucleotide sequences in DNA that can change their positions. Many transposons are shorter than a typical gene. When we restrict attention to nucleotide sequences that form complete genes, we can still find genes that change their relative locations in a genome. Thus, for different individuals of the same species, the order of genes might differ. A practical problem is to determine such transposable genes in given gene sequences. Through an intuitive rule, we transform the biological problem of determining transposable genes into a rigorous mathematical problem of determining the longest common subsequence. Depending on whether the gene sequence is linear (each sequence has a fixed head and tail) or circular (we can choose any gene as the head, and the previous one is the tail), and whether genes have multiple copies, we classify the problem of determining transposable genes into four scenarios: (1) linear sequences without duplicated genes; (2) circular sequences without duplicated genes; (3) linear sequences with duplicated genes; (4) circular sequences with duplicated genes. With the help of graph theory, we design fast algorithms for the different scenarios. In particular, we study the situation where the longest common subsequence is not unique.

This dataset contains code files for the corresponding algorithms. It also includes gene sequence data for certain Escherichia coli strains (from NCBI), which are used to test those algorithms.
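
To illustrate the reduction to the longest common subsequence described above, here is a minimal Python sketch. It is not the code from this dataset and not one of the paper's algorithms. It assumes that each gene is represented by a hashable label, that no gene is duplicated, and that the genes outside a longest common subsequence are the candidates for transposable genes. The function names are hypothetical. For the linear case it uses the standard fact that, for duplicate-free sequences, the longest common subsequence can be found as a longest increasing subsequence of positions. For the circular case it uses brute force over rotations, which is far slower than a purpose-built algorithm. It returns only one longest common subsequence and does not treat the non-unique case.

```python
from bisect import bisect_left


def linear_lcs(seq_a, seq_b):
    """Return one longest common subsequence of two duplicate-free gene lists."""
    # Position of each gene in seq_b; genes absent from seq_b are skipped.
    pos_in_b = {gene: i for i, gene in enumerate(seq_b)}
    genes = [g for g in seq_a if g in pos_in_b]
    positions = [pos_in_b[g] for g in genes]

    # Longest increasing subsequence of positions (patience sorting).
    tail_vals = []  # tail_vals[k]: smallest tail value of a run of length k + 1
    tail_idx = []   # index into `positions` of that tail
    prev = [-1] * len(positions)  # back-pointers used to rebuild the run
    for i, p in enumerate(positions):
        k = bisect_left(tail_vals, p)
        if k == len(tail_vals):
            tail_vals.append(p)
            tail_idx.append(i)
        else:
            tail_vals[k] = p
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else -1

    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        result.append(genes[i])
        i = prev[i]
    result.reverse()
    return result


def circular_lcs(seq_a, seq_b):
    """Brute force: keep seq_a fixed, try every rotation of seq_b."""
    best = []
    for shift in range(max(len(seq_b), 1)):
        candidate = linear_lcs(seq_a, seq_b[shift:] + seq_b[:shift])
        if len(candidate) > len(best):
            best = candidate
    return best


def candidate_transposable(seq_a, seq_b, circular=False):
    """Genes of seq_a that lie outside the chosen common subsequence."""
    lcs = circular_lcs(seq_a, seq_b) if circular else linear_lcs(seq_a, seq_b)
    kept = set(lcs)
    return [g for g in seq_a if g not in kept]


if __name__ == "__main__":
    a = ["a", "b", "c", "d", "e"]
    b = ["a", "d", "b", "c", "e"]
    print(linear_lcs(a, b))              # ['a', 'b', 'c', 'e']
    print(candidate_transposable(a, b))  # ['d']
```

In the example, gene `d` is the only one outside the common subsequence, so it is the only candidate. When several longest common subsequences exist, which one this sketch returns depends on the input order, so its output should not be read as a complete answer in that case.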

Methods

Gene sequences are from the NCBI database.

Usage notes

This repository contains Python code for all algorithms in my paper https://arxiv.org/abs/1506.02424.

NewScenario1.py implements Algorithms 1 and 2 for Scenario 1.

NewScenario2.py implements Algorithms 3 and 4 for Scenario 2.

Scenario3.py implements Algorithm 5 for Scenario 3.

Scenario4.py implements Algorithm 6 for Scenario 4.

NewScenario1 test.py runs Algorithms 1 and 2 on real data.

NewScenario2 test.py runs Algorithms 3 and 4 on real data.

S3test.py tests the performance of Algorithm 5 for Scenario 3 on various random graphs.

S4test.py tests the performance of Algorithm 6 for Scenario 4 on various random graphs.

CPxxxx.txt are processed gene sequences, used in tests of Scenarios 1 and 2.

Escherichia coli xxxx.txt are original annotation files, used to generate CPxxxx.txt.

Process ST540.py processes three Escherichia coli xxxx.txt files into CPxxxx.txt.

Process ST2747.py processes three Escherichia coli xxxx.txt files into CPxxxx.txt.

Scenario1.py (outdated!) implements Algorithms 1 and 2 for Scenario 1.

Scenario2.py (outdated!) implements Algorithms 3 and 4 for Scenario 2.