Data from: The Cumulative Indel Model: fast and accurate statistical evolutionary alignment

Cite this dataset

De Maio, Nicola (2020). Data from: The Cumulative Indel Model: fast and accurate statistical evolutionary alignment [Dataset]. Dryad. https://doi.org/10.5061/dryad.rbnzs7h8m

Abstract

Sequence alignment is essential for phylogenetic and molecular evolution inference, as well as for many other areas of bioinformatics and evolutionary biology. Inaccurate alignments can lead to severe biases in most downstream statistical analyses. Statistical alignment based on probabilistic models of sequence evolution addresses these issues by replacing heuristic score functions with evolutionary model-based probabilities. However, score-based aligners and fixed-alignment phylogenetic approaches are still more prevalent than methods based on evolutionary indel models, mostly due to computational convenience. Here, I present new techniques for improving the accuracy and speed of statistical evolutionary alignment. The "cumulative indel model" approximates realistic evolutionary indel dynamics using differential equations. "Adaptive banding" reduces the computational demand of most alignment algorithms without requiring prior knowledge of divergence levels or pseudo-optimal alignments. Using simulations, I show that these methods lead to fast and accurate pairwise alignment inference. I also show that these methods make it possible to align a single long synteny block (approximately 530 kbp) between the human and chimpanzee genomes and to infer evolutionary parameters from it. The cumulative indel model and adaptive banding can therefore improve the performance of alignment and phylogenetic methods.

Usage notes

The code for this project is available at https://bitbucket.org/nicofmay/cumulativeindel/.