Data for: Rapid alignment updating with Extensiphy

Cite this dataset

Field, Jasper T. (2021). Data for: Rapid alignment updating with Extensiphy [Dataset]. Dryad.


1. High-throughput sequencing has become commonplace in evolutionary studies. Large, rapidly collected genomic datasets are used to capture biodiversity and to monitor disease transmission patterns at global and national scales, among many other applications. Updating homologous sequence datasets with new samples is cumbersome, requiring excessive program runtimes and data processing. We describe Extensiphy, a bioinformatics tool to efficiently update multiple sequence alignments with whole-genome short-read data. Extensiphy performs reference-based sequence assembly and alignment in one process while maintaining the length of the original alignment. The input data types for Extensiphy are any multiple sequence alignment in FASTA format and whole-genome, short-read FASTQ sequences.

2. To validate Extensiphy, we compared its results to those produced by two other methods that construct whole-genome scale multiple sequence alignments. We compared the methods on program runtime, base-call accuracy, dataset retention in the presence of missing data, and phylogenetic accuracy.

3. We found that Extensiphy rapidly produces high-quality updated sequence alignments while preventing alignment shrinkage due to missing data. Phylogenies estimated from alignments produced by Extensiphy show similar accuracy to other commonly used alignment construction methods.

4. Extensiphy is suitable for updating large sequence alignments and is ideal for studies of biodiversity, ecology and epidemiological monitoring efforts.
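
The property described in point 1, that an updated alignment keeps the length of the original alignment, can be checked on any pair of FASTA alignments. The following is a minimal sketch in Python and is not part of Extensiphy. The function names (read_fasta, alignment_length, new_taxa_after_update) and the toy sequences are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, joining wrapped sequence lines."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def alignment_length(alignment: Dict[str, str]) -> int:
    """Return the shared sequence length, or raise if the sequences differ in length."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError(f"not a valid alignment, sequence lengths: {sorted(lengths)}")
    return lengths.pop()


def new_taxa_after_update(original: Dict[str, str], updated: Dict[str, str]) -> List[str]:
    """Check that the alignment length is unchanged and list the taxa that were added."""
    before, after = alignment_length(original), alignment_length(updated)
    if before != after:
        raise ValueError(f"alignment length changed from {before} to {after}")
    return sorted(set(updated) - set(original))


# Toy data (hypothetical): a two-taxon alignment and an update that adds one taxon.
original = read_fasta([">taxon_a", "ACGTAC", ">taxon_b", "ACG-AC"])
updated = read_fasta([">taxon_a", "ACGTAC", ">taxon_b", "ACG-AC", ">taxon_c", "ACGTNC"])
print(new_taxa_after_update(original, updated))
```

To apply the same check to files, pass an open file handle to read_fasta in place of the list of lines.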


All data collection and data processing are described in the paper "Rapid alignment updating with Extensiphy". The files provided here are the start and end points of the analyses for both the empirical and the simulated data.

Usage notes

The README.txt file included with these data describes the folder structure and what each file or folder contains.


National Science Foundation, Award: 1759846