Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important but increasingly computationally expensive step, given the recent surge in DNA sequence data. Much of this sequence data is publicly available, but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected “by eye” prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear if these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combine automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set composed exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach followed by manual editing by eye, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared resulting tree topologies based on these different alignment methods using multiple metrics. We further determined if the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods even in data sets that are extremely difficult to align. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need a revision.

Cleaned_GenBank_Files.zip

Hepatitis B virus GenBank files after initial data filtering steps.

cleaned_genbank.zip

Genome_alignments.zip

Sequence alignments of hepatitis B virus genomes and the S-region. Files include the manual genome alignment, de-gapped manual alignments, MUSCLE genome alignment, linearized and unlinearized PASTA alignments, and the S-region alignment.

Genome_trees.zip

Tree files estimated from sequence alignments of hepatitis B virus genomes. Trees are best maximum likelihood (ML) trees with bootstrap support values. Includes trees based on MUSCLE, manual, and PASTA genome alignments.

Genome_consensus_sequence.fasta

Consensus sequence of hepatitis B virus genomes. This sequence was used as a reference for HBV manual alignments.

GenomeConsensus.fasta

Genotype_trees.zip

Tree files used for genotype occupancy tests in hepatitis B viruses. Trees were estimated from manual or PASTA genome alignments. Files are provided in .tre and .xml formats.

GI_Clustering.zip

Initial files of hepatitis B virus sequences clustered according to GenBank GI number.

Supplementary_Table_S1.xlsx

Pairwise comparisons of hepatitis B virus phylogenies. Tree names (based on alignment type) are listed in the first two columns. Tree incompatibility ratios (T.ratio) and Robinson-Foulds distances (RF distance) are listed for each tree comparison and bootstrap support threshold (Cutoff).

hbv_supp_tableS1_compatability_rf.xlsx
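
For readers unfamiliar with the Robinson-Foulds distance reported in this table, the following is a minimal sketch of how it can be computed for two trees on the same tips. It is not the procedure used to produce the table (the commands actually used are listed in the Supplementary Information file). Trees are represented here as nested Python tuples with strings as tips, and the function names and toy trees are illustrative assumptions. The sketch ignores branch support, so it does not reproduce the bootstrap cutoffs applied in the table.

```python
def tip_set(node):
    """Return the frozenset of tip labels under a node (a tip is a string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tip_set(child) for child in node))


def splits(tree):
    """Collect the non-trivial bipartitions of a tree, treated as unrooted.

    Each bipartition is stored as the side that does NOT contain an
    arbitrary anchor tip, so the same split always has one representation.
    """
    all_tips = tip_set(tree)
    anchor = min(all_tips)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        below = tip_set(node)
        side = all_tips - below if anchor in below else below
        # Skip trivial splits (a single tip against all the others).
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        for child in node:
            walk(child)

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count the bipartitions present in exactly one of the two trees."""
    if tip_set(tree_a) != tip_set(tree_b):
        raise ValueError("trees must share the same set of tips")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical toy trees that differ only in how a, b, and c are grouped.
tree_1 = ((("a", "b"), "c"), ("d", "e"))
tree_2 = ((("a", "c"), "b"), ("d", "e"))
rf_same = rf_distance(tree_1, tree_1)
rf_diff = rf_distance(tree_1, tree_2)
```

A distance of zero means the two topologies share every bipartition; larger values mean more splits are unique to one tree.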

Supplementary_Table_S2.xlsx

Genotype occupancy values for hepatitis B viruses. Genotype occupancy is the proportion of each genotype that makes up the minimum clade including all individuals of that genotype. Values are given for each tree (named according to alignment method), genotype, and bootstrap support cutoff for collapsing branches. The total number of sequences (i.e., tips in the tree) is indicated for each genotype.

hbv_supp_tableS2_genotype_occupancy.xlsx
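
The definition of genotype occupancy given above can be illustrated with a short sketch. This is one reading of that definition, not the authors' script: the tree is assumed to be rooted and is written as nested Python tuples, and the function names, tip labels, and genotype assignments are all hypothetical. Branches collapsed at a bootstrap cutoff could be represented as tuples with more than two children.

```python
def tips(node):
    """Return the list of tip labels under a node (a tip is a string)."""
    if isinstance(node, str):
        return [node]
    return [label for child in node for label in tips(child)]


def smallest_clade(node, targets):
    """Return the tips of the smallest clade containing every target tip.

    Returns None when this node does not contain all of the targets.
    """
    here = tips(node)
    if not targets.issubset(here):
        return None
    if not isinstance(node, str):
        for child in node:
            found = smallest_clade(child, targets)
            if found is not None:
                return found
    return here


def genotype_occupancy(tree, genotype_of):
    """Map each genotype to (its tip count) / (size of its minimum clade)."""
    result = {}
    for genotype in sorted(set(genotype_of.values())):
        members = {t for t, g in genotype_of.items() if g == genotype}
        clade = smallest_clade(tree, members)
        result[genotype] = len(members) / len(clade)
    return result


# Hypothetical toy tree: genotypes A and C are monophyletic, B is not.
tree = ((("a1", "a2"), "b1"), ("b2", ("c1", "c2")))
genotype_of = {
    "a1": "A", "a2": "A",
    "b1": "B", "b2": "B",
    "c1": "C", "c2": "C",
}
occupancy = genotype_occupancy(tree, genotype_of)
```

Under this reading, an occupancy of 1.0 indicates a monophyletic genotype, while lower values indicate that the minimum clade also contains tips from other genotypes.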

Supplementary_Information.docx

List of commands used in software programs for the alignment and tree estimation of hepatitis B virus sequences.

hbv_suppinfo_commands.docx

Initial_GenBank_Download.gb

Initial GenBank download file of all hepatitis B virus sequences as of March 20, 2013.

initial_genbank_download_sequences.gb

S-region_trees.zip

Trees estimated from S-region alignments of hepatitis B viruses. Includes the best maximum likelihood tree with bootstrap support values, and a file with the bootstrap replicates.

S_region_trees.zip

Total_alignment_trees.zip

Trees estimated from total alignments (genomes + fragmentary sequences) of hepatitis B viruses. Includes trees estimated from manual and UPP alignments.

Total_alignments.zip

Sequence alignments of the total (genomes + fragmentary sequences) hepatitis B virus data set. Files include the manual alignment and both UPP alignments (manual genome alignment backbone, PASTA genome alignment backbone).

Data from: Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus

