Evaluation of recombination detection methods for viral sequencing
Data files
Nov 27, 2023 version files (6.96 MB):
README.md - 610 B
sim.tar.gz - 6.96 MB
Jan 28, 2024 version files (41.21 MB)
Abstract
Recombination is a key evolutionary driver in shaping novel viral populations and lineages. When unaccounted for, recombination can impact evolutionary estimations, or complicate their interpretation. Therefore, identifying signals for recombination in sequencing data is a key prerequisite to further analyses. A repertoire of recombination detection methods has been developed over the past two decades; however, the prevalence of pandemic-scale viral sequencing data poses a computational challenge for existing methods. Here, we assessed five recombination detection methods (PhiPack (Profile), 3SEQ, GENECONV, VSEARCH (UCHIME), and gmos) to determine if any are suitable for the analysis of bulk sequencing data. To test the performance and scalability of these methods, we analysed simulated viral sequencing data across a range of sequence diversities, recombination frequencies, and sample sizes. Further, we provide a practical example for the analysis and validation of empirical data. We find that recombination detection methods need to be scalable, use an analytical approach and resolution that is suitable for the intended research application, and be accurate for the properties of a given dataset (e.g. sequence diversity and estimated recombination frequency). Analysis of simulated and empirical data revealed that the assessed methods exhibited considerable trade-offs between these criteria. Overall, we provide general guidelines for the validation of recombination detection results, the benefits and shortcomings of each assessed method, and future considerations for recombination detection methods for the assessment of large-scale viral sequencing data.
README
FASTA alignments (.fasta) simulated with SANTA-SIM across a range of parameter values.
All file names follow the structure msa_m_r_n_dual_rep.fasta, with a numeric value following each parameter label.
The parameters are:
m = mutation rate
r = recombination rate
n = number of sequences
dual = dual-infection probability
rep = replicate
Each numeric value belongs to the parameter label it follows. For example, msa_m0.001_rc0.05_n100_dual1_rep2.fasta means that the alignment was simulated with the following parameters:
mutation rate = 0.001
recombination rate = 0.05
number of sequences = 100
dual-infection probability = 1
replicate = 2
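The naming scheme above can be decoded programmatically. The following is a minimal Python sketch, not part of the dataset itself: the names `SimParams` and `parse_alignment_name` are hypothetical, and the pattern accepts both `r` (as in the general structure) and `rc` (as in the example file name) for the recombination rate, since both spellings appear in this README.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional

# Matches integers, decimals, and (as an assumption) scientific notation.
_NUM = r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"

# Assumption: the recombination label may be written "r" or "rc".
_NAME_RE = re.compile(
    rf"^msa_m(?P<m>{_NUM})"
    rf"_rc?(?P<r>{_NUM})"
    r"_n(?P<n>\d+)"
    rf"_dual(?P<dual>{_NUM})"
    r"_rep(?P<rep>\d+)\.fasta$"
)


class SimParams(NamedTuple):
    """Simulation parameters encoded in an alignment file name (hypothetical helper)."""
    mutation_rate: float
    recombination_rate: float
    n_sequences: int
    dual_infection: float
    replicate: int


def parse_alignment_name(path: str) -> Optional[SimParams]:
    """Return the parameters encoded in the file name, or None if it does not match."""
    match = _NAME_RE.match(PurePosixPath(path).name)
    if match is None:
        return None
    return SimParams(
        mutation_rate=float(match["m"]),
        recombination_rate=float(match["r"]),
        n_sequences=int(match["n"]),
        dual_infection=float(match["dual"]),
        replicate=int(match["rep"]),
    )


params = parse_alignment_name("msa_m0.001_rc0.05_n100_dual1_rep2.fasta")
print(params)
```

Returning None for non-matching names (rather than raising) makes it easy to skip unrelated files, such as README.md, when scanning a directory.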
Files
performance.tar.gz - alignments for the performance benchmark
scale.tar.gz - alignments for the scalability benchmark
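The archives can be inspected without unpacking them to disk. The sketch below uses Python's standard `tarfile` module and rests on some assumptions: that the archives are gzip-compressed tar files (as the .tar.gz extension suggests), that alignments are stored as regular members ending in .fasta, and that each sequence starts with a standard ">" header line. The function names `list_alignments` and `count_sequences` are hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def list_alignments(archive_path):
    """Return the sorted names of all .fasta members in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(
            member.name
            for member in tar.getmembers()
            if member.isfile() and PurePosixPath(member.name).suffix == ".fasta"
        )


def count_sequences(archive_path, member_name):
    """Count FASTA records (lines starting with '>') in one archive member."""
    with tarfile.open(archive_path, "r:gz") as tar:
        handle = tar.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file")
        with handle:
            # extractfile yields bytes, so compare against a bytes prefix.
            return sum(1 for line in handle if line.startswith(b">"))
```

For example, calling `list_alignments("performance.tar.gz")` and then `count_sequences` on each returned name gives a quick check that the number of records in a file agrees with the n value in its name.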