
Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection


Tamisier, Lucie et al. (2021), Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection, Dryad, Dataset,


In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows the sequencing of huge amounts of DNA and RNA fragments at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of data analysis. The Plant Health Bioinformatic Network (PHBN) is a Euphresco project aiming to build a community network of bioinformaticians and computational biologists working in plant health. One of the main goals of the project is to develop reference datasets that can be used for validation of bioinformatics pipelines and for standardization purposes.

Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). Each is composed of a “real” HTS dataset spiked with artificial viral reads. They allow researchers to adjust their pipelines and parameters so that the results approximate the actual viral composition of the semi-artificial datasets as closely as possible. Each semi-artificial dataset tests one or several limitations that could prevent virus detection or correct virus identification from HTS data (i.e. low viral concentration, new viral species, non-complete genome).
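One way to quantify how closely a pipeline's output approximates the known composition is to compare the set of viruses it reports with the set actually spiked into a dataset. The Python sketch below is only an illustration: the function name `score_detection` and the virus names are placeholders, not part of the datasets or of any PHBN tool, and the real composition of each dataset must be taken from its description.

```python
def score_detection(expected, detected):
    """Compare a pipeline's virus calls with the known composition of a dataset."""
    expected, detected = set(expected), set(detected)
    true_pos = expected & detected
    # Sensitivity: fraction of spiked viruses that the pipeline found.
    sensitivity = len(true_pos) / len(expected) if expected else 0.0
    # Precision: fraction of reported viruses that were really present.
    precision = len(true_pos) / len(detected) if detected else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "missed": sorted(expected - detected),
        "unexpected": sorted(detected - expected),
    }


# Placeholder names, NOT the real composition of any dataset.
expected = ["virus_A", "virus_B", "virus_C"]
detected = ["virus_A", "virus_C", "virus_D"]
result = score_detection(expected, detected)
print(result)
```

Rerunning such a comparison after each parameter change shows whether an adjustment recovers a missed virus or introduces an unexpected one.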

Eight artificial datasets composed only of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates of the same viral species, present at different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset.

A GitLab repository is available and provides a complete description of the composition of each dataset, the methods used to create them, and their goals.

Usage Notes


These are the FASTQ files of the 18 datasets.
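Before running a pipeline, it can be useful to check that a downloaded file is intact by counting its reads and bases. The Python sketch below assumes the standard four-line FASTQ layout (header, sequence, `+` separator, qualities) with unwrapped sequences; the function name `fastq_stats` is hypothetical, and the demo records are made up rather than taken from the datasets.

```python
import io
from itertools import islice


def fastq_stats(lines):
    """Return (read count, total bases) for four-line FASTQ records."""
    n_reads = 0
    n_bases = 0
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # header, sequence, separator, qualities
        if not record:
            break
        if (len(record) < 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("malformed FASTQ record")
        n_reads += 1
        n_bases += len(record[1].strip())
    return n_reads, n_bases


# Made-up records for demonstration only.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n")
reads, bases = fastq_stats(demo)
print(reads, bases)
```

For a real file, the same function can be given an open text handle, for example one returned by `open(path)` or, for a gzip-compressed file, by `gzip.open(path, "rt")`.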

Description of the datasets

This is a Word document describing each dataset.