Dryad

The effect of methodological considerations on the construction of gene-based plant pan-genomes

Data files

Mar 30, 2022 version files 33.97 GB
Oct 07, 2022 version files 35.70 GB
May 31, 2023 version files 38.98 GB
Jun 12, 2023 version files 39.19 GB

Abstract

Pan-genomics is an emerging approach for studying the genetic diversity within plant populations. In contrast to common resequencing studies, which compare whole-genome sequencing data to a single reference genome, the construction of a pan-genome involves the direct comparison of multiple genomes to one another. This enables the detection of genomic sequences and genes not present in the reference, as well as the analysis of gene content diversity. While multiple studies describing pan-genomes of various plant species have been published in recent years, our understanding of how the computational procedures used for pan-genome construction affect the results is still limited.

Here we examine the effect of several key methodological factors on the obtained gene pool and on gene presence-absence detections by constructing and comparing multiple pan-genomes of Arabidopsis thaliana and cultivated soybean, as well as conducting a meta-analysis of published pan-genomes. These factors include the construction method, the sequencing depth, and the extent of input data used for gene annotation. We observe substantial differences between pan-genomes constructed using three common procedures (de novo assembly and annotation, map-to-pan, and iterative assembly), and find that the results depend on the extent of the input data. Specifically, we report low agreement between the gene content inferred using different procedures and input data. Our results should raise the community's awareness of the consequences of methodological decisions made during pan-genome construction and emphasize the need for further investigation of commonly applied methodologies.
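
To make the notion of "agreement between gene content" concrete, the following is a minimal sketch of two ways such agreement could be quantified. It is not the procedure used in the study. The function names, the toy gene and sample identifiers, and the assumption that genes from different pan-genomes have already been matched to shared identifiers are all hypothetical and introduced only for illustration.

```python
def jaccard(genes_a, genes_b):
    """Jaccard similarity between the gene pools of two pan-genomes."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    # Two empty gene pools are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def pav_agreement(pav_a, pav_b):
    """Fraction of matching presence-absence calls over shared genes and samples.

    Each argument maps gene ID -> {sample ID -> True (present) / False (absent)}.
    Returns None when the two pan-genomes share no (gene, sample) pair.
    """
    agree = total = 0
    for gene in set(pav_a) & set(pav_b):
        for sample in set(pav_a[gene]) & set(pav_b[gene]):
            total += 1
            agree += pav_a[gene][sample] == pav_b[gene][sample]
    return agree / total if total else None


# Hypothetical toy data: two pan-genomes built with different procedures.
de_novo = {
    "geneA": {"s1": True, "s2": False},
    "geneB": {"s1": True, "s2": True},
    "geneC": {"s1": False, "s2": True},
}
map_to_pan = {
    "geneA": {"s1": True, "s2": True},
    "geneB": {"s1": True, "s2": True},
    "geneD": {"s1": True, "s2": False},
}

pool_similarity = jaccard(de_novo, map_to_pan)       # 2 shared genes out of 4 -> 0.5
call_agreement = pav_agreement(de_novo, map_to_pan)  # 3 of 4 shared calls match -> 0.75
print(pool_similarity, call_agreement)
```

In this sketch the first measure captures differences in the gene pool itself, while the second captures differences in presence-absence calls for genes found by both procedures. In practice, matching genes between independently constructed pan-genomes would require a sequence-based comparison step that is omitted here.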