Template-specific optimization of NGS genotyping pipelines reveals allele-specific variation in MHC gene expression
Data files
Jan 29, 2024 version files (70.10 KB)
- Data_MHC-I.xlsx (40.36 KB)
- Data_MHC-II.xlsx (28.32 KB)
- README.md (1.42 KB)
Abstract
Using high-throughput sequencing for precise genotyping of multi-locus gene families, such as the Major Histocompatibility Complex (MHC), remains challenging, due to the complexity of the data and difficulties in distinguishing genuine from erroneous variants. Several dedicated genotyping pipelines for data from high-throughput sequencing, such as next-generation sequencing (NGS), have been developed to tackle the ensuing risk of artificially inflated diversity. Here, we thoroughly assess three such multi-locus genotyping pipelines for NGS data, the DOC method, AmpliSAS and ACACIA, using MHC class IIβ datasets of three-spined stickleback gDNA, cDNA, and “artificial” plasmid samples with known allelic diversity. We show that genotyping of gDNA and plasmid samples at optimal pipeline parameters was highly accurate and reproducible across methods. However, for cDNA data, gDNA-optimal parameter configuration yielded decreased overall genotyping precision and consistency between pipelines. Further adjustments of key clustering parameters were required to account for higher error rates and larger variation in sequencing depth per allele, highlighting the importance of template-specific pipeline optimization for reliable genotyping of multi-locus gene families. Through accurate paired gDNA-cDNA typing and MHC-II haplotype inference, we show that MHC-II allele-specific expression levels correlate negatively with allele number across haplotypes. Lastly, sibship-assisted cDNA-typing of MHC-I revealed novel variants linked in haplotype blocks and a higher-than-previously-reported individual MHC-I allelic diversity. In conclusion, we provide novel genotyping protocols for the three-spined stickleback MHC-I and -II genes and evaluate the performance of popular NGS-genotyping pipelines. We also show that fine-tuned genotyping of paired gDNA-cDNA samples facilitates amplification bias-corrected MHC allele expression analysis.
README: Template-specific optimization of NGS genotyping pipelines reveals allele-specific variation in MHC gene expression
https://doi.org/10.5061/dryad.qfttdz0qb
Description of the data and file structure
This submission consists of two Excel files.
The file 'Data_MHC-I' includes information regarding the 10 three-spined stickleback families included in our MHC-I genotyping dataset, and is separated into three sheets:
(i) Families overview, with the number of offspring and the individual IDs in each family (columns: Family ID and corresponding offspring IDs),
(ii) Family genotypes (columns: Family ID, Inferred Parental Genotype1, Inferred Parental Genotype2, Observed Offspring Genotypes, Number of Alleles Per Genotype, and Number of Offspring), and
(iii) Allele segregation by family, where a table is presented for each of the 10 families used to infer the genetic linkage between MHC-I loci of the three-spined stickleback.
The file 'Data_MHC-II' includes the genotypes of all samples included in our MHC-II genotyping dataset. It is structured in one sheet (columns: Sample ID, Sample template, Data subset, Genotype, and Depth, followed by the names and corresponding depth, in reads, of all alleles called in each sample). All genotypes described were obtained at the optimal AmpliSAS configuration for each template type (gDNA, cDNA, plasmid), as described in detail in our paper.
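For readers who want to process the 'Data_MHC-II' sheet programmatically, the following is a minimal Python sketch of how one row might be split into sample metadata and per-allele read depths. It is not part of the submission. It assumes that each row has already been read into a list of cell values (for example after exporting the sheet to CSV), and that the cells after the five fixed columns alternate between an allele name and its depth, with empty cells padding samples that carry fewer alleles. That layout is a guess based on the column description above and should be checked against the file. The function names `parse_sample_row` and `relative_depths` and the values in the usage example are hypothetical.

```python
from typing import Dict, Sequence

# Fixed leading columns, in the order listed in the README.
FIXED_COLUMNS = ["Sample ID", "Sample template", "Data subset", "Genotype", "Depth"]


def parse_sample_row(row: Sequence[object]) -> Dict[str, object]:
    """Split one sheet row into sample metadata and a per-allele depth map.

    Assumption (not confirmed by the README): after the five fixed columns,
    cells alternate between an allele name and its read depth, and trailing
    cells are empty for samples with fewer alleles.
    """
    n_fixed = len(FIXED_COLUMNS)
    record: Dict[str, object] = dict(zip(FIXED_COLUMNS, row[:n_fixed]))
    tail = list(row[n_fixed:])
    allele_depths: Dict[str, int] = {}
    # Pair every even-indexed cell (name) with the following cell (depth).
    for name, depth in zip(tail[0::2], tail[1::2]):
        if name in (None, ""):
            continue  # padding cell: no further alleles in this sample
        allele_depths[str(name)] = int(depth)
    record["allele_depths"] = allele_depths
    return record


def relative_depths(allele_depths: Dict[str, int]) -> Dict[str, float]:
    """Return each allele's share of the reads assigned to called alleles."""
    total = sum(allele_depths.values())
    if total == 0:
        return {}
    return {name: depth / total for name, depth in allele_depths.items()}


if __name__ == "__main__":
    # Made-up placeholder values, not taken from the dataset.
    example_row = ["sample_1", "gDNA", "subset_1", "genotype_1", 1000,
                   "allele_a", 600, "allele_b", 400, None, None]
    parsed = parse_sample_row(example_row)
    print(parsed["allele_depths"])
    print(relative_depths(parsed["allele_depths"]))
```

Relative depths computed this way are only a descriptive summary of the read counts in a single sample; the amplification bias-corrected expression analysis described in the paper relies on paired gDNA-cDNA samples and is not reproduced by this sketch.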