Convergent evolution of artemisinin and chloroquine resistance in Ethiopian Plasmodium falciparum parasites
Data files
Jul 16, 2025 version files 242.08 KB
-
README.md
14.76 KB
-
Supplementary_Table_1.csv
23.63 KB
-
Supplementary_Table_2.csv
5.47 KB
-
Supplementary_Table_3A.csv
696 B
-
Supplementary_Table_3B.csv
876 B
-
Supplementary_Table_4.csv
22.58 KB
-
Supplementary_Table_5.csv
4.56 KB
-
Supplementary_Table_6.csv
22.48 KB
-
Supplementary_Tables.docx
147.02 KB
Abstract
The emergence of antimalarial drug resistance has threatened the control and elimination of malaria in Africa. Ethiopia, once a success story in case reduction, is now facing resurgence. This study examined the key drug resistance genes (Pfmdr1, Pfcrt, Pfk13, Pfdhfr, and Pfdhps) and mitochondrial genomes of 605 Plasmodium falciparum isolates from 15 districts across Ethiopia, varying in transmission intensity and P. vivax co-endemicity. A dominant PfMDR1 NFSND haplotype, associated with reduced lumefantrine susceptibility, was identified alongside near-fixation of the PfCRT CVIET chloroquine-resistant haplotype in specific areas. Concerningly, PfK13 variants associated with partial artemisinin resistance – R622I (10%), A675V (1.7%), and P441L (1.1%) – were expanding. Drug resistance markers were primarily found in settings with low transmission of P. falciparum and high levels of P. vivax coendemicity. Along with a distinct parasite lineage and limited gene flow, these findings suggest that local evolutionary and therapeutic pressures shape resistance. This underscores the urgent need for targeted genomic surveillance to guide tailored interventions and contain the spread of multidrug-resistant P. falciparum.
We have submitted our raw sequence data (available at ENA, see section 7), a detailed laboratory Drug resistance Tagets genes amplification protocol (Supplementary_Information.doc), extended data figures (Extended_Data.doc), and a comprehensive data file (Supplementary_Tables.xlsx) containing all prevalence data, statistical analyses, and sample metadata. Custom R and Python scripts used for the analyses are available on GitHub (see section 6).
Dataset for: Genomic Surveillance of Antimalarial Drug Resistance in Ethiopian Plasmodium falciparum
1. General Information
This dataset provides comprehensive genetic data on antimalarial drug resistance in Plasmodium falciparum, the deadliest malaria parasite. The data were generated from targeted deep-amplicon sequencing of 605 P. falciparum samples collected from 15 geographically diverse districts in Ethiopia between 2019 and 2023. The dataset includes the prevalence and geographic distribution of known and novel genetic mutations in five key drug resistance genes (Pfcrt, Pfdhfr, Pfdhps, Pfmdr1, Pfk13) and additional genetic regions (pfMt). It also contains analyses of how resistance patterns correlate with local malaria transmission intensity and the co-endemicity of P. vivax, another malaria parasite. This resource is critical for understanding the evolving landscape of drug resistance to inform public health strategies for malaria control and elimination in the region. Our study generated datasets that contain data related to targeted genomic surveillance of drug resistance genes and the mitochondrial genome in Plasmodium falciparum parasites in Ethiopia. raw sequence data (available at ENA, see section 7), a detailed laboratory Drug Resistance Targets genes amplification protocol (Supplementary_Information.doc), extended data figures (Extended_Data.doc), and a comprehensive data file (Supplementary_Tables.xlsx) containing all prevalence data, statistical analyses, and sample metadata. Custom R and Python scripts used for the analyses are available on GitHub (see section 6)
2. Study Background and Design
Scientific Context
Antimalarial drug resistance is a major threat to global malaria control. In Ethiopia, the situation is particularly complex due to several factors:
- Artemisinin Partial Resistance: The emergence of parasites partially resistant to artemisinin (the frontline treatment) is a global concern. This study surveils for the genetic markers associated with this resistance.
- P. vivax** Co-endemicity:** Ethiopia has a high burden of both P. falciparum and P. vivax malaria. The first-line treatment for P. vivax is chloroquine, a drug to which P. falciparum developed widespread resistance decades ago. The continued use of chloroquine for P. vivax may exert ongoing selective pressure on co-circulating P. falciparum parasites, potentially maintaining chloroquine resistance markers in the population.
- Variable Transmission: Malaria transmission intensity varies significantly across Ethiopia, which can influence the spread and maintenance of drug-resistant parasites.
This dataset was generated to monitor these dynamics by examining the genetic makeup of parasites across regions with different epidemiological profiles on the p. falciparum samples collected across 15 Districts.
Key Genes Surveyed
- Pfk13 (Plasmodium falciparum kelch 13): The primary molecular marker for artemisinin partial resistance. Mutations in its propeller domain are associated with delayed parasite clearance.
- Pfcrt (Plasmodium falciparum chloroquine resistance transporter): The primary determinant of chloroquine resistance. The CVIET haplotype is the classic resistance form.
- Pfmdr1 (Plasmodium falciparum multidrug resistance 1): Modulates resistance to multiple drugs, including chloroquine and lumefantrine (a partner drug in ACTs).
- Pfdhfr (dihydrofolate reductase) & Pfdhps (dihydropteroate synthase): Mutations in these genes confer resistance to the older antimalarial combination sulfadoxine-pyrimethamine (SP).
- pfcytb/pfMt (mitochondrial genome): Genetic regions associated with resistance to other antimalarials, such as atovaquone.
Key Epidemiological Metrics
- API (Annual Parasite Index): A measure of malaria transmission intensity, calculated as:
(Total confirmed malaria cases per year / Population at risk) * 1,000
. A higher API indicates more intense transmission. - P. vivax / P. falciparum Ratio: The proportion of malaria cases caused by P. vivax versus P. falciparum. This metric serves as a proxy for the level of chloroquine drug pressure in a region.
3. Terminology and Variable Guide
- Variant/Mutation: A specific change in the DNA sequence of a gene. e.g., N51I means the amino acid Asparagine (N) at position 51 has been replaced by Isoleucine (I).
- Haplotype: A combination of specific variants (mutations) that are inherited together on the same gene. e.g., CVIET is a haplotype of the Pfcrt gene composed of specific mutations at codons 72-76 that confer chloroquine resistance.
- Wild-type (Wt): The original, non-mutated form of a gene or haplotype, typically associated with drug sensitivity. e.g., CVMNK is the wild-type (chloroquine-sensitive) Pfcrt haplotype.
- n/N: A fraction where ‘n’ is the number of samples with a specific characteristic (e.g., a mutation) and ‘N’ is the total number of samples successfully analyzed for that characteristic.
- 95% CI (Confidence Interval): A range of values that likely contains the true population prevalence with 95% confidence.
- Chi² (Chi-squared): A statistical test used to determine if there is a significant association between two categorical variables, such as the prevalence of a mutation across different geographic districts.
- p-value: The probability of observing the data (or more extreme data) if there were no real effect or association. A low p-value (typically < 0.05) is considered statistically significant.
- ρ (rho): Spearman’s rank correlation coefficient. It measures the strength and direction of a relationship between two ranked variables. It ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).
4. Description of Data Files
File: Supplementary_Information.doc
- Purpose: This document provides the complete laboratory methodology used to generate the genetic data. It is essential for understanding how the data was produced and for replicating the experiment.
- Content: A step-by-step protocol for the multiplex polymerase chain reaction (PCR) assays. This includes a full list of reagents and equipment, primer sequences for amplifying the seven target genetic regions (kelch13, mdr1, crt, dhfr, dhps,cytb/ Mt), reagent setup instructions, PCR master mix recipes, and specific thermocycler conditions for each of the two multiplex assays. It also includes troubleshooting guidance.
File: Extended_Data.docx
- Purpose: This file contains figures that provide high-level visual summaries and supplementary analyses of the data.
- Content:
- Extended Data 1 & 2 (Cluster Heatmaps): These heatmaps visualize mutation frequencies (Extended Data 1) and sequencing coverage (Extended Data 2) for each of the 605 samples. Rows represent individual parasite samples, and columns represent specific amino acid positions. The hierarchical clustering of rows groups genetically similar parasites together. Rows are color-coded by the sample’s district of origin, and columns are color-coded by gene, providing a rich, multi-layered view of the data.
- Extended Data 3 (Prevalence Table): This table is a high-level summary showing the prevalence of the most important resistance haplotypes in each of the 15 study districts.
- Extended Data 4 (Haplotype Network Diagrams): These diagrams visualize the evolutionary relationships between different genetic variants (haplotypes) for the four main drug resistance genes. Each circle represents a unique haplotype, with its size proportional to its frequency. Colors indicate the geographic origin of the samples. This helps to trace the potential origin and spread of resistance.
- Extended Data 5 (Correlation Bubble Heatmap): This heatmap visually summarizes the statistical analysis from Supplementary Table 3. It shows the relationship between resistance haplotype prevalence and local epidemiology. The color of the bubble indicates the correlation (green=positive, red=negative), while its size shows the overall frequency of that haplotype. Asterisks denote statistical significance.
File: Supplementary_Tables.docx (also in .csv formats Supplementary_Table_1.csv, Supplementary_Table_2.csv, Supplementary_Table_3A.csv, Supplementary_Table_3B.csv, Supplementary_Table_4.csv, Supplementary_Table_5.csv, Supplementary_Table_6.csv)
- Purpose: This spreadsheet is the core of the dataset, containing all the raw prevalence counts, statistical analyses, and sample metadata in a structured format. This spreadsheet file contains all core supplementary data organized into six distinct sheets.
- Supplementary Table 1 - Prevalence of Known Resistance Variants
- Purpose: This sheet documents the frequency of well-established antimalarial resistance mutations and haplotypes across the 15 Ethiopian study sites. It is the core data for understanding the geographic distribution of resistance.
- Variables:
Gene
: The drug resistance gene analyzed.Variant/Haplotype
: The specific mutation (e.g., N51I) or haplotype (e.g., AIRNI) being reported.[District Name] (n/N, %, 95% CI)
: Columns for each study district, showing the raw counts, prevalence percentage, and 95% confidence interval.Overall
: Aggregated data across all districts.Chi-squared statistic
andp-value
: Statistical results assessing whether the prevalence of a variant differs significantly across the districts.
- Supplementary Table 2 - Novel Missense Mutations
- Purpose: This sheet details the prevalence of novel or less-characterized non-synonymous (amino acid-changing) mutations found in the Pfmdr1 and Pfk13 genes. This is important for identifying potentially new or emerging resistance markers.
- Variables:
Gene
,Variant
,[District Name]
,Overall
,ChiX²
,p-value
: As described above.Significance
: A quick reference for statistical significance (***
for p < 0.001,*
for p < 0.05).
- Supplementary Tables 3A & 3B - Correlation Analysis
- Purpose: This sheet provides the numerical data behind the visualization in Extended Data 5. It investigates the potential drivers of resistance by correlating haplotype frequencies with local epidemiological conditions.
- Variables:
gene
,haplotype
: The genetic marker being analyzed.API
: Correlation results with the Annual Parasite Index.P. vivax/P. falciparum
: Correlation results with the ratio of P. vivax infections.ρ (rho)
: Spearman’s rank correlation coefficient.P value
(Table 3A) and95% CI
(Table 3B): Measures of the correlation’s significance and range.
- Supplementary Table 4 - Wild-Type Variant Prevalence
- Purpose: This sheet presents the “other side” of Table 1, showing the frequency of the original, drug-sensitive (wild-type) alleles at key resistance codons. This is useful for identifying regions where parasites may still be susceptible to older drugs.
- Variables: Same structure as Table 1, but reporting on wild-type alleles.
- Supplementary Table 5 - Co-occurrence of Mutations
- Purpose: This sheet analyzes which pairs of mutations are found together in the same parasite more often than expected by chance. This can suggest functional interaction or might genetic linkage between mutations.
- Variables:
Gene1
,Variant1
: The first gene and mutation in the pair.Overall Total Counts of Isolate with Mutation1
: The total number of isolates across the dataset that carry Mutation1.Gene2
,Variant2
: The second gene and mutation in the pair.Overall Total Counts of Isolate with Mutation2
: The total number of isolates that carry Mutation2.Observed Counts of Co-occurred Mutation pair
: The number of isolates that carry both Mutation1 and Mutation2.Same Gene
: A boolean field (TRUE/FALSE) indicating if the two mutations are in the same gene.Chi-squared statistic
: The statistic from the test of association for co-occurrence.P-value for Co-occurring Mutation pair
: The p-value indicating the significance of the co-occurrence.
- Supplementary Table 6 - Sample IDs and Metadata
- Purpose: This sheet serves as the master key, linking the internal sample IDs to their public accession numbers in the European Nucleotide Archives (ENA), along with collection year and location.
- Variables:
ENA_id
: The unique accession ID for the raw sequence data in the ENA database.Sample name
: The internal identifier used in this study.Collection year
: Year the sample was collected.District
: The district in Ethiopia of the sample origin.
5. Ethical Approval
Written informed consent was obtained from all study participants and their parents or guardians before recruitment. All personally identifiable information was removed from the dataset. The study protocol was approved by the Armauer Hansen Research Institute and ALERT Hospital Ethics Review Committee (PO/46/20) and the University of Notre Dame Institutional Review Board (approval no. 20-12-6350).
6. Code and Software
Custom scripts used for the bioinformatics and statistical analyses are publicly available on GitHub.
- Bioinformatics Analysis: https://github.com/leenvh/EMAGEN
- Identity-by-Descent (IBD) Analysis: https://github.com/LSHTMPathogenSeqLab/malaria-hub
7. Data Access and Citation
Raw sequence data for the 604 sequenced isolates in this study can be found in the European Nucleotide Archives (ENA) under project accession number PRJEB87415. The sample names, accession numbers, and metadata are listed in Sheet 6 of the Supplementary_Tables.xlsx
file.
Human subjects data
Written informed consent was obtained from all study participants and their parents or guardians before recruitment. All personally identifiable information was removed in the dataset.
Surveillance sites and sample collection
Fifteen sites across Ethiopia's administrative districts were selected for genomic surveillance of antimalarial resistance (Fig. 1A). Site selection prioritized settings that reported malaria outbreaks [31,68,69], markers of drug-resistant parasites [34],high pfhrp2/3 deletion rates [34], representation of the transmission strata of the country (ranging from low and unstable to high, Fig. 1B), varying P. vivax co-endemicity with P. falciparum (Fig. 1C), and areas prone to importation of parasites from neighboring countries.
A total of 605 blood samples were collected from symptomatic P. falciparum between October 2019 and January 2023. Finger-prick blood samples (0.5 mL) collected in EDTA microtainer tubes were used for malaria diagnosis by microscopy and to prepare dried blood spots (DBS) on Whatman 3MM filter paper.
Ethical Consideration
The study protocol was approved by the Armauer Hansen Research Institute and ALERT Hospital Ethics Review Committee (PO/46/20) and the University of Notre Dame Institutional Review Board (approval no. 20-12-6350). Written informed consent was obtained from all study participants and their parents or guardians before recruitment. The study adhered to the Declaration of Helsinki principles.
DNA extraction, P. falciparum species confirmation, parasite quantification
Genomic DNA was extracted from 6 mm diameter punches of DBS using a MagMAX magnetic bead-based DNA multi-sample kit on a Kingfisher Flex robotic extractor machine (Thermo Fisher Scientific) following the manufacturer’s protocol and eluted in 150 µL low-salt elution buffer. Quantitative polymerase chain reaction[P1] (qPCR) targeting the P. falciparum 18S rRNA small subunit gene for P. falciparum was performed using primer and probe sequences as previously described [70], using TaqMan Fast Advanced Master Mix (Applied Biosystems). P. falciparum parasites were quantified using standard curves generated from a serial dilution of NF54 ring-stage parasites (10^6 to 10^3 parasites/mL). Samples with parasitemia of 50 parasite/μL or higher were selected for drug resistance marker genotyping.
Amplification of Antimalarial Drug Resistance Genes, Sample and Library Preparation, and Targeted Deep-Amplicon Sequencing
Amplicon sequencing targeting the entire length of six antimalarial drug resistance genes (Pfcrt, Pfmdr1, Pfk13, Pfdhfr, and Pfdhps) and the mitochondrial genome were employed as described before [71] with few modifications as detailed in Supplementary Information; Single gene PCR amplification assays were modified into two multiplex PCR reactions: Multiplex-I for Pfk13, Pfmdr1, and Pfcytb genes; and Multiplex-II for PfMit, Pfcrt, Pfdhps and Pfdhfr genes amplification. Normalized amplicons pooled from each patient sample were used for library preparation using the Nextera DNA Flex Library Preparation Kit (Illumina, USA). Seven PCR cycles were used for indexing followed by standard purification of the final libraries. Library quantification was performed using a Qubit 4.0 fluorometer with a DNA HS kit (Invitrogen, USA), and fragment size distribution was assessed using a Bioanalyzer 2100 with a DNA HS assay kit (Agilent, Germany). All libraries were normalized to 4 nM, pooled, and sequenced on a NextSeq 500/550 (Illumina, Singapore) using 2 × 149 bp paired-end reads. In total, 605 P. falciparumisolates were successfully amplified and pooled, and sequencing data were obtained for 604 isolates.
Sequence alignment, variant and haplotype identification
Raw Illumina paired-end reads were quality-controlled using the next-generation Sequence Analysis Toolkit (https://github.com/yyr4/Nf-NeST/). Reads with a Phred score below 20 and length below 100 bp were removed. SNPs were identified using the same pipeline, which employs an ensemble of four variant callers (Samtools, freeBayes, HaplotypeCaller, and VarDict_v1.8.2) to increase the detection confidence. Known drug-resistance-associated SNPs were filtered for a minimum read coverage of 5´ at the corresponding position. Novel SNPs were reported only if they were detected by at least two variant callers and were present in at least ten samples. After quality control and removal of low-depth coverage samples, 586 isolates were retained for downstream analysis. Haplotypes were constructed for Pfcrt (codons 72-76), Pfmdr1 (codons 86, 184, 1034, 1042, and 1246), Pfdhps (codons 431, 436, 437, 540, 581, and 613), and Pfdhfr (codons 16, 51, 59, 108, and 164) using a variant allele frequency threshold of ≥95% for mutant alleles and ≤5% for wild-type alleles [73]. Samples that did not meet these criteria, which are potentially indicative of polyclonal infections, were excluded from the haplotype analysis.
Statistical and Population genetic analyses
Field data were initially collected using paper-based forms and subsequently double-entered into RedCap (mobile application version 5.20.11). Any discrepancies identified during the data entry process were resolved by cross-referencing the original paper forms and reaching a consensus among the data managers.
Associations between haplotype frequencies, the annual parasite index (API), and the fraction of P. vivax and total infections were analyzed using Spearman’s correlation. To adjust for small sample sizes and to achieve robust estimates, 200 bootstrap resampling iterations were used. The API was based on the total confirmed malaria cases and was calculated as follows: (total confirmed malaria cases per year ÷ population at risk) × 1,000. Co-occurrence patterns of key missense mutation pairs were evaluated using chi-square tests, highlighting statistically significant associations. Geographic variation in SNP and haplotype frequencies was assessed using chi-squared tests (statistical significance: p < 0.05), and the Wilson score interval method was used to calculate 95% confidence intervals. The median and interquartile range (IQR) were used to summarize the age distribution across sites.
For haplotype analysis of P. falciparum drug resistance genes (Pfcrt, Pfmdr1, Pfk13, Pfdhps, Pfdhfr), full-gene amplicon sequences were aligned using MAFFT software [74]. Haplotype networks were constructed using R package pegas [75]. SNP-based principal component analysis (PCA) was conducted using the scikit-allel Python package [76]. Publicly available P. falciparum genomic sequences were obtained from previous studies in East Africa and the Horns of Africa [77–79]. A neighbor-joining tree was created using pairwise genetic distance matrices and ape R package [75]. Identity-by-descent (IBD) was used to assess relatedness between isolates by estimating the pairwise fraction of shared ancestry between the sequenced segments. This was done using the hmmIBD software, which applies a hidden Markov model-based approach to estimate shared ancestry while taking recombination into account [81]. Network graphs were created using the igraph R package (v.1.3.5) [82]. The codes used for the bioinformatics and statistical analyses are available at https://github.com/leenvh/EMAGEN.