Nationwide real-world implementation of AI for cancer detection in population-based mammography screening (PRAIM)
Data files
Abstract
The PRAIM study (PRospective multicenter observational study of an integrated AI system with live Monitoring) assessed the impact of AI-based decision support software on breast cancer screening outcomes. This Dryad data package contains the anonymized data from 461 818 screening cases across 12 screening sites in Germany. Variables include screening outcomes such as cancer detection, use of the AI software, radiologist assessments, cancer characteristics, and further metadata. The data can be used to reproduce the analyses on the performance of AI-supported breast cancer screening versus standard of care published in Nature Medicine: Nationwide real-world implementation of AI for cancer detection in population-based mammography screening.
README: Nationwide real-world implementation of AI for cancer detection in population-based mammography screening (PRAIM) – Dataset
The PRAIM study (PRospective multicenter observational study of an integrated Artificial Intelligence system with live Monitoring) was conducted within the German breast cancer screening program from July 2021 to February 2023 to assess the impact of AI-based decision support software. This dataset contains the data from PRAIM.
Context
The PRAIM study has been published in Nature Medicine. Please refer to the article Nationwide real-world implementation of AI for cancer detection in population-based mammography screening for further information on study design, results, and discussion of impact. The study has been previously registered in the German Clinical Trials Register and the study protocol can be found on the website of the University of Luebeck.
Below you can find information contextualizing the study and how it was embedded within the German breast screening program, as well as a description of the data and the access to the code/software to reproduce the results.
Context: Mammography screening in Germany
In Germany, women aged 50-69 are invited to breast cancer screening every two years. Each round of screening is called a screening examination. The goal of screening is to detect breast cancer early while it's still easily treatable and subsequently improve outcomes for women. After the mammography (x-ray images of the breasts), two radiologists independently assess the images (called a read) – with additional prior images from earlier screening rounds if available – to look for suspicious tissue in the breast. Optionally, there can be a third supervising radiologist, especially if one of the first two radiologists is still inexperienced. Each radiologist independently decides whether the examination is suspicious (warranting further diagnostic tests) or unsuspicious (screening process stops for the woman here, to be restarted two years later).
If at least one of the initial radiologists deems the examination suspicious, a consensus conference is initiated. In this conference, a group of radiologists jointly decides whether the woman should be recalled for further investigations such as mammography magnifications, ultrasound, or MRI. If the examination is still suspicious after recall, a minimally invasive pre-operative biopsy is typically initiated for histopathological confirmation. If necessary, treatment is started subsequently.
Context: Vara
Vara is a Berlin-based company which offers an AI-supported viewer to aid radiologists when assessing breast cancer screening mammographies.
When using Vara, radiologists were supported by two AI-based features:
- Normal triaging: The software selects a subset of all mammograms (~60% of all examinations in PRAIM) that are deemed highly unsuspicious by the AI model. These examinations are tagged “normal” in the worklist.
- Safety net: The software selects a subset of all examinations that are deemed highly suspicious by the AI model (~1.5% of all examinations in PRAIM). Radiologists first assess the screening examination without any AI tags. Only when a radiologist interprets an examination as unsuspicious is the safety net activated, with an alert and a suggested localization of the suspicious region(s) in the images. Radiologists are then asked to reconsider their decision and can either accept or reject the safety net’s suggestion.
For some screening examinations, neither the normal triaging nor the safety net is active. These examinations are left unclassified and are therefore read by radiologists without AI support. Please refer to the supplementary material of the linked article for example images.
Context: PRAIM
The data for the PRAIM study was collected from July 1st, 2021 to February 23rd, 2023 from 12 screening units in Germany. Please note that the data is observational, i.e. there was no randomization: when assessing an examination, radiologists could decide whether they wanted to diagnose in Vara (with AI support) or in other software without AI support. The main analysis model (described in the article) controls for the two identified confounders – `ai_prediction` and `reader_set` (see the description of columns below) – via propensity scores and overlap weighting in a simple regression model. For more details, please refer to the article.
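The propensity score and overlap weighting step can be sketched as follows. This is only a minimal illustration on synthetic data, not the article's analysis model (which additionally adjusts for `reader_set` and uses a regression framework); all variable values below are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
# Synthetic stand-ins for dataset columns (illustration only)
ai_prediction = rng.integers(0, 2, n)                        # 1 = "not-normal"
used_ai_viewer = rng.random(n) < (0.4 + 0.3 * ai_prediction)  # treatment
cancer_detected = rng.random(n) < (0.005 + 0.004 * ai_prediction)

# Propensity of being read in the AI-supported viewer, given the confounder
ps_model = LogisticRegression().fit(ai_prediction.reshape(-1, 1), used_ai_viewer)
e = ps_model.predict_proba(ai_prediction.reshape(-1, 1))[:, 1]

# Overlap weights: treated units get 1 - e, control units get e
w = np.where(used_ai_viewer, 1 - e, e)

# Overlap-weighted difference in cancer detection rate (AI arm minus control)
rate_ai = np.average(cancer_detected[used_ai_viewer], weights=w[used_ai_viewer])
rate_ctrl = np.average(cancer_detected[~used_ai_viewer], weights=w[~used_ai_viewer])
print(f"overlap-weighted difference: {rate_ai - rate_ctrl:+.4f}")
```

Overlap weights emphasize examinations whose propensity lies away from 0 and 1, i.e. those that plausibly could have landed in either arm.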
The study was approved by the ethics committee of the University of Lübeck (22-043).
Description of the data and file structure
The dataset from the PRAIM study is encapsulated in a single CSV file, comprising 461 818 rows, each representing a unique screening examination (and also one participating woman) from the German breast cancer screening program. The data spans a range of information including patient demographics, screening outcomes, and AI model predictions. With 21 columns, the dataset details primary and secondary endpoints such as AI usage in screening, cancer detection, and recall rates, as well as metadata like screening dates and units, patient age, and breast density. Additionally, it includes cancer-specific data like invasiveness and stage, and safety net data indicating AI model alerts. This rich dataset allows for the verification of study results and further analysis of AI's role in breast cancer screening. The order of rows was randomized.
Please find a detailed description of the columns below:
- `study_id`: Auto-incrementing id from `1` to `461 818`.
- Primary and secondary endpoints:
  - `cancer_detected`: `True` when a cancer was detected in this screening examination (confirmed by histopathology), `False` otherwise. Overall, there are 2 881 cancer cases (0.62% of the full population).
  - `had_recall`: `True` when the examination was recalled (invited for follow-up investigations like another mammography, ultrasound, MRI, or biopsy), `False` otherwise. Roughly 4.2% of the full population has been recalled.
- Columns used in the main regression model:
  - `used_ai_viewer`: `True` if at least one of the two/three reads was done in the AI-supported viewer, `False` otherwise. This determines the PRAIM study arm: control arm without AI (201 079 examinations) or AI arm (260 739 examinations).
  - `reader_set`: Unique anonymized identifier for the radiologists who did the first and second read (with optional supervision read). Examples: When radiologists A and B read an examination, this might lead to `reader_set = 1`. This id will be used for all examinations that radiologists A and B read together (which might be hundreds or even thousands). If another examination is read by radiologists A and C, this will yield a different `reader_set = 2`. If radiologist A and the inexperienced radiologist D read an examination together, where D is subsequently supervised by radiologist B, this will also yield a new `reader_set` different from `1` (where just A and B read together). In summary, each set of radiologists (which might be two or three) is given exactly one unique id. Overall, there are 547 different sets of radiologists.
  - `readers`: Readers who participated in the reading of the current examination, separated by `_`. For example, the value `8_58` means that radiologists 8 and 58 read this study. There can also be three readers (e.g. `5_76_101`) if there was a supervision read. For 21 examinations, only one reader was documented. Overall, there are 119 different radiologists.
  - `ai_prediction`: The AI selects a subset of examinations that are confidently unsuspicious/do not warrant further diagnostic examinations. These examinations are called `normal` (~57%), all others `not-normal` (~43%). The AI prediction happens independently of whether the radiologists eventually use the AI-supported viewer, i.e. these predictions are also available for the control arm.
- Tertiary endpoints:
  - `had_consensus_conference`: `True` when the examination had a consensus conference after at least one suspicious initial read, `False` otherwise. The consensus conference decides whether or not to recall the woman. Roughly 12% of screening examinations had a consensus conference.
  - `had_pre_operation_biopsy`: `True` when the examination had a minimally invasive pre-operative biopsy, `False` otherwise. This typically happens after a recall. About 1% of women had a pre-operative biopsy.
- Metadata:
  - `screening_date`: Date of the screening examination, anonymized to the half-year (e.g. `"2022-H2"`).
  - `screening_unit`: The screening unit the examination belongs to, in anonymized form (e.g. `"screening_unit_01"`). Overall, there are 12 screening units.
  - `age_at_screening`: The age of the screened woman at the time of screening, grouped into four buckets: `50-54`, `55-59`, `60-64`, and `65-69`. For two women, the age was unavailable (`MISSING`).
  - `screening_round`: Whether it's the first time the woman participated in mammography screening (`first`, roughly 17% of examinations) or a consecutive screening round (`follow-up`, roughly 83% of examinations).
  - `breast_density`: Density of the breast. Values are either `dense` (~66%) or `non-dense` (~34%). For a few examinations, breast density is not available (it is not mandatory to fill out); these are marked as `MISSING`.
  - `supervision`: `True` if the examination had an additional supervision read (~7%), `False` otherwise (~93%).
  - `hardware_vendor`: Which hardware vendor was used for image acquisition. Either `FUJI`, `GE`, `HOLOGIC`, `IMS_GIOTTO`, `SIEMENS`, or `UNKNOWN`.
  - `first_read`, `second_read`: The assessment of the first and second reader during the initial reads, either `normal` or `suspicious`. Supervision reads are filtered out.
- Cancer-specific data:
  - `invasiveness`: Describes the histopathological appearance of the breast cancer. Can be either `INVASIVE` (invasive breast cancer), `DCIS` (ductal carcinoma in situ), or `OTHER`. Only available when `cancer_detected = True`, otherwise the value is `NOT_APPLICABLE` (also see the guide on missing data below).
  - `invasive_cancer_size`: Size of the invasive cancer, if any. Grouped into three buckets: less than 10 mm, 10 to 20 mm, larger than 20 mm. Can be `MISSING` for a few examinations due to lack of documentation. Only available for invasive breast cancer, `NOT_APPLICABLE` otherwise.
  - `invasive_grade`: Grade of the cancer, describing how much the cancer cells look like normal cells once removed from the breast and checked in the lab. Values range from `G1` (well differentiated) to `G3` (poorly differentiated). The value is `CANNOT_BE_DETERMINED` when physicians tried but could not determine the grade, and `MISSING` when there is no documentation. Only available for invasive breast cancer, `NOT_APPLICABLE` otherwise.
  - `cancer_stage`: Stage of the breast cancer, describing the extent or severity of an individual's cancer. Values range from `STAGE_0` (early stage) to `STAGE_4` (late stage). The later the stage, the worse the prognosis on 5-year survival. `STAGE_0` corresponds to `DCIS`. The value is `CANNOT_BE_DETERMINED` when physicians could not determine the stage, and `MISSING` when there is no documentation. Only available for cancer cases, `NOT_APPLICABLE` otherwise.
- Safety net data:
  - `safety_net_triggered`: `True` when the safety net was activated for this examination, i.e. when the model deemed this examination to be highly suspicious and recommends further diagnostics. `False` otherwise.
  - `safety_net_shown`: `True` when a radiologist initially deemed the examination unsuspicious but was then shown the safety net alert and a suggested localization in the AI-supported viewer. `False` otherwise. This implies `safety_net_triggered`. Note that this value is `True` once any of the up to three radiologists has been shown the safety net alert. Also note that this is only possible when `used_ai_viewer = True`.
  - `safety_net_accepted`: `True` when a radiologist changed their opinion (from unsuspicious to suspicious) after being shown the safety net alert. `False` otherwise. This implies `safety_net_shown`. Again, this value is `True` once any of the up to three radiologists has accepted the safety net alert, and it is only possible when `used_ai_viewer = True`.
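The logical implications among the safety net columns (and `used_ai_viewer`) stated above can serve as a quick consistency check when working with the data. A small sketch on a few illustrative rows (values are made up):

```python
import pandas as pd

# Illustrative rows; the implication chain below follows the column notes:
# safety_net_accepted => safety_net_shown => safety_net_triggered,
# and safety_net_shown is only possible with used_ai_viewer = True
df = pd.DataFrame({
    "used_ai_viewer":       [True,  True,  True,  False],
    "safety_net_triggered": [True,  True,  False, True],
    "safety_net_shown":     [True,  False, False, False],
    "safety_net_accepted":  [True,  False, False, False],
})

# For booleans, a <= b encodes the implication "a implies b"
assert (df["safety_net_accepted"] <= df["safety_net_shown"]).all()
assert (df["safety_net_shown"] <= df["safety_net_triggered"]).all()
assert (df["safety_net_shown"] <= df["used_ai_viewer"]).all()
print("safety net consistency checks passed")
```

Note that `safety_net_triggered` is a model-side flag, so it can also be `True` for control-arm examinations where no alert was ever shown.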
Guide on missing data:
There are three levels of missingness and they are documented differently in the data set:
- `NOT_APPLICABLE`: This value is used when a certain column does not apply to a screening examination, e.g. `invasiveness` can only be reported for actual cancers and does not make sense when the examination is normal. For healthy women, the value in the column is therefore `NOT_APPLICABLE`.
- `CANNOT_BE_DETERMINED`: This value is used when physicians tried to determine a value but failed to do so. For example, `cancer_stage` cannot always be determined when either the size, lymph node status, or metastasis status cannot be determined. In medical documentation, this is typically documented with an `X`.
- `MISSING`: This value is used when the corresponding data entry is empty in the official screening documentation.
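When analyzing the data, it can be convenient to map these sentinel strings to proper missing values while keeping the distinction available where it matters. The helper below is my own sketch (not part of the dataset or study code); the sentinel names match the dataset:

```python
import pandas as pd

# The three sentinel values described in the guide above
SENTINELS = {"NOT_APPLICABLE", "CANNOT_BE_DETERMINED", "MISSING"}

def clean_column(s: pd.Series, keep: frozenset = frozenset()) -> pd.Series:
    """Replace sentinel strings with NA, except those listed in `keep`."""
    return s.mask(s.isin(SENTINELS - keep))

# Illustrative values for `invasive_grade`
grades = pd.Series(["G1", "G3", "NOT_APPLICABLE", "CANNOT_BE_DETERMINED", "G2"])
print(clean_column(grades).dropna().tolist())  # ['G1', 'G3', 'G2']
```

Whether `CANNOT_BE_DETERMINED` should be dropped or treated as its own category depends on the analysis; the `keep` parameter leaves that choice to the caller.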
Using this data set, consumers should be able to verify all results from the accompanying article.
Tables 1 and 2 are descriptive and can be verified directly from the data. Tables 3, 4, and 5 require the study analysis model; please find the code for that below.
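Descriptive statistics of the kind reported in Tables 1 and 2 reduce to simple group-bys over the CSV. A minimal sketch on a tiny synthetic stand-in (the real file name and values will differ; load the actual Dryad CSV with `pd.read_csv` instead):

```python
import pandas as pd

# Tiny synthetic stand-in for the real CSV; column names follow the
# descriptions above, values are illustrative only
df = pd.DataFrame({
    "used_ai_viewer":  [True, True, False, False, True],
    "cancer_detected": [True, False, False, True, False],
    "had_recall":      [True, False, True, True, False],
})

# Cancer detection rate (per 1000 examinations) and recall rate by study arm
by_arm = df.groupby("used_ai_viewer").agg(
    cdr_per_1000=("cancer_detected", lambda s: s.mean() * 1000),
    recall_rate=("had_recall", "mean"),
)
print(by_arm)
```

The same pattern extends to stratified breakdowns, e.g. grouping additionally by `breast_density` or `screening_round`.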
Dataset collection
The dataset was collected as part of Vara's quality assurance under the medical device's post-market surveillance plan, as required by the Medical Device Regulation (the EU's legal framework for medical devices).
In Germany's mammography screening program, screening data has been continuously collected since the program's inception in the mid-2000s in two nationwide databases (depending on the federal state): the “MaSc” software in 11 of the screening units participating in PRAIM and the “MammaSoft” software in one participating screening unit. These databases support documentation of invitation management, diagnosis, and treatment and also facilitate the safe and secure data exchange with reference centers and cancer registries. Vara uses these databases – as well as product-internal data (AI predictions, radiologist decisions, whether Vara was used, etc.) – to continuously collect data in an anonymized quality assurance database.
In clinical practice, this database is used to monitor the medical device and ascertain its continued safety and effectiveness according to its intended use, e.g. distribution shifts of the AI model’s underlying prediction scores, the interaction of radiologists and AI (especially disagreement rates), and false negatives and false positives of both radiologists and the AI. This is also the database used for this study.
A few screening examinations had to be excluded from the dataset for reasons detailed in Figure 1 in the article. Apart from that, there was no filtering and the study had no exclusion criteria.
Code/Software
The software used to reproduce the results from the article is available on Zenodo: 10.5281/zenodo.10822135.
Acknowledgements
We sincerely thank Danalyn Byng for her early contributions in initiating the PRAIM study, especially on the study protocol and site onboarding. We are grateful for the support of the recruiting screening sites (Mittelrhein, Niedersachsen-Süd-West, Niedersachsen-Nord, Hannover, Herford/Minden-Lübbecke, Steinfurt, Köln rechtsrheinisch / Leverkusen, Wuppertal / Remscheid / Solingen / Mettmann, Niedersachsen-Mitte, Südwestliches Schleswig-Holstein, Wiesbaden, Niedersachsen-Nordwest). We are grateful for the invaluable technical support and data processing work provided by Dominik Schüler and Benjamin Strauch. Finally, we thank all women who contributed data to the study.