Nationwide real-world implementation of AI for cancer detection in population-based mammography screening (PRAIM)
Data files
Abstract
The PRAIM study (PRospective multicenter observational study of an integrated AI system with live Monitoring) assessed the impact of AI-based decision support software on breast cancer screening outcomes. This Dryad data package contains the anonymized data from 461 818 screening cases across 12 screening sites in Germany. Variables include screening outcomes such as cancer detection, use of the AI software, radiologist assessments, cancer characteristics, and further metadata. The data can be used to reproduce the analyses on the performance of AI-supported breast cancer screening versus standard of care published in Nature Medicine: Nationwide real-world implementation of AI for cancer detection in population-based mammography screening.
README: Nationwide real-world implementation of AI for cancer detection in population-based mammography screening (PRAIM) – Dataset
The PRAIM study (PRospective multicenter observational study of an integrated Artificial Intelligence system with live Monitoring) was conducted within the German breast cancer screening program from July 2021 to February 2023 to assess the impact of AI-based decision support software. This dataset contains the data from PRAIM.
Context
The PRAIM study has been published in Nature Medicine. Please refer to the article Nationwide real-world implementation of AI for cancer detection in population-based mammography screening for further information on study design, results, and discussion of impact. The study has been previously registered in the German Clinical Trials Register and the study protocol can be found on the website of the University of Luebeck.
Below you can find information contextualizing the study and how it was embedded within the German breast screening program, as well as a description of the data and the access to the code/software to reproduce the results.
Context: Mammography screening in Germany
In Germany, women aged 50-69 are invited to breast cancer screening every two years. Each round of screening is called a screening examination. The goal of screening is to detect breast cancer early while it's still easily treatable and subsequently improve outcomes for women. After the mammography (x-ray images of the breasts), two radiologists independently assess the images (called a read) – with additional prior images from earlier screening rounds if available – to look for suspicious tissue in the breast. Optionally, there can be a third supervising radiologist, especially if one of the first two radiologists is still inexperienced. Each radiologist independently decides whether the examination is suspicious (warranting further diagnostic tests) or unsuspicious (screening process stops for the woman here, to be restarted two years later).
If at least one of the initial radiologists deems the examination suspicious, a consensus conference is initiated. In this conference, a group of radiologists jointly decides whether the woman should be recalled for further investigations such as mammography magnifications, ultrasound, or MRI. If the examination is still suspicious after recall, a minimally invasive pre-operative biopsy is typically initiated for histopathological confirmation. If necessary, treatment is started subsequently.
Context: Vara
Vara is a Berlin-based company which offers an AI-supported viewer to aid radiologists when assessing breast cancer screening mammographies.
When using Vara, radiologists were supported by two AI-based features:
- Normal triaging: The software selects a subset of all mammograms (~60% of all examinations in PRAIM) that are deemed highly unsuspicious by the AI model. These examinations are tagged “normal” in the worklist.
- Safety net: The software selects a subset of all examinations that are deemed highly suspicious by the AI model (~1.5% of all examinations in PRAIM). Radiologists first assess the screening examination without any AI tags. Only when a radiologist interprets an examination as unsuspicious is the safety net activated, with an alert and a suggested localization of the suspicious region(s) in the images. Radiologists are then asked to reconsider their decision and can either accept or reject the safety net’s suggestion.
For some screening examinations, neither the normal triaging nor the safety net is active. These examinations are left unclassified and are therefore read by radiologists without AI support. Please refer to the supplementary material of the linked article for example images.
Context: PRAIM
The data for the PRAIM study was collected from July 1st, 2021 to February 23rd, 2023 from 12 screening units in Germany. Please note that the data is observational, i.e. there was no randomization: when assessing an examination, radiologists could decide whether they wanted to diagnose in Vara (with AI support) or in other software without AI support. The main analysis model (described in the article) controls for the two identified confounders – `ai_prediction` and `reader_set` (see the description of columns below) – via propensity scores and overlap weighting in a simple regression model. For more details, please refer to the article.
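The propensity score and overlap weighting step can be sketched as follows. This is only a minimal illustration on synthetic data, not the article's analysis model (which additionally adjusts for `reader_set` and uses a regression framework); all variable values below are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
# Synthetic stand-ins for dataset columns (illustration only)
ai_prediction = rng.integers(0, 2, n)                        # 1 = "not-normal"
used_ai_viewer = rng.random(n) < (0.4 + 0.3 * ai_prediction)  # treatment
cancer_detected = rng.random(n) < (0.005 + 0.004 * ai_prediction)

# Propensity of being read in the AI-supported viewer, given the confounder
ps_model = LogisticRegression().fit(ai_prediction.reshape(-1, 1), used_ai_viewer)
e = ps_model.predict_proba(ai_prediction.reshape(-1, 1))[:, 1]

# Overlap weights: treated units get 1 - e, control units get e
w = np.where(used_ai_viewer, 1 - e, e)

# Overlap-weighted difference in cancer detection rate (AI arm minus control)
rate_ai = np.average(cancer_detected[used_ai_viewer], weights=w[used_ai_viewer])
rate_ctrl = np.average(cancer_detected[~used_ai_viewer], weights=w[~used_ai_viewer])
print(f"overlap-weighted difference: {rate_ai - rate_ctrl:+.4f}")
```

Overlap weights emphasize examinations whose propensity lies away from 0 and 1, i.e. those that plausibly could have landed in either arm.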
The study was approved by the ethics committee of the University of Lübeck (22-043).
Description of the data and file structure
The dataset from the PRAIM study is encapsulated in a single CSV file, comprising 461 818 rows, each representing a unique screening examination (and also one participating woman) from the German breast cancer screening program. The data spans a range of information including patient demographics, screening outcomes, and AI model predictions. With 21 columns, the dataset details primary and secondary endpoints such as AI usage in screening, cancer detection, and recall rates, as well as metadata like screening dates and units, patient age, and breast density. Additionally, it includes cancer-specific data like invasiveness and stage, and safety net data indicating AI model alerts. This rich dataset allows for the verification of study results and further analysis of AI's role in breast cancer screening. The order of rows was randomized.
Please find a detailed description of the columns below:
- `study_id`: Auto-incrementing id from `1` to `461 818`.
- Primary and secondary endpoints:
  - `cancer_detected`: `True` when a cancer was detected in this screening examination (confirmed by histopathology), `False` otherwise. Overall, there are 2 881 cancer cases (0.62% of the full population).
  - `had_recall`: `True` when the examination was recalled (invited for follow-up investigations like another mammography, ultrasound, MRI, or biopsy), `False` otherwise. Roughly 4.2% of the full population has been recalled.
- Columns used in the main regression model:
  - `used_ai_viewer`: `True` if at least one of the two/three reads was done in the AI-supported viewer, `False` otherwise. This determines the PRAIM study arm: control arm without AI (201 079 examinations) or AI arm (260 739 examinations).
  - `reader_set`: Unique anonymized identifier for the radiologists who did the first and second read (with optional supervision read). Examples: When radiologists A and B read an examination, this might lead to `reader_set = 1`. This id will be used for all examinations that radiologists A and B read together (which might be hundreds or even thousands). If another examination is read by radiologists A and C, this will yield a different `reader_set = 2`. If radiologist A and the inexperienced radiologist D read an examination together, where D is subsequently supervised by radiologist B, this will also yield a new `reader_set` different from `1` (where just A and B read together). In summary, each set of radiologists (which might be two or three) is given exactly one unique id. Overall, there are 547 different sets of radiologists.
  - `readers`: Readers who participated in the reading of the current examination, separated by `_`. For example, the value `8_58` means that radiologists 8 and 58 read this study. There can also be three readers (e.g. `5_76_101`) if there was a supervision read. For 21 examinations, only one reader was documented. Overall, there are 119 different radiologists.
  - `ai_prediction`: The AI selects a subset of examinations that are confidently unsuspicious/do not warrant further diagnostic examinations. These examinations are called `normal` (~57%), all others `not-normal` (~43%). The AI prediction happens independently of whether the radiologists eventually use the AI-supported viewer, i.e. these predictions are also available for the control arm.
- Tertiary endpoints:
  - `had_consensus_conference`: `True` when the examination had a consensus conference after at least one suspicious initial read, `False` otherwise. The consensus conference decides whether or not to recall the woman. Roughly 12% of screening examinations had a consensus conference.
  - `had_pre_operation_biopsy`: `True` when the examination had a minimally invasive pre-operative biopsy, `False` otherwise. This typically happens after a recall. About 1% of women had a pre-operative biopsy.
- Metadata:
  - `screening_date`: Date of the screening examination, anonymized to the half-year (e.g. `"2022-H2"`).
  - `screening_unit`: The screening unit the examination belongs to, in anonymized form (e.g. `"screening_unit_01"`). Overall, there are 12 screening units.
  - `age_at_screening`: The age of the screened woman at the time of screening, grouped into four buckets: `50-54`, `55-59`, `60-64`, and `65-69`. For two women, the age was unavailable (`MISSING`).
  - `screening_round`: Whether it's the first time the woman participated in mammography screening (`first`, roughly 17% of examinations) or a consecutive screening round (`follow-up`, roughly 83% of examinations).
  - `breast_density`: Density of the breast. Values are either `dense` (~66%) or `non-dense` (~34%). For a few examinations, breast density is not available (it is not mandatory to fill out); these are marked as `MISSING`.
  - `supervision`: `True` if the examination had an additional supervision read (~7%), `False` otherwise (~93%).
  - `hardware_vendor`: Which hardware vendor was used for image acquisition. Either `FUJI`, `GE`, `HOLOGIC`, `IMS_GIOTTO`, `SIEMENS`, or `UNKNOWN`.
  - `first_read`, `second_read`: The assessment of the first and second reader during the initial reads, either `normal` or `suspicious`. Supervision reads are filtered out.
- Cancer-specific data:
  - `invasiveness`: Describes the histopathological appearance of the breast cancer. Can be either `INVASIVE` (invasive breast cancer), `DCIS` (ductal carcinoma in situ), or `OTHER`. Only available when `cancer_detected = True`, otherwise the value is `NOT_APPLICABLE` (also see the guide on missing data below).
  - `invasive_cancer_size`: Size of the invasive cancer, if any. Grouped into three buckets: less than 10 mm, 10 to 20 mm, larger than 20 mm. Can be `MISSING` for a few examinations due to lack of documentation. Only available for invasive breast cancer, `NOT_APPLICABLE` otherwise.
  - `invasive_grade`: Grade of the cancer, describing how much the cancer cells look like normal cells once removed from the breast and checked in the lab. Values range from `G1` (well differentiated) to `G3` (poorly differentiated). The value is `CANNOT_BE_DETERMINED` when physicians tried but could not determine the grade, and `MISSING` when there is no documentation. Only available for invasive breast cancer, `NOT_APPLICABLE` otherwise.
  - `cancer_stage`: Stage of the breast cancer, describing the extent or severity of an individual's cancer. Values range from `STAGE_0` (early stage) to `STAGE_4` (late stage). The later the stage, the worse the prognosis on 5-year survival. `STAGE_0` corresponds to `DCIS`. The value is `CANNOT_BE_DETERMINED` when physicians could not determine the stage, and `MISSING` when there is no documentation. Only available for cancer cases, `NOT_APPLICABLE` otherwise.
- Safety net data:
  - `safety_net_triggered`: `True` when the safety net was activated for this examination, i.e. when the model deemed this examination to be highly suspicious and recommends further diagnostics. `False` otherwise.
  - `safety_net_shown`: `True` when a radiologist initially deemed the examination unsuspicious but was then shown the safety net alert and a suggested localization in the AI-supported viewer. `False` otherwise. This implies `safety_net_triggered`. Note that this value is `True` once any of the up to three radiologists has been shown the safety net alert. Also note that this is only possible when `used_ai_viewer = True`.
  - `safety_net_accepted`: `True` when a radiologist changed their opinion (from unsuspicious to suspicious) after being shown the safety net alert. `False` otherwise. This implies `safety_net_shown`. Again, this value is `True` once any of the up to three radiologists has accepted the safety net alert, and it is only possible when `used_ai_viewer = True`.
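The logical implications among the safety net columns (and `used_ai_viewer`) stated above can serve as a quick consistency check when working with the data. A small sketch on a few illustrative rows (values are made up):

```python
import pandas as pd

# Illustrative rows; the implication chain below follows the column notes:
# safety_net_accepted => safety_net_shown => safety_net_triggered,
# and safety_net_shown is only possible with used_ai_viewer = True
df = pd.DataFrame({
    "used_ai_viewer":       [True,  True,  True,  False],
    "safety_net_triggered": [True,  True,  False, True],
    "safety_net_shown":     [True,  False, False, False],
    "safety_net_accepted":  [True,  False, False, False],
})

# For booleans, a <= b encodes the implication "a implies b"
assert (df["safety_net_accepted"] <= df["safety_net_shown"]).all()
assert (df["safety_net_shown"] <= df["safety_net_triggered"]).all()
assert (df["safety_net_shown"] <= df["used_ai_viewer"]).all()
print("safety net consistency checks passed")
```

Note that `safety_net_triggered` is a model-side flag, so it can also be `True` for control-arm examinations where no alert was ever shown.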
Guide on missing data:
There are three levels of missingness and they are documented differently in the data set:
- `NOT_APPLICABLE`: This value is used when a certain column does not apply to a screening examination, e.g. `invasiveness` can only be reported for actual cancers and does not make sense when the examination is normal. For healthy women, the value in the column is therefore `NOT_APPLICABLE`.
- `CANNOT_BE_DETERMINED`: This value is used when physicians tried to determine a value but failed to do so. For example, `cancer_stage` cannot always be determined when either the size, lymph node status, or metastasis status cannot be determined. In medical documentation, this is typically documented with an `X`.
- `MISSING`: This value is used when the corresponding data entry is empty in the official screening documentation.
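When analyzing the data, it can be convenient to map these sentinel strings to proper missing values while keeping the distinction available where it matters. The helper below is my own sketch (not part of the dataset or study code); the sentinel names match the dataset:

```python
import pandas as pd

# The three sentinel values described in the guide above
SENTINELS = {"NOT_APPLICABLE", "CANNOT_BE_DETERMINED", "MISSING"}

def clean_column(s: pd.Series, keep: frozenset = frozenset()) -> pd.Series:
    """Replace sentinel strings with NA, except those listed in `keep`."""
    return s.mask(s.isin(SENTINELS - keep))

# Illustrative values for `invasive_grade`
grades = pd.Series(["G1", "G3", "NOT_APPLICABLE", "CANNOT_BE_DETERMINED", "G2"])
print(clean_column(grades).dropna().tolist())  # ['G1', 'G3', 'G2']
```

Whether `CANNOT_BE_DETERMINED` should be dropped or treated as its own category depends on the analysis; the `keep` parameter leaves that choice to the caller.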
Using this data set, consumers should be able to verify all results from the accompanying article.
Tables 1 and 2 are descriptive and can be verified directly from the data. Tables 3, 4, and 5 require the study analysis model; please find the code for that below.
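Descriptive statistics of the kind reported in Tables 1 and 2 reduce to simple group-bys over the CSV. A minimal sketch on a tiny synthetic stand-in (the real file name and values will differ; load the actual Dryad CSV with `pd.read_csv` instead):

```python
import pandas as pd

# Tiny synthetic stand-in for the real CSV; column names follow the
# descriptions above, values are illustrative only
df = pd.DataFrame({
    "used_ai_viewer":  [True, True, False, False, True],
    "cancer_detected": [True, False, False, True, False],
    "had_recall":      [True, False, True, True, False],
})

# Cancer detection rate (per 1000 examinations) and recall rate by study arm
by_arm = df.groupby("used_ai_viewer").agg(
    cdr_per_1000=("cancer_detected", lambda s: s.mean() * 1000),
    recall_rate=("had_recall", "mean"),
)
print(by_arm)
```

The same pattern extends to stratified breakdowns, e.g. grouping additionally by `breast_density` or `screening_round`.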
Dataset collection
The dataset was collected as part of Vara's quality assurance under the medical device's post-market surveillance plan, as required by the Medical Device Regulation (the EU's legal framework for medical devices).
In Germany's mammography screening program, screening data has been continuously collected since the program's inception in the mid-2000s in two nationwide databases (depending on the federal state): the “MaSc” software in 11 of the screening units participating in PRAIM and the “MammaSoft” software in one participating screening unit. These databases support documentation of invitation management, diagnosis, and treatment and also facilitate the safe and secure data exchange with reference centers and cancer registries. Vara uses these databases – as well as product-internal data (AI predictions, radiologist decisions, whether Vara was used, etc.) – to continuously collect data in an anonymized quality assurance database.
In clinical practice, this database is used to monitor the medical device and ascertain its continued safety and effectiveness according to its intended use, e.g. distribution shifts of the AI model’s underlying prediction scores, the interaction of radiologists and AI (especially disagreement rates), and false negatives and false positives of both radiologists and the AI. This is also the database used for this study.
A few screening examinations had to be excluded from the dataset for reasons detailed in Figure 1 in the article. Apart from that, there was no filtering and the study had no exclusion criteria.
Code/Software
The software used to reproduce the results from the article is available on Zenodo: 10.5281/zenodo.10822135.
Acknowledgements
We sincerely thank Danalyn Byng for her early contributions in initiating the PRAIM study, especially on the study protocol and site onboarding. We are grateful for the support of the recruiting screening sites (Mittelrhein, Niedersachsen-Süd-West, Niedersachsen-Nord, Hannover, Herford/Minden-Lübbecke, Steinfurt, Köln rechtsrheinisch / Leverkusen, Wuppertal / Remscheid / Solingen / Mettmann, Niedersachsen-Mitte, Südwestliches Schleswig-Holstein, Wiesbaden, Niedersachsen-Nordwest). We are grateful for the invaluable technical support and data processing work provided by Dominik Schüler and Benjamin Strauch. Finally, we thank all women who contributed data to the study.