
Harnessing clinical annotations to improve deep learning performance in prostate segmentation


Sarma, Karthik V. et al. (2021), Harnessing clinical annotations to improve deep learning performance in prostate segmentation, Dryad, Dataset,



Purpose

Developing large-scale datasets with research-quality annotations is challenging because of the high cost of refining clinically generated markup into high-precision annotations. We evaluated the direct use of a large dataset with only clinically generated annotations in the development of high-performance segmentation models for small research-quality challenge datasets.

Materials and methods

We used a large retrospective dataset from our institution comprising 1,620 clinically generated segmentations, and two challenge datasets (PROMISE12: 50 patients, ProstateX-2: 99 patients). We trained a 3D U-Net convolutional neural network (CNN) segmentation model using our entire dataset, and used that model as a template to train models on the challenge datasets. We also trained versions of the template model using ablated proportions of our dataset, and evaluated the relative benefit of those templates for the final models. Finally, we trained a version of the template model using an out-of-domain brain cancer dataset, and evaluated the relative benefit of that template for the final models. We used five-fold cross-validation (CV) for all training and evaluation across our entire dataset.
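The five-fold cross-validation and the ablated training proportions described above can be illustrated with a minimal Python sketch. The function names `five_fold_splits` and `ablate`, the random seed, and the random sampling strategy are all assumptions made for illustration; this page does not describe how the folds or the ablated subsets were actually drawn.

```python
import random

def five_fold_splits(case_ids, n_folds=5, seed=0):
    """Yield (train_ids, val_ids) pairs; each case is held out exactly once.

    Hypothetical helper: the fold assignment here is a seeded shuffle,
    which is an assumption, not the procedure used in the paper.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield train, val

def ablate(train_ids, proportion, seed=0):
    """Keep a random subset of the training cases (e.g. proportion=0.05).

    Hypothetical helper: random sampling is assumed for illustration.
    """
    ids = list(train_ids)
    keep = max(1, round(len(ids) * proportion))
    return random.Random(seed).sample(ids, keep)

# 1,620 cases, as in the institutional dataset described above.
splits = list(five_fold_splits(range(1620)))
first_train, first_val = splits[0]
small_train = ablate(first_train, 0.05)
```

With 1,620 cases, each fold holds out 324 cases and trains on the remaining 1,296, so a 5% ablation in this sketch keeps 65 training cases per fold.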


Results

Our model achieved state-of-the-art performance on our large dataset (mean overall Dice 0.916, average Hausdorff distance 0.135 across CV folds). Using this model as a pre-trained template for refining on two external datasets significantly enhanced performance (30% and 49% enhancement in Dice scores, respectively). Mean overall Dice and mean average Hausdorff distance were 0.912 and 0.15 for the ProstateX-2 dataset, and 0.852 and 0.581 for the PROMISE12 dataset. Using even small quantities of data to train the template enhanced performance, with significant improvements using 5% or more of the data.
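For readers unfamiliar with the Dice metric reported above, the following is a minimal sketch of one common soft Dice formulation, 2·Σ(p·g) / (Σp + Σg), computed over flattened voxel values. The function name `soft_dice` and the smoothing term `eps` are assumptions; several variants of soft Dice exist (for example with squared terms in the denominator), and this page does not state which one the authors used.

```python
def soft_dice(pred, target, eps=1e-7):
    """Soft Dice between predicted probabilities and a binary mask.

    Both arguments are flat sequences of equal length (one value per voxel).
    eps is an assumed smoothing term that avoids division by zero when both
    inputs are empty; it is not taken from the paper.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (denominator + eps)

perfect = soft_dice([1, 0, 1, 0], [1, 0, 1, 0])   # identical masks
disjoint = soft_dice([1, 1, 0, 0], [0, 0, 1, 1])  # no overlap
partial = soft_dice([0.5, 0.5], [1, 0])           # uncertain prediction
```

Identical masks score close to 1, non-overlapping masks score close to 0, and probabilistic predictions fall in between.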


Conclusion

We trained a state-of-the-art model using unrefined clinical prostate annotations and found that its use as a template model significantly improved performance in other prostate segmentation tasks, even when trained with only 5% of the original dataset.

Usage Notes

This JSON-formatted file is a dictionary of results for the experiments in the paper linked to this dataset, including overall soft Dice coefficients, region soft Dice coefficients, and average Hausdorff distances.

A dictionary entry exists for every trained and evaluated model in the experiment set. Each model entry has the following structure:

model_entry = {
    "overall_dice_scores": Soft Dice coefficient over the whole volume,
    "overall_AHDs": Average Hausdorff distance for the volume,
    "region_dice_scores": Soft Dice coefficient for each region in the image
}

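As a rough illustration of how a single model entry might be summarized, the sketch below averages the overall Dice scores and average Hausdorff distances. It assumes that `overall_dice_scores` and `overall_AHDs` each hold a list of per-volume numbers; the exact value types are not specified on this page, so check the file before relying on this. The helper name `summarize_entry` and the example values are hypothetical.

```python
from statistics import mean

def summarize_entry(entry):
    """Return the mean overall Dice and mean AHD for one model entry.

    Assumes both fields are non-empty lists of numbers (one per volume).
    """
    return {
        "mean_overall_dice": mean(entry["overall_dice_scores"]),
        "mean_overall_AHD": mean(entry["overall_AHDs"]),
    }

# Made-up values for illustration only; not taken from the dataset.
example_entry = {
    "overall_dice_scores": [0.90, 0.92, 0.94],
    "overall_AHDs": [0.10, 0.20, 0.30],
    "region_dice_scores": {},
}
summary = summarize_entry(example_entry)
```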
The following is a general outline of the dataset:

dataset = {
    "location_map": A mapping of region IDs to the name of the region,
    "baseline_models": Experimental results from Table 1,
    "retargeted_models": {
        "unrefined": Experimental results from Table 2 (no refining),
        "refined": Experimental results from Table 2 (with refining)
    },
    "ablation_models": Experimental results from the ablation models at every data proportion (Table 3),
    "brats_models": Experimental results from Table 4
}

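The outline above does not say how deeply the model entries are nested under each top-level key, so one cautious way to explore the file is to walk it recursively and collect every dictionary that contains an `overall_dice_scores` key. The sketch below does this with the standard `json` module. The helper names `load_results` and `iter_model_entries`, the file name `results.json` in the usage note, and the toy data are all hypothetical.

```python
import json

def load_results(json_path):
    """Load the results dictionary from a JSON file."""
    with open(json_path) as f:
        return json.load(f)

def iter_model_entries(node, path=()):
    """Yield (path, entry) for every dict that looks like a model entry.

    A dict is treated as a model entry if it has an "overall_dice_scores"
    key; this detection rule is an assumption based on the outline above.
    """
    if isinstance(node, dict):
        if "overall_dice_scores" in node:
            yield path, node
        else:
            for key, child in node.items():
                yield from iter_model_entries(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_model_entries(child, path + (index,))

# Toy stand-in for the real file; names and values are made up.
entry = {"overall_dice_scores": [0.9], "overall_AHDs": [0.2], "region_dice_scores": {}}
toy = {
    "location_map": {"0": "region A"},
    "baseline_models": {"model_x": dict(entry)},
    "retargeted_models": {"unrefined": dict(entry), "refined": dict(entry)},
}
paths = [path for path, _ in iter_model_entries(toy)]
```

On the real data, the same walk would be started with something like `iter_model_entries(load_results("results.json"))`, where `results.json` stands in for wherever the downloaded file was saved.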

Funding

National Cancer Institute, Award: F30CA210329

National Institute of General Medical Sciences, Award: GM08042

National Cancer Institute, Award: R21CA220352

National Cancer Institute, Award: P50CA092131

National Cancer Institute, Award: R01CA195505

National Cancer Institute, Award: R01CA158627

National Cancer Institute, Award: HHSN261200800001E

UCLA-Caltech Medical Scientist Training Program

NIH, Award: Intramural Research Program