Hi Sulaiman,
Based on the results you shared, nnU-Net overall performed the best in cross-validation for most of the settings (supervised and semi-supervised). However, on the public leaderboard, U-Net (semi-supervised) achieved the best performance. Is there any explanation for this?
According to the Open Development Phase - Validation and Tuning Leaderboard, as of now, the baseline nnDetection (semi-supervised) appears to be the overall best model (considering both detection and diagnosis performance). It's true that the baseline U-Net (semi-supervised) seems to marginally outperform the baseline nnU-Net (semi-supervised) on that leaderboard. However, it's worth considering that this ranking was computed using only 100 cases. On the final hidden testing cohort of 1000 cases (including cases from an unseen external center), these models may rank in a completely different order. Perhaps it’s best to look at performance across both sets of data (cross-validation using 1500 cases + held-out validation on the leaderboard using 100 cases) to inform your model development cycle, but we leave such decisions completely up to the participants.
Observing a substantial difference in performance between 5-fold cross-validation metrics using the training dataset of 1500 cases, and performance metrics on the leaderboard using the hidden validation cohort of 100 cases, is to be expected. We believe this is due to the factors discussed here.
Also, I tried to reproduce the baseline results locally for both the semi-supervised U-Net and the semi-supervised nnU-Net; however, the cross-validation performance is far lower than your reported results.
Not sure if we are missing something here.
Assuming that you're using the same number of cases [1295 cases with human annotations (supervised), or 1295 cases with human annotations + 205 cases with AI annotations (semi-supervised)], preprocessed the same way, and that you have trained (default command, with the same number of epochs, data augmentation, hyperparameters, model selection, etc.), 5-fold cross-validated (using the same splits as provided) and ensembled (using member models from all 5 folds) the baseline AI models exactly as indicated in the latest iteration of picai_baseline, then your performance during cross-validation and on the leaderboard should be similar to ours.
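For reference, the final ensembling step (combining the member models from all 5 folds) typically amounts to averaging the detection/softmax maps that the fold-wise models predict for each case. Below is a minimal sketch of that idea, assuming per-fold predictions are stored as image files; the file layout, paths and case ID are hypothetical and this is not the actual picai_baseline implementation:

```python
import numpy as np
import SimpleITK as sitk  # assumption: fold-wise predictions stored as .nii.gz detection maps


def ensemble_case(fold_prediction_paths):
    """Average the detection maps predicted by the member models of all 5 folds."""
    maps = []
    for path in fold_prediction_paths:
        pred = sitk.GetArrayFromImage(sitk.ReadImage(str(path)))
        maps.append(pred.astype(np.float32))
    return np.mean(maps, axis=0)  # simple mean ensemble across folds


# hypothetical usage: one prediction per fold for the same case
case_id = "10000_1000000"
paths = [f"workdir/fold_{i}/predictions/{case_id}.nii.gz" for i in range(5)]
ensembled_map = ensemble_case(paths)
```

If your local pipeline skips this step (e.g., evaluating a single fold's model instead of the 5-fold ensemble), that alone can account for a noticeable drop relative to the reported results.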
Deviations may still exist owing to the stochasticity of optimizing deep learning models at train-time, due to which the same AI architecture, trained on the same data for the same number of training steps, can exhibit slightly different performance each time (Frankle et al., 2019). Some AI models and training methods are more susceptible to this form of performance instability across training runs than others (this also depends on the task, dataset and supervision/annotations, of course). As we have only trained a single instance of each of our baseline AI models thus far, it's difficult to comment specifically on their expected variance in performance across training runs. During our final performance estimation of the top 5 teams' algorithms on the hidden testing cohort of 1000 cases (to determine the winner of the challenge), this factor will be accounted for [as detailed in Item 28 (pg 15) of our study protocol].
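To make that concrete: if one were to retrain the same baseline configuration several times with different random seeds, the run-to-run variability could be summarized as the mean ± standard deviation of the chosen metric across runs. A minimal sketch with made-up scores for illustration only:

```python
import numpy as np

# hypothetical metric values (e.g., AP or AUROC) from repeated training runs of the
# same configuration: same data, same hyperparameters, different random seeds
scores_per_run = np.array([0.64, 0.61, 0.66, 0.63, 0.62])

mean_score = scores_per_run.mean()
std_score = scores_per_run.std(ddof=1)  # sample standard deviation across runs

print(f"metric across runs: {mean_score:.3f} +/- {std_score:.3f}")
```

A single retrained instance falling within a range like this around the reported value would be consistent with ordinary train-time stochasticity rather than a setup issue.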
Hope this helps.