Question about validating the pre-trained nnUNet  

  By: Doris on Aug. 31, 2022, 3:57 a.m.

Hi, I'm trying to validate the pre-trained nnUNet baseline, but when I run the evaluation following https://github.com/DIAGNijmegen/picaibaseline/blob/main/nnunetbaseline.md#nnu-net---evaluation, I get a dimension mismatch error:

Error for 10032_1000032: boolean index did not match indexed array along dimension 1; dimension is 367 but corresponding boolean dimension is 384

I noticed that the size of the data in the npz file is the size after cropping. Do I need to add another post-processing step between https://github.com/DIAGNijmegen/picaibaseline/blob/main/nnunetbaseline.md#nnu-net---inference and https://github.com/DIAGNijmegen/picaibaseline/blob/main/nnunetbaseline.md#nnu-net---evaluation?

Kind regards, Yuan

Re: Question about validating the pre-trained nnUNet  

  By: joeran.bosma on Aug. 31, 2022, 7:25 a.m.

Hi Yuan,

Thanks for raising this issue! You are indeed correct: the documentation is incomplete, and evaluation requires an additional step. This issue was circumvented using a hot-fix in the (semi-supervised) nnUNet eval.py. Unfortunately, I recently discovered that this hot-fix is invalid (although often close). I'm working on a fix for this, and expect to have it ready today or tomorrow. I'll update you here!

Kind regards, Joeran

Re: Question about validating the pre-trained nnUNet  

  By: Doris on Sept. 1, 2022, 1:21 a.m.

Okay, thanks for your hard work on this project!

Bests, Yuan

Re: Question about validating the pre-trained nnUNet  

  By: joeran.bosma on Sept. 1, 2022, 12:50 p.m.

Hi Yuan,

I've been able to resolve the issue with nnU-Net inference, and have updated the picai_baseline evaluation script and documentation accordingly!

The issue affected cases that were cropped asymmetrically by nnU-Net; for those cases, the lesion prediction would be shifted. The case you mentioned is a good example of that happening. In the figure below, you can see the correctly converted prediction in the top-right panel (which aligns well with the ADC dark spot) and the incorrectly shifted prediction in the lower-center panel.
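
Conceptually, the conversion places the cropped prediction back into the original image grid before comparing it with the annotation. Below is a minimal sketch of that idea, assuming the nnU-Net (v1) properties pickle provides 'crop_bbox' and 'original_size_of_raw_data'; this is an illustration only, not the exact code of the updated picai_baseline script, and it ignores any resampling back to the original spacing that may also be required:

```python
import pickle

import numpy as np

# Softmax prediction saved by nnU-Net (classes x D x H x W, in the cropped space)
# and the accompanying properties pickle for the same case.
softmax = np.load("10032_1000032.npz")["softmax"]
with open("10032_1000032.pkl", "rb") as fp:
    properties = pickle.load(fp)

pred = softmax[1]  # foreground (csPCa) channel

# Place the cropped prediction back into the original image grid, so that it
# lines up with the annotation (which is stored at the original size, e.g. 384
# voxels along the axis that triggered the dimension mismatch).
original_size = [int(s) for s in properties["original_size_of_raw_data"]]
crop_bbox = properties["crop_bbox"]  # [[z0, z1], [y0, y1], [x0, x1]]

restored = np.zeros(original_size, dtype=pred.dtype)
slices = tuple(slice(int(lo), int(hi)) for lo, hi in crop_bbox)
restored[slices] = pred
```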

Please let me know if you run into any issues when using the updated script and documentation.

Kind regards, Joeran

Re: Question about validating the pre-trained nnUNet  

  By: Doris on Sept. 5, 2022, 1:32 p.m.

Hi Joeran,

Thank you for helping me deal with this. Following the new guidance, I ran the cross-validation of the semi-supervised nnUNet. However, I noticed a big gap between the semi-supervised U-Net and the semi-supervised nnUNet, even though their public test results are very close. I am not sure why this happens; logically, the cross-validation results of these two models should also be close. Below are my results. Are they the same as yours? If not, what do you think might be wrong?

U-Net semi-supervised (1500)

Ranking score | AUROC | AP
Fold 0: 0.527 | 0.751 | 0.303
Fold 1: 0.526 | 0.719 | 0.334
Fold 2: 0.491 | 0.721 | 0.262
Fold 3: 0.443 | 0.685 | 0.201
Fold 4: 0.478 | 0.677 | 0.278
Average: 0.493 | 0.711 | 0.276

nnUNet semi-supervised (1500)

Ranking score | AUROC | AP
Fold 0: 0.660 | 0.826 | 0.495
Fold 1: 0.686 | 0.838 | 0.534
Fold 2: 0.690 | 0.817 | 0.562
Fold 3: 0.609 | 0.802 | 0.415
Fold 4: 0.723 | 0.858 | 0.589
Average: 0.674 | 0.828 | 0.519

Kind regards, Yuan

Re: Question about validating the pre-trained nnUNet  

  By: Doris on Sept. 6, 2022, 7:41 a.m.

Hi Joeran,

Based on my previous results, I generated the detection maps of the semi-supervised U-Net and then re-evaluated them with picai_eval (https://github.com/DIAGNijmegen/picai_eval/tree/evaluate-softmax-from-command-line). I get the following average result:

AUROC: 79.73%, AP: 47.77%

This is much closer to the semi-supervised nnUNet result and the public test result. I suspect there is an error in the cross-validation evaluation of the U-Net baseline itself. Could you please check the related code? Thanks a lot.

Kind regards, Yuan

Re: Question about validating the pre-trained nnUNet  

  By: joeran.bosma on Sept. 6, 2022, 12:52 p.m.

Hi Yuan,

I'm busy collecting all cross-validation results for the supervised and semi-supervised nnU-Net and nnDetection models. For nnU-Net I'm collecting the performance every 50 epochs, so this takes a bit of time to compute. I will post this soon (should be this week) to the forum (in this thread). I don't have the UNet cross-validation metrics on hand, but I'll include those in the future post.

As for the semi-supervised nnU-Net baseline, there are some considerations to be made there. If you follow the instructions of picai_eval, then the default lesion candidate extraction method is dynamic-fast, which is considerably quicker than the extraction method dynamic and has almost equal performance (but typically a bit lower). In our baseline submissions, we opted for the dynamic extraction method, as that is what we would suggest for fully trained models, while we recommend dynamic-fast for evaluation during model training.
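
For example, selecting the dynamic extraction method when evaluating softmax predictions could look roughly like this (a minimal sketch; please check the picai_eval and report_guided_annotation documentation for the exact argument names):

```python
from picai_eval import evaluate
from report_guided_annotation import extract_lesion_candidates

# y_pred: list of softmax / heatmap volumes (numpy arrays)
# y_true: list of binary ground-truth annotations (numpy arrays)
metrics = evaluate(
    y_det=y_pred,
    y_true=y_true,
    # use the slower "dynamic" extraction instead of the default "dynamic-fast"
    y_det_postprocess_func=lambda pred: extract_lesion_candidates(pred, threshold="dynamic")[0],
)
print(metrics.auroc, metrics.AP, metrics.score)
```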

Additionally, we noticed that some models generated predictions far outside of the prostate region, which are obviously nonsense predictions. In our latest update of the PI-CAI baselines, we addressed this issue by cropping predictions to a central region of 81 x 192 x 192 mm (which is much larger than the prostate, and cuts off e.g. the legs and air outside the patient).
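
As an illustration of that cropping step, zeroing out predictions outside a central region could be done along these lines (a sketch with a hypothetical helper, assuming the prediction is a numpy array with known voxel spacing in mm; the updated baseline may implement this differently):

```python
import numpy as np

def crop_to_central_region(pred, spacing_mm, region_mm=(81, 192, 192)):
    """Zero out predictions outside a central region of the given physical size (mm)."""
    cropped = np.zeros_like(pred)
    slices = []
    for size_vox, spacing, size_mm in zip(pred.shape, spacing_mm, region_mm):
        half_vox = int(round(size_mm / spacing / 2))
        center = size_vox // 2
        slices.append(slice(max(center - half_vox, 0), min(center + half_vox, size_vox)))
    # copy only the central region, leaving everything outside at zero
    cropped[tuple(slices)] = pred[tuple(slices)]
    return cropped
```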

So, with these considerations, we evaluate our models using the dynamic extraction method and set predictions outside the central region of 81 x 192 x 192 mm to zero. Then, for the checkpoint model_best (as shared in the supervised and semi-supervised baselines), we observed the following metrics:

Ranking score | AUROC | AP
Fold 0: 0.5787 | 0.8185 | 0.3388
Fold 1: 0.7067 | 0.8707 | 0.5427
Fold 2: 0.6357 | 0.8186 | 0.4529
Fold 3: 0.5708 | 0.8068 | 0.3348
Fold 4: 0.6751 | 0.8596 | 0.4906
Mean: 0.6334 | 0.8348 | 0.4320

Did you evaluate the semi-supervised models against the manually annotated cases only, or did you include the AI-derived annotations as well?

Once we're sure we employed the exact same evaluation strategy, we should get the exact same performances as well.

Kind regards, Joeran

Re: Question about validating the pre-trained nnUNet  

  By: Doris on Sept. 6, 2022, 4:58 p.m.

Dear Joeran,

Thanks for your help, and sorry for the trouble. For my evaluation, I used the old baseline (I didn't notice there is a new one). When I evaluated the semi-supervised models, I included the AI-derived annotations as well. Should I exclude them during evaluation? That seems better, since I noticed there are some errors in the AI-derived annotations. I also noticed that models sometimes generate predictions far outside the prostate region, so it's good to have a way to solve this problem.

I will try to follow your approach to check the semi-supervised nnUNet results. For the U-Net model, I will check the code first, since the evaluation results are inconsistent while the detection maps seem right. I will let you know if I solve the problem.

Kind regards, Yuan

Re: Question about validating the pre-trained nnUNet  

  By: joeran.bosma on Sept. 8, 2022, 9:25 a.m.

Hi Yuan,

Don't worry about it! Happy to cross-check the evaluation of the baseline models.

I would indeed do the model evaluation using human-annotated cases only, particularly for lesion-based evaluation. This ensures you don't get confirmation bias from incorrect AI-derived annotations. For case-based evaluation, the full set of 1500 cases can be used, as the case-based labels were determined manually.
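
For example, restricting the evaluation to the human-annotated subset could look roughly like this (a sketch assuming the picai_labels folder layout with csPCa_lesion_delineations/human_expert and the evaluate_folder/subject_list interface of picai_eval; adapt the paths to your setup):

```python
from pathlib import Path

from picai_eval import evaluate_folder

# Hypothetical paths; adapt to where your detection maps and picai_labels live.
pred_dir = Path("predictions")  # <case_id>.nii.gz csPCa detection maps
human_label_dir = Path("picai_labels/csPCa_lesion_delineations/human_expert/resampled")

# Keep only the cases for which a human expert delineation exists.
subject_list = sorted(
    path.name[: -len(".nii.gz")]
    for path in human_label_dir.glob("*.nii.gz")
    if (pred_dir / path.name).exists()
)

metrics = evaluate_folder(
    y_det_dir=str(pred_dir),
    y_true_dir=str(human_label_dir),
    subject_list=subject_list,
)
print(metrics.auroc, metrics.AP, metrics.score)
```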

I have now posted all metrics in the forum: #3273.

Hope this helps, Joeran


Re: Question about validating the pre-trained nnUNet  

  By: Doris on Sept. 12, 2022, 12:50 a.m.

Hi Joeran,

Thanks a lot for your excellent work. It's quite helpful!

Bests, Yuan

Re: Question about validating the pre-trained nnUNet  

  By: svesal on Sept. 20, 2022, 6:53 p.m.

Hi Joeran,

Thank you for all the hard work.

I was wondering if there is an eval.py for the U-Net or semi-supervised U-Net model for local cross-validation and exporting detection maps, similar to the nnUNet eval.py?

The reason I ask is that the gap between our CV and LB scores is quite large, and I am now wondering whether our local CV evaluation is incorrect.

AUROC | AP | Ranking score
Fold 0: 0.719 | 0.282 | 0.501
Fold 1: 0.715 | 0.274 | 0.495
Fold 2: 0.694 | 0.306 | 0.500
Fold 3: 0.685 | 0.213 | 0.449
Fold 4: 0.705 | 0.316 | 0.510
CV avg: 0.704 | 0.278 | 0.491
LB avg: 0.808 | 0.610 | 0.709

Best, Sulaiman

Re: Question about validating the pre-trained nnUNet  

  By: anindo on Sept. 21, 2022, 10:04 a.m.

Hi Sulaiman,

Unfortunately, when using the U-Net baseline, csPCa detection maps and case-level csPCa likelihood scores for the training/validation folds are not automatically generated or stored at the end of training, and we have no plans to add this functionality (akin to the inference and evaluation scripts of the nnU-Net baseline) anytime soon.

If you wish to generate and store these predictions for your trained U-Net model, you can refer to this script. You can adapt it into your own Python script or Jupyter notebook that, given any bpMRI exam, initializes the U-Net architecture, loads your trained model weights, and then generates and saves all output prediction files. Needless to say, to estimate cross-validation performance, you should generate predictions using the individual member models (instead of the full 5-member ensemble) on their respective validation folds. You can then use picai_eval to evaluate those csPCa detection maps, as sketched below.
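
Once the per-fold detection maps are stored on disk, the cross-validation evaluation itself could look roughly like this (a sketch assuming a hypothetical folder layout with one detection-map folder per validation fold; argument names follow the public picai_eval API as far as I know):

```python
from picai_eval import evaluate_folder

fold_metrics = []
for fold in range(5):
    # Detection maps generated by this fold's member model, for that fold's
    # validation cases only (hypothetical folder layout).
    metrics = evaluate_folder(
        y_det_dir=f"workdir/unet/fold_{fold}/detection_maps",
        y_true_dir="picai_labels/csPCa_lesion_delineations/human_expert/resampled",
        # if needed, pass subject_list=<validation case ids of this fold>
    )
    fold_metrics.append(metrics)
    print(f"Fold {fold}: AUROC={metrics.auroc:.3f}, AP={metrics.AP:.3f}, score={metrics.score:.3f}")
```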

As discussed before, observing a substantial difference in performance between 5-fold cross-validation metrics using the training dataset of 1500 cases, and performance metrics on the leaderboard using the hidden validation cohort of 100 cases, is to be expected. We believe this is due to the factors discussed here.

Hope this helps.