Hi Yuan,
I'm busy collecting all cross-validation results for the supervised and semi-supervised nnU-Net and nnDetection models. For nnU-Net I'm collecting the performance every 50 epochs, so this takes a bit of time to compute. I will post this soon (should be this week) to the forum (in this thread). I don't have the UNet cross-validation metrics on hand, but I'll include those in the future post.
As for the semi-supervised nnU-Net baseline, there are some considerations to be made. If you follow the instructions of `picai_eval`, then the default lesion candidate extraction method is `dynamic-fast`, which is considerably quicker than the `dynamic` extraction method and gives nearly the same, though typically slightly lower, performance. In our baseline submissions, we opted for the `dynamic` extraction method, as that is what we suggest for fully trained models, while we recommend `dynamic-fast` for evaluation during model training.
Additionally, we noticed that some models generated predictions far outside of the prostate region, which are obviously nonsensical. In our latest update of the PI-CAI baselines, we addressed this issue by cropping predictions to a central region of 81 x 192 x 192 mm (which is much larger than a prostate, and cuts off e.g. the legs and the air outside the patient).
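To make the cropping step concrete, here is a minimal sketch of how the voxel bounds of such a central region could be computed. This is not the code from the PI-CAI baselines: the function name `central_region_slices`, the (z, y, x) axis order, and the example shape and voxel spacing are all assumptions for illustration.

```python
def central_region_slices(shape, spacing_mm, region_mm=(81.0, 192.0, 192.0)):
    """Return one slice per axis covering a centred region of `region_mm`.

    Assumptions (hypothetical, not taken from the baseline code):
    - axes are ordered (z, y, x) and `spacing_mm` uses the same order;
    - the region is centred on the volume centre;
    - if the volume is smaller than the region along an axis, the full
      axis is kept.
    """
    slices = []
    for size, spacing, extent in zip(shape, spacing_mm, region_mm):
        # Number of voxels spanning the physical extent, capped at the axis size.
        n_keep = min(size, int(round(extent / spacing)))
        start = (size - n_keep) // 2
        slices.append(slice(start, start + n_keep))
    return tuple(slices)


# Example with an assumed volume of 40 x 640 x 640 voxels at 3.0 x 0.5 x 0.5 mm.
region = central_region_slices(shape=(40, 640, 640), spacing_mm=(3.0, 0.5, 0.5))
print(region)
```

With a NumPy array `pred` holding the prediction, one way to apply this is to allocate an all-zero array of the same shape and copy only `pred[region]` into it, so everything outside the central region becomes zero.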
So, with these considerations, we evaluate our models using the `dynamic` extraction method and set predictions outside the central region of 81 x 192 x 192 mm to zero. Then, for the checkpoint `model_best` (as shared in the supervised and semi-supervised baselines), we observed the following metrics:
Ranking score | AUROC | AP
Fold 0: 0.5787 | 0.8185 | 0.3388
Fold 1: 0.7067 | 0.8707 | 0.5427
Fold 2: 0.6357 | 0.8186 | 0.4529
Fold 3: 0.5708 | 0.8068 | 0.3348
Fold 4: 0.6751 | 0.8596 | 0.4906
Mean: 0.6334 | 0.8348 | 0.4320
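As a sanity check on the table, the PI-CAI ranking score is the average of AUROC and AP, and the "Mean" row is the average over the five folds. The sketch below recomputes both from the rounded values listed above; the variable names are only illustrative, and small differences in the fourth decimal are expected because the reported ranking scores were presumably computed from unrounded metrics.

```python
# Per-fold (AUROC, AP) and reported ranking score, copied from the table above.
folds = {
    0: {"auroc": 0.8185, "ap": 0.3388, "ranking": 0.5787},
    1: {"auroc": 0.8707, "ap": 0.5427, "ranking": 0.7067},
    2: {"auroc": 0.8186, "ap": 0.4529, "ranking": 0.6357},
    3: {"auroc": 0.8068, "ap": 0.3348, "ranking": 0.5708},
    4: {"auroc": 0.8596, "ap": 0.4906, "ranking": 0.6751},
}

# Ranking score per fold: mean of AUROC and AP.
recomputed = {k: (m["auroc"] + m["ap"]) / 2 for k, m in folds.items()}

# Largest deviation between recomputed and reported ranking scores.
max_diff = max(abs(recomputed[k] - folds[k]["ranking"]) for k in folds)

# Cross-validation means over the five folds.
mean_ranking = sum(m["ranking"] for m in folds.values()) / len(folds)
mean_auroc = sum(m["auroc"] for m in folds.values()) / len(folds)
mean_ap = sum(m["ap"] for m in folds.values()) / len(folds)

print(f"max deviation: {max_diff:.5f}")
print(f"means: {mean_ranking:.4f} | {mean_auroc:.4f} | {mean_ap:.4f}")
```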
Did you evaluate the semi-supervised models against the manually annotated cases only, or did you include the AI-derived annotations as well?
Once we're sure we are using the exact same evaluation strategy, we should obtain the same metrics as well.
Kind regards,
Joeran