Difficulties directly using the pretrained models  

  By: Anamana on Aug. 5, 2022, 12:23 p.m.

Hi,

I want to experiment with postprocessing and tuning, and would like to save myself the time and resources of training by using the pretrained models, as recommended here: https://grand-challenge.org/forums/forum/pi-cai-607/topic/data-preparation-and-training-resources-844/. Ideally, I would like to reuse the validation splits for quick validation during experimentation (so that I can, e.g., evaluate changes with the model from fold 0 on the validation set of fold 0).
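
For reference, I read the per-fold validation cases from nnU-Net's splits file, roughly like this (the path and task name below are placeholders):

    # Sketch: read nnU-Net's (v1) cross-validation splits to get the validation
    # cases of a given fold. Path and task name are placeholders.
    import pickle
    from pathlib import Path

    preprocessed_dir = Path("nnUNet_preprocessed/Task2203_picai_baseline")  # placeholder
    with open(preprocessed_dir / "splits_final.pkl", "rb") as f:
        splits = pickle.load(f)  # one entry per fold: {'train': [...], 'val': [...]}

    fold = 0
    val_cases = splits[fold]["val"]
    print(f"fold {fold}: {len(val_cases)} validation cases")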

In order to do this, I prepared the data for the semi-supervised nnU-Net using the picai_prep code, then planned and preprocessed it. Next, I copied and renamed the pretrained models into their respective folds (roughly as in the sketch below the results), copied the plans into the designated folder, and reran the validation per fold. Evaluating 'validation_raw_postprocessed', further postprocessed with extract_lesion_candidates, with the evaluation code from the nnU-Net baseline instructions, I got the following results:

AUROCs: [0.72, 0.71, 0.71, 0.68, 0.78], APs: [0.17, 0.22, 0.12, 0.14, 0.24]
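
The copy/rename step mentioned above looked roughly like this; all paths, the trainer/plans names, and the checkpoint filenames are assumptions/placeholders for whatever the pretrained download actually contains:

    # Sketch: place pretrained per-fold checkpoints into the directory layout
    # that nnU-Net (v1) expects under its results folder. All names below are
    # assumptions/placeholders.
    import shutil
    from pathlib import Path

    pretrained_dir = Path("/path/to/pretrained_weights")  # placeholder
    results_dir = Path(
        "nnUNet_results/nnUNet/3d_fullres/Task2203_picai_baseline/"
        "nnUNetTrainerV2_Loss_FL_and_CE_checkpoints__nnUNetPlansv2.1"  # assumed trainer/plans names
    )
    results_dir.mkdir(parents=True, exist_ok=True)

    # the plans file is expected at the trainer level, next to the fold directories (assumption)
    shutil.copy(pretrained_dir / "plans.pkl", results_dir / "plans.pkl")

    for fold in range(5):
        fold_dir = results_dir / f"fold_{fold}"
        fold_dir.mkdir(parents=True, exist_ok=True)
        # assumed source filename pattern; adjust to the actual pretrained download
        src = pretrained_dir / f"fold_{fold}" / "model_final_checkpoint.model"
        shutil.copy(src, fold_dir / src.name)
        shutil.copy(src.parent / (src.name + ".pkl"), fold_dir / (src.name + ".pkl"))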

On the leaderboard it says AUROC: 0.820 and AP: 0.608. As my validation results are much worse, I wonder if something went wrong (I also tried similar steps for fold 0 of nnDetection and got an AUROC of 0.77, which likewise seems far removed from the 0.885 on the leaderboard).

I assume a gap this large between the ensemble on the hidden set and the separate models on their validation sets isn't expected? If it isn't, do you see a mistake in what I did, or would you recommend a different way of using the pretrained models that still allows evaluating them easily?

Many thanks

Re: Difficulties directly using the pretrained models  

  By: anindo on Aug. 6, 2022, 10:36 a.m.

Hi Demster,

Indeed, at our end, we’ve also noticed substantial differences in performance between the ensemble on the Hidden Validation and Tuning Cohort, and the separate models on the five internal (cross-)validation folds using the Public Training and Development Set. We believe a number of nuanced factors are worth considering when interpreting this difference in performance:

  • Imaging sequences for all cases in the Hidden Validation and Tuning Cohort are co-registered, and the overall prevalence of csPCa in this dataset is enriched. Hence, performance here may be more representative of performance on the Hidden Testing Cohort, where imaging sequences will also be co-registered and the overall prevalence of csPCa is enriched as well. Note that imaging for several cases in the Public Training and Development Set (used for the cross-validation folds) is not co-registered, which can lead to inaccurate predictions because of the misalignment itself (which won’t be there during the final evaluation on the Hidden Testing Cohort) rather than the model’s predictive capacity.

  • An ensemble of five member models is used for the Hidden Validation and Tuning Cohort, whereas only a single member model is used for each validation fold during cross-validation (see the sketch after this list for what such an ensemble roughly amounts to).

  • Each validation fold (300 cases) of five-fold cross-validation contains 3x as many cases as the entire Hidden Validation and Tuning Cohort (100 cases). Across all validation folds (1500 cases), 15x as many cases are used to estimate performance during internal cross-validation as for validation on the Hidden Validation and Tuning Cohort. Hence, one may argue that performance on internal five-fold cross-validation is actually more representative of true performance than performance on the Hidden Validation and Tuning Cohort (or the Open Development Phase – Validation and Tuning Leaderboard). Note that during the final evaluation on the Hidden Testing Cohort, 1000 held-out cases will be used.
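
To make the second point concrete, such an ensemble roughly amounts to a voxel-wise mean of the per-fold softmax maps; a minimal sketch, where the file layout, the 'softmax' key, and the lesion channel index are assumptions about nnU-Net's (v1) saved predictions:

    # Sketch: simple mean ensemble of the five fold models' softmax predictions
    # for one case. Paths, the 'softmax' key, and the channel index are assumptions.
    import numpy as np

    case_id = "10000_1000000"  # placeholder case identifier

    fold_softmax = []
    for fold in range(5):
        npz = np.load(f"predictions/fold_{fold}/{case_id}.npz")  # placeholder path
        fold_softmax.append(npz["softmax"][1])  # channel 1 = csPCa probability (assumption)

    ensembled = np.mean(fold_softmax, axis=0)  # voxel-wise mean over the five member models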

For all these reasons, performance metrics cannot be compared directly across evaluation sets (cross-validation using the Public Training and Development Set, and the Open Development Phase – Validation and Tuning Leaderboard). Perhaps it’s best to look at performance across both sets of data to inform your model development cycle, but we leave these decisions to the participants.

As for the exact AUROCs and APs that you obtained for the nnU-Net (semi-supervised), they still seem to be lower than what we obtained at our end. Are you using all 1500 cases for cross-validation (i.e. 1295 cases with human expert annotations + 205 cases with AI annotations)? Did you follow the exact steps outlined here and here?

Hope this helps.


Re: Difficulties directly using the pretrained models  

  By: joeran.bosma on Aug. 6, 2022, 11:35 a.m.

Hi Demster,

Your evaluation of the nnU-Net predictions doesn't appear to be right. You should evaluate the npz/softmax predictions, rather than the binarized predictions from nnU-Net (in validation_raw_postprocessed).

Please check out the final two links Anindo sent for the evaluation instructions on this.
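
For reference, a minimal sketch of that kind of evaluation with picai_eval, where the folder layout, the 'softmax' key, and the lesion channel index are assumptions about nnU-Net's (v1) .npz outputs:

    # Sketch: evaluate the softmax predictions (not the binarized masks) by
    # extracting lesion candidates from each softmax map and scoring them with
    # picai_eval. Paths, the 'softmax' key, and the channel index are assumptions.
    import numpy as np
    import SimpleITK as sitk
    from pathlib import Path
    from picai_eval import evaluate
    from report_guided_annotation import extract_lesion_candidates

    pred_dir = Path("validation_raw")  # folder with nnU-Net's .npz softmax outputs (placeholder)
    label_dir = Path("labels")         # folder with ground-truth annotations (placeholder)

    y_det, y_true = [], []
    for npz_path in sorted(pred_dir.glob("*.npz")):
        softmax = np.load(npz_path)["softmax"][1]            # csPCa probability map (assumed key/channel)
        y_det.append(extract_lesion_candidates(softmax)[0])  # detection map of lesion candidates
        label = sitk.ReadImage(str(label_dir / (npz_path.stem + ".nii.gz")))
        y_true.append(sitk.GetArrayFromImage(label))         # must share the softmax map's spatial grid

    metrics = evaluate(y_det=y_det, y_true=y_true)
    print(metrics.auroc, metrics.AP)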

Hope this helps, Joeran