Hi Sakina,
Let's assume that you're using the baseline U-Net (semi-supervised) GC algorithm template as your base. In that case, when you execute the script 'picai_unet_semi_supervised_gc_algorithm/test.sh' on Linux/macOS, your system expects all the files in 'picai_unet_semi_supervised_gc_algorithm/test/' (i.e. the input images + the expected output predictions for the given algorithm) to exist for testing purposes. The same applies to Windows-based systems when executing 'picai_unet_semi_supervised_gc_algorithm/test.bat'.
Here, the test files cspca_detection_map.mha and cspca-case-level-likelihood.json are the expected output predictions for our provided trained model weights only. That's why these expected outputs don't match those of your independently trained model, and that is essentially what the error message is indicating. If you want to test the container for your own trained model, you need to replace those expected output prediction files with ones produced by your own model for that given input case (imaging + clinical information). In any case, testing before exporting your algorithm container is helpful for debugging, but it isn't a mandatory step and can be skipped.
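For reference, here is a minimal sketch of how you could swap in your own model's predictions as the new expected outputs. The 'test/' location and the two file names follow the template mentioned above; the 'my_model_output/' folder is a hypothetical placeholder for wherever your own inference run writes its results.

```python
import shutil
from pathlib import Path

# Folder with the predictions produced by *your* trained model for the test case
# (hypothetical path -- adjust to wherever your inference run writes its outputs).
my_outputs = Path("my_model_output")

# Expected-output location used by test.sh / test.bat in the GC algorithm template.
test_dir = Path("picai_unet_semi_supervised_gc_algorithm/test")

# Overwrite the expected outputs that shipped with the baseline weights, so the
# container test now validates against your own model's predictions instead.
shutil.copyfile(my_outputs / "cspca_detection_map.mha",
                test_dir / "cspca_detection_map.mha")
shutil.copyfile(my_outputs / "cspca-case-level-likelihood.json",
                test_dir / "cspca-case-level-likelihood.json")

print("Replaced expected outputs in", test_dir)
```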
"Also when using your baseline models only and training, there is still a huge gap between our performances. I would appreciate your help in understanding this too."
Assuming that you're using the same number of cases [1295 cases with human annotations (supervised), or 1295 cases with human annotations + 205 cases with AI annotations (semi-supervised)], preprocessed the same way, and have trained (default command with the same number of epochs, data augmentation, hyperparameters and model selection), 5-fold cross-validated (same splits as provided) and ensembled (using member models from all 5 folds) the baseline AI models the exact same way as indicated in the latest iteration of picai_baseline, your performance on the leaderboard should be similar to ours. Deviations may still exist owing to the stochasticity of optimizing deep learning models at train time: the same AI architecture, trained on the same data for the same number of training steps, can typically exhibit slightly different performance each run (Frankle et al., 2019).
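As a rough illustration of the ensembling step, the sketch below averages per-fold detection maps into one ensemble prediction. It assumes the five fold models have already written their detection maps as .mha files (the 'fold_*/cspca_detection_map.mha' paths are hypothetical), and it uses SimpleITK/numpy directly rather than the exact utilities in picai_baseline; deriving the case-level likelihood as the maximum of the averaged map is likewise a simplification.

```python
import numpy as np
import SimpleITK as sitk

# Hypothetical per-fold outputs for one case; adjust to your own directory layout.
fold_paths = [f"fold_{i}/cspca_detection_map.mha" for i in range(5)]

# Read each fold's detection map and stack them into one array.
images = [sitk.ReadImage(p) for p in fold_paths]
arrays = np.stack([sitk.GetArrayFromImage(img) for img in images])

# Simple member averaging across the 5 folds.
ensemble = arrays.mean(axis=0)

# Case-level likelihood, here taken as the maximum of the averaged map
# (a simplification; your pipeline may derive it differently).
case_level_likelihood = float(ensemble.max())

# Write the ensembled detection map with the original geometry preserved.
out = sitk.GetImageFromArray(ensemble)
out.CopyInformation(images[0])
sitk.WriteImage(out, "cspca_detection_map.mha")

print("Case-level likelihood:", case_level_likelihood)
```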
You should also observe a substantial difference between your 5-fold cross-validation metrics on the training dataset of 1500 cases and your/our performance on the leaderboard using the hidden validation cohort of 100 cases. This is to be expected, due to the factors discussed here.
Hope this helps.