Low Prediction Metrics
By: shipc1220 on July 18, 2024, 4:38 a.m.
Dear Organizers,
I have encountered an unusual issue with my prediction metrics. When I test locally and on the Grand Challenge Algorithm platform, the Oral-pharyngeal segmentation output files closely match the results I expect. However, after submitting my results to the Preliminary Phase, the reported metrics are extremely low.
To investigate further, I used the provided evaluation script (evaluation.py) to compute Dice and Hausdorff Distance (HD) locally. The results differ significantly from those shown on the Grand Challenge platform. For example, for ToothFairy2F_065_0000.mha (where the prediction looks consistent with the ground truth in the ITK-SNAP 3D view), the average Dice coefficient I compute locally is 0.914, whereas the platform reports 0.091 (as seen here).
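For reference, this is roughly what I ran locally (a minimal sketch using SimpleITK rather than the exact evaluation.py code; the file paths are placeholders for my local prediction and the labelsTr ground truth, and the per-label averaging may not match yours exactly):

```python
import numpy as np
import SimpleITK as sitk

# Placeholder paths: my local prediction and the corresponding labelsTr ground truth.
PRED_PATH = "predictions/ToothFairy2F_065.mha"
GT_PATH = "labelsTr/ToothFairy2F_065.mha"

def per_label_metrics(pred_path, gt_path):
    pred = sitk.ReadImage(pred_path)
    gt = sitk.ReadImage(gt_path)

    # Labels present in the ground truth (background 0 excluded).
    labels = [int(l) for l in np.unique(sitk.GetArrayViewFromImage(gt)) if l != 0]

    overlap = sitk.LabelOverlapMeasuresImageFilter()
    hausdorff = sitk.HausdorffDistanceImageFilter()

    results = {}
    for label in labels:
        # Binarize each structure and compare prediction vs. ground truth.
        pred_bin = sitk.BinaryThreshold(pred, label, label, 1, 0)
        gt_bin = sitk.BinaryThreshold(gt, label, label, 1, 0)

        overlap.Execute(gt_bin, pred_bin)
        dice = overlap.GetDiceCoefficient()

        # The Hausdorff filter can fail on empty masks, so guard against that.
        try:
            hausdorff.Execute(gt_bin, pred_bin)
            hd = hausdorff.GetHausdorffDistance()
        except RuntimeError:
            hd = float("nan")

        results[label] = (dice, hd)
    return results

if __name__ == "__main__":
    metrics = per_label_metrics(PRED_PATH, GT_PATH)
    for label, (dice, hd) in sorted(metrics.items()):
        print(f"label {label}: Dice={dice:.3f}, HD={hd:.2f}")
    print(f"mean Dice = {np.mean([d for d, _ in metrics.values()]):.3f}")
```

Note that in this sketch the Dice is averaged only over labels that appear in the ground truth, which is one place where my computation and the platform's could diverge.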
This discrepancy is not limited to the Dice coefficients; it also affects HD and other cases. I would like to ask whether the input images and ground-truth labels you use internally are the same as those in imagesTr and labelsTr. Have you run similar tests to validate the consistency of the evaluation script? Could this issue stem from the evaluation script or from the Grand Challenge platform itself?
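If it helps with checking the data question, I can share checksums of my local copies so you can compare them against the files you use internally; for example (the paths below are just my local layout):

```python
import hashlib
from pathlib import Path

def md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks to keep memory use low."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical local layout; adjust to wherever imagesTr/labelsTr were unpacked.
for name in ["imagesTr/ToothFairy2F_065_0000.mha", "labelsTr/ToothFairy2F_065.mha"]:
    p = Path(name)
    if p.exists():
        print(f"{name}: {md5(p)}")
```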
Thank you for your assistance in resolving this matter.