Low Prediction Metrics  

  By: shipc1220 on July 18, 2024, 4:38 a.m.

Dear Organizers,

I have encountered an unusual issue with my prediction metrics. When I test locally and when I run my algorithm on the Grand Challenge Algorithm platform, the Oral-pharyngeal segmentation output files closely match the results I obtain. However, after submitting my results to the Preliminary Phase, the reported metrics are extremely low.

To investigate further, I used the provided evaluation script (evaluation.py) to calculate the Dice coefficient and Hausdorff Distance (HD) locally. The results were significantly different from those shown on the Grand Challenge platform. For example, for ToothFairy2F_065_0000.mha (where the prediction appears consistent with the ground truth in the ITK-SNAP 3D visualization), the average Dice coefficient I calculated locally is 0.914, yet the platform displays a Dice coefficient of 0.091.
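For anyone who wants to reproduce a similar local check, here is a minimal sketch of one way to compute per-label Dice and HD with SimpleITK. This is not the official evaluation.py; the file paths and the label handling are assumptions and should be adapted to your own layout.

```python
# Minimal sketch (not the challenge's official evaluation.py) for spot-checking
# per-label Dice and Hausdorff Distance on a single case with SimpleITK.
import SimpleITK as sitk

PRED_PATH = "predictions/ToothFairy2F_065_0000.mha"  # hypothetical local path
GT_PATH = "labelsTr/ToothFairy2F_065.mha"            # hypothetical local path


def per_label_metrics(pred_path: str, gt_path: str):
    pred = sitk.ReadImage(pred_path)
    gt = sitk.ReadImage(gt_path)

    # Both filters below assume the two images share the same geometry
    # (size, spacing, origin, direction); resample first if they do not.
    labels = sorted(set(sitk.GetArrayViewFromImage(gt).ravel().tolist()) - {0})

    overlap = sitk.LabelOverlapMeasuresImageFilter()
    hausdorff = sitk.HausdorffDistanceImageFilter()

    results = {}
    for label in labels:
        # Extract a binary mask for the current label from each volume.
        pred_bin = sitk.BinaryThreshold(pred, label, label, 1, 0)
        gt_bin = sitk.BinaryThreshold(gt, label, label, 1, 0)

        overlap.Execute(gt_bin, pred_bin)
        dice = overlap.GetDiceCoefficient()

        # HausdorffDistanceImageFilter raises if one of the masks is empty.
        try:
            hausdorff.Execute(gt_bin, pred_bin)
            hd = hausdorff.GetHausdorffDistance()
        except RuntimeError:
            hd = float("nan")

        results[label] = (dice, hd)
    return results


if __name__ == "__main__":
    for label, (dice, hd) in per_label_metrics(PRED_PATH, GT_PATH).items():
        print(f"label {label}: Dice={dice:.3f}, HD={hd:.2f}")
```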

This discrepancy is not limited to the Dice coefficient; it also affects HD and other cases. I would like to ask whether the input images and ground-truth labels you use internally are the same as those in imagesTr and labelsTr. Have you performed similar tests to validate the consistency of the evaluation script? Could this issue stem from discrepancies in the evaluation script or in the Grand Challenge platform itself?
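For completeness, a small consistency check along these lines can reveal geometry or label-value mismatches between a prediction and the ground truth, which can drive Dice scores toward zero even when the segmentation looks correct. The paths below are hypothetical and this is not part of the challenge tooling.

```python
# Assumed diagnostic sketch: compare geometry and label values between a
# prediction and its ground-truth label map.
import SimpleITK as sitk
import numpy as np

def check_consistency(pred_path: str, gt_path: str) -> None:
    pred = sitk.ReadImage(pred_path)
    gt = sitk.ReadImage(gt_path)

    # Geometry must match for voxel-wise metrics to be meaningful.
    print("size      pred:", pred.GetSize(), " gt:", gt.GetSize())
    print("spacing   pred:", pred.GetSpacing(), " gt:", gt.GetSpacing())
    print("origin    pred:", pred.GetOrigin(), " gt:", gt.GetOrigin())
    print("direction pred:", pred.GetDirection(), " gt:", gt.GetDirection())

    # Label values must also agree between prediction and ground truth.
    pred_labels = np.unique(sitk.GetArrayViewFromImage(pred))
    gt_labels = np.unique(sitk.GetArrayViewFromImage(gt))
    print("labels in prediction:  ", pred_labels)
    print("labels in ground truth:", gt_labels)
    print("labels only in prediction:",
          sorted(set(pred_labels.tolist()) - set(gt_labels.tolist())))

check_consistency("predictions/ToothFairy2F_065_0000.mha",
                  "labelsTr/ToothFairy2F_065.mha")  # hypothetical paths
```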

Thank you for your assistance in resolving this matter.

 Last edited by: shipc1220 on July 21, 2024, 5:57 a.m., edited 1 time in total.

Re: Low Prediction Metrics  

  By: llumetti on July 21, 2024, 1:42 p.m.

Dear shipc1220,

I have responded to your question on our GitHub page. For any further discussion on this topic, please refer to the following link: https://github.com/AImageLab-zip/ToothFairy/issues/9. If anyone else is interested in joining the conversation, please contribute on GitHub to keep everything in one place.

Best regards,
Luca Lumetti