Low Dice Score  

  By: shipc1220 on July 15, 2024, 7:55 p.m.

Hi Organizers,

I have successfully tested my model using both the local Docker environment and the Algorithms platform, achieving an average Dice score of over 0.70 across 23 vascular categories on the validation set. However, when I submitted the model for the development phase, the score was significantly lower, around 0.05. Could you please explain the potential reasons for this discrepancy? Are there any differences between the test data and the training data that might have caused this issue?

Thank you.

 Last edited by: shipc1220 on July 15, 2024, 7:59 p.m., edited 3 times in total.

Re: Low Dice Score  

  By: imran.muet on July 15, 2024, 9:31 p.m.

Dear Participant,

Thank you for contacting us about your concerns. We understand the challenges you're facing and want to assure you that the validation data is designed to match the training dataset in terms of characteristics like intensity range, spacing, and dimensions.

It's important to note that many participants have successfully submitted Docker images multiple times, achieving a Dice score of 0.75. This shows that with well-written code, it's feasible to run your Docker image effectively and achieve good results.

Here are some suggestions that may assist you:

  • Inference Script: Use the provided inference.py file as is. Only modify the section where you load your model, keeping the rest unchanged.

  • Dice Coefficient Calculation: Calculate the Dice coefficient using the monai.metrics.DiceMetric class, ensuring include_background is set to False for local performance evaluation (a minimal sketch follows this list).

  • Image Orientation: Ensure image orientation and other properties are consistent. Even with minor variations, the Dice score shouldn't drop significantly.
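
For reference, below is a minimal sketch of that local Dice computation. It assumes one-hot tensors along the channel dimension; the class count and the random label maps are placeholders for your own predictions and ground truth.

    import torch
    from monai.metrics import DiceMetric
    from monai.networks.utils import one_hot

    num_classes = 24  # 23 vascular classes + background (placeholder count)

    # Integer label maps of shape (batch, 1, H, W, D); replace these random
    # tensors with your model's argmax output and the reference labels.
    pred_labels = torch.randint(0, num_classes, (1, 1, 64, 64, 64))
    gt_labels = torch.randint(0, num_classes, (1, 1, 64, 64, 64))

    # One-hot encode along the channel dimension, as DiceMetric expects.
    pred_onehot = one_hot(pred_labels, num_classes=num_classes, dim=1)
    gt_onehot = one_hot(gt_labels, num_classes=num_classes, dim=1)

    # include_background=False drops channel 0 from the average.
    dice_metric = DiceMetric(include_background=False, reduction="mean")
    dice_metric(y_pred=pred_onehot, y=gt_onehot)
    print("mean Dice (background excluded):", dice_metric.aggregate().item())
    dice_metric.reset()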

Re: Low Dice Score  

  By: shipc1220 on July 16, 2024, 1:40 a.m.

Dear Organizers,

Thank you for your prompt response and for addressing our concerns. I appreciate your efforts to ensure that the validation data matches the training dataset in characteristics like intensity range, spacing, and dimensions.

However, I would like to clarify one crucial point: is the orientation of the test data consistent with that of the training data? All training data is oriented in RAI, and our model (based on the nnU-Net framework) was trained in RAI. Upon reviewing the code in your repository, specifically the scripts training/utils/dataset.py and validation/utils/dataset.py, we noticed that there is a transform step Orientationd(keys=["image", "label"], axcodes="RAS").

However, this transformation step seems to be absent from the inference script Docker_Preparation_For_Submission/inference.py. Is it possible that the test data orientation was converted to RAS, or might there be another reason for this discrepancy?
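
For concreteness, this is how we are checking the orientation codes on our side, using nibabel with a placeholder filename. (We are aware that ITK-SNAP and nibabel may report these three-letter codes under opposite conventions, "from" versus "toward", which could be part of the confusion.)

    import nibabel as nib

    img = nib.load("train_case_0001.nii.gz")  # placeholder filename
    # Axis codes in nibabel's "toward" convention, e.g. ('R', 'A', 'S').
    print(nib.aff2axcodes(img.affine))

    # Reorient to canonical RAS+ if needed.
    img_ras = nib.as_closest_canonical(img)
    print(nib.aff2axcodes(img_ras.affine))  # ('R', 'A', 'S')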

Additionally, I have observed that other participants, such as “pzhhhhh” and “JL_Ng”, have reported similarly low Dice scores. Specifically, I noticed comparable results in the "Left_Subclavian_Artery_Dice_Score", "Zone_1_Dice_Score", and "Zone_2_Dice_Score". It seems we are encountering the same issue.

Given the limited number of submission attempts available, this issue is quite urgent.

Thank you for your understanding and support.

 Last edited by: shipc1220 on July 16, 2024, 1:42 a.m., edited 2 times in total.

Re: Low Dice Score  

  By: imran.muet on July 16, 2024, 1:56 p.m.

Hi,

Thank you for your inquiry. Here are the answers to your questions:

  1. Orientation of Test Data: The test data is oriented the same way as the training data, which is RAS.
  2. Orientation Transformation: You are correct that there is no Orientationd(keys=["image", "label"], axcodes="RAS") step for the testing dataset. This is because the testing data is already in RAS orientation. The shared code is meant to assist you in preparing your code and is not the final version. If you prefer, you can add this transformation to your code (a sketch follows this list).
  3. Dice Scores for Difficult Regions: It is true that the regions "Left_Subclavian_Artery," "Zone_1," and "Zone_2" are challenging to segment. However, their low Dice scores should not significantly affect your overall Dice score. Ideally, your worst-case scenario Dice score should still be above 0.5.
  4. Submission Attempts: If your attempts are unsuccessful, you can keep uploading until July 30, 2024. You are limited to five successful submissions, but unsuccessful attempts do not count toward this limit.
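
If you decide to add the transformation yourself, a minimal sketch of where Orientationd could sit in an inference pipeline is shown below. The transform list is an assumption modeled on training/utils/dataset.py, not the exact contents of inference.py; remember that anything you reorient or resample must be undone before saving, so that the output matches the input volume's geometry.

    from monai.transforms import (
        Compose, EnsureChannelFirstd, LoadImaged, Orientationd, Spacingd,
    )

    # Assumed preprocessing for inference; adjust keys/pixdim to your setup.
    infer_transforms = Compose([
        LoadImaged(keys=["image"]),
        EnsureChannelFirstd(keys=["image"]),
        # Pin the orientation the network was trained on.
        Orientationd(keys=["image"], axcodes="RAS"),
        Spacingd(keys=["image"], pixdim=(1.0, 1.0, 1.0), mode="bilinear"),
    ])

    data = infer_transforms({"image": "case_0001.nii.gz"})  # placeholder path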

We understand this situation can be frustrating. Here are some suggestions that might help you resolve your issue:

  • Dice Coefficient Calculation: Ensure you are computing the Dice coefficient using the monai.metrics.DiceMetric class with include_background set to False.
  • Improving Dice Score: To potentially improve your Dice score, consider experimenting with different training and validation splits. For instance, try keeping five, two, or even one image for validation and use the remaining images for training. Train and submit the model that yields the best results from these variations.

If you need further assistance, we can arrange a Zoom meeting. Please email me at a.cosman@ufl.edu to schedule an online meeting.

Re: Low Dice Score  

  By: shipc1220 on July 16, 2024, 3:47 p.m.

Thank you for the clarification regarding the orientation of the test data. I downloaded the training data and checked it in ITK-SNAP: it is consistently in RAI orientation, and it remains RAI even after re-downloading. Could a version discrepancy in the dataset be causing this inconsistency?

Re: Low Dice Score  

  By: shipc1220 on July 17, 2024, 3:38 p.m.

Dear Organizers,

I have successfully tested my model using both the local Docker environment and the Algorithms platform, achieving an average Dice score of over 0.70 across 23 vascular categories on the validation set. However, there is a significant drop in the Dice score on the test set, which is perplexing. Could you please consider releasing one example from the test set (only the imaging data) to allow participants to verify consistency across the training and test sets on the Algorithms platform?

 Last edited by: shipc1220 on July 17, 2024, 3:48 p.m., edited 1 time in total.

Re: Low Dice Score  

  By: imran.muet on July 21, 2024, 3:52 p.m.

Hi,

The evaluation code that runs in the background on the Grand-Challenge website has been shared. You can access it via this link.

Regarding the answers to your questions:

Orientation of Test Data: When you save your segmentation files, they should have the same properties as the original input volume used to create those segmentations, including orientation, spacing, origin, direction, etc.
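
As an illustration, one way to make the saved segmentation inherit those properties is shown below, using SimpleITK; the filenames and the all-zero prediction array are placeholders.

    import numpy as np
    import SimpleITK as sitk

    original = sitk.ReadImage("input_volume.mha")  # placeholder filename

    # Your predicted label map as a numpy array in (z, y, x) index order;
    # an all-zero array stands in for the real prediction here.
    pred = np.zeros(sitk.GetArrayFromImage(original).shape, dtype=np.uint8)

    seg = sitk.GetImageFromArray(pred)
    seg.CopyInformation(original)  # copies origin, spacing, and direction
    sitk.WriteImage(seg, "output_segmentation.mha", useCompression=True)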

Orientation Transformation: You are correct that the step Orientationd(keys=["image", "label"], axcodes="RAS") is not present in the inference.py file. The shared code is intended to assist you in preparing your own code and is not the final version. If you prefer, you can add this transformation to your code.

For visualization purposes, please use 3D Slicer.

If you need further assistance, we can arrange a Zoom meeting. Please email me at a.cosman@ufl.edu to schedule an online meeting.

 Last edited by: imran.muet on Aug. 7, 2024, 7:09 p.m., edited 1 time in total.

Re: Low Dice Score  

  By: lWM on Aug. 7, 2024, 3:38 p.m.

Hello,

I would like to ask: how was this problem finally solved?

I believe I am suffering from a similar issue related to the orientation of the input data: internal evaluation on validation subsets gives ~0.73-0.75, but there is a dramatic drop in performance on the external validation set (DSC close to 0 for any Left/Right structures).

Best,

Re: Low Dice Score  

  By: imran.muet on Aug. 7, 2024, 7:13 p.m.

Please see the informative discussion about orientation here. We hope you find it helpful. If you have any further questions, feel free to post them here.

Re: Low Dice Score  

  By: lWM on Aug. 7, 2024, 7:40 p.m.

Thanks! From the discussion, it seems that the assumptions about the initial orientation are quite important, and that they differ between the training and test sets.

Could you please share a single case from the training set (image + labels), preprocessed in a way that follows the orientation of the hidden validation/test sets? It would make the debugging process much easier and help avoid trial-and-error submission attempts.

Best,