Validation Metric Calculation  

  By: 坤坤kk on May 29, 2024, 12:56 p.m.

Hello, I have trained a model on the data from the challenge, but I ran into an issue during evaluation: there are 30 segmentation classes in total, yet each volume contains only a small subset of them. During the validation phase, I am unsure whether the evaluation metrics should be computed over only the classes present in the labels (one-to-one) or over all 30 classes. Computing the metrics over all classes results in a significantly lower average Dice coefficient for each volume.

Re: Validation Metric Calculation  

  By: YSang on May 30, 2024, 7:48 a.m.

Thank you for checking with us.

For both tasks, evaluation involves a matching step: each ground-truth label is matched to the predicted channel with the highest IoU within the same anatomy.

We have recently uploaded the evaluation code, which you can run locally. Please let us know if you find any mistake or confusion in it.
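For reference, the IoU-based matching described above can be sketched roughly as follows. This is a minimal illustration, not the official evaluation code: the function names are my own, and the "within the same anatomy" constraint (restricting which predicted labels are eligible candidates for each ground-truth label) is omitted for brevity.

```python
import numpy as np

def iou(mask_a, mask_b):
    """IoU between two boolean masks."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(mask_a, mask_b).sum() / union

def match_labels(gt, pred):
    """Match each ground-truth class to the predicted class with highest IoU.

    Returns {gt_class: (best_pred_class, best_iou)}. Background (0) is skipped.
    In the actual evaluation, candidates would also be restricted to the
    same anatomy as the ground-truth class.
    """
    matches = {}
    pred_classes = [p for p in np.unique(pred) if p != 0]
    for g in np.unique(gt):
        if g == 0:
            continue
        best_p, best_iou = 0, 0.0
        for p in pred_classes:
            cur = iou(gt == g, pred == p)
            if cur > best_iou:
                best_p, best_iou = int(p), cur
        matches[int(g)] = (best_p, best_iou)
    return matches
```

Note that because the matching is by highest IoU rather than by class index, a prediction can score well here even if its channel numbering does not line up with the label's class ids.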

Re: Validation Metric Calculation  

  By: 坤坤kk on May 30, 2024, 1:34 p.m.

Thank you for your reply. I ran the evaluation code you provided and obtained the following metrics for case 100.mha: {'fracture_iou': 9.09340558532654e-05, 'fracture_hd95': 159.62318828582764, 'fracture_assd': 42.99843762715658, 'anatomical_iou': 0.00026462504194341244, 'anatomical_hd95': 145.51675643920896, 'anatomical_assd': 37.63869603474935}.

However, I used a category-specific computation, meaning that only the classes present in the label are considered, and each is compared against the same class in the predicted volume. For instance, if the label contains only classes 1, 11, and 21, then only these three classes from both the label and the predicted volume are matched; Dice and HD95 are computed for each, and the three per-class values are averaged. The results are: case 100.mha mean_dice: 0.955612 mean_hd95: 0.978553.

There is a noticeable discrepancy, and I am uncertain about the cause. Furthermore, classes 1, 11, and 21 score a high Dice in my computation, yet in your evaluation code the IoU for these classes is lower than for all other classes.
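To make the comparison concrete, the category-specific computation I described can be sketched like this (a simplified illustration with made-up function names, showing only the Dice part; HD95 would be averaged the same way over the classes present in the label):

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice coefficient between two boolean masks."""
    denom = mask_a.sum() + mask_b.sum()
    if denom == 0:
        return 1.0  # both empty: treat as a perfect match
    return 2.0 * np.logical_and(mask_a, mask_b).sum() / denom

def mean_dice_present_classes(gt, pred):
    """Average Dice over only the foreground classes present in the label,
    comparing each class id directly against the same id in the prediction."""
    scores = [dice(gt == c, pred == c) for c in np.unique(gt) if c != 0]
    return float(np.mean(scores)) if scores else 1.0
```

The key difference from the challenge evaluation is that this matches classes by id rather than by highest IoU, and ignores classes absent from the label, which could explain part of the discrepancy.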

Re: Validation Metric Calculation  

  By: YSang on June 3, 2024, 1:47 a.m.

Please email me one of your network outputs so we can look into it.