Instance-based Evaluation of Dice and F1 Score  

  By: AWinder on Aug. 23, 2024, 7:57 p.m.

Hello,

I have a concern about the instance-based Dice and F1 metrics that seem to be used from Panoptica. They appear to be set up so that each predicted and ground-truth segmentation is divided into connected components, and each connected component is only scored toward the Dice and F1 scores under very specific conditions:

  1. The ground truth component must be at least half-covered by the predicted component.

  2. The predicted component from condition 1 must have no more voxels outside that ground-truth component than inside it.

Otherwise, the ground-truth component is not counted as having any true positives, which easily leads to a Dice and F1 of zero.
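
To make sure I understand the rule, here is a minimal sketch of how I read the instance-based Dice (my own approximation using scipy's connected-component labelling, not Panoptica's actual implementation):

import numpy as np
from scipy import ndimage

def instance_dice(gt, pred):
    # Sketch of my reading of the rule: each ground-truth component only scores
    # if (1) it is at least half-covered by one predicted component and (2) that
    # predicted component has no more voxels outside the ground-truth component
    # than inside it; otherwise the component contributes zero.
    gt_lab, n_gt = ndimage.label(np.asarray(gt))
    pred_lab, _ = ndimage.label(np.asarray(pred))
    if n_gt == 0:
        return float("nan")
    scores = []
    for g in range(1, n_gt + 1):
        g_mask = gt_lab == g
        overlapping = np.unique(pred_lab[g_mask])
        best = 0.0
        for p in overlapping[overlapping > 0]:
            p_mask = pred_lab == p
            inter = np.logical_and(g_mask, p_mask).sum()
            outside = p_mask.sum() - inter
            if inter >= 0.5 * g_mask.sum() and outside <= inter:  # conditions 1 and 2
                best = max(best, 2 * inter / (g_mask.sum() + p_mask.sum()))
        scores.append(best)
    return float(np.mean(scores))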

Already in the preliminary evaluation, we can see several instances where this has occurred. For example, submission ID 6bf6c8af-c10e-4da5-90d4-89edae7a40cf using the Dynunet algorithm scores a mean Dice of only 0.44 because it scored 0.87 on one image but 0.00 on the other. A lower average Dice isn't inherently a problem, because the models' scores are only compared to each other and not interpreted in absolute terms, but I am concerned that this instance-based evaluation metric doesn't reflect what actually constitutes a clinically useful lesion segmentation.

Consider a couple of toy examples in one dimension, where 1 is a lesion area and 0 is healthy brain parenchyma:

GROUND_TRUTH = [0, 1, 0, 1, 0]

PREDICTION = [0, 1, 1, 1, 0]

Conventional Dice = 0.80; Instance-based Dice = 0.00

Here, the predicted component that overlaps each ground-truth component has one voxel inside that component and two outside. Since 2 > 1, neither component scores, resulting in a metric of zero.

GROUND_TRUTH = [0, 1, 0, 1, 0]

PREDICTION = [1, 1, 0, 1, 1]

Conventional Dice = 0.67; Instance-based Dice = 0.67

Here, the predicted component that overlaps each ground-truth component has one voxel inside that component and one outside. Since 1 <= 1, both components score, resulting in a metric that is equivalent to the conventional Dice.
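
Reusing the instance_dice sketch above (again, only my approximation), the two toy examples reproduce the numbers quoted:

def conventional_dice(gt, pred):
    gt, pred = np.asarray(gt, bool), np.asarray(pred, bool)
    inter = np.logical_and(gt, pred).sum()
    return 2 * inter / (gt.sum() + pred.sum())

gt = [0, 1, 0, 1, 0]
print(conventional_dice(gt, [0, 1, 1, 1, 0]), instance_dice(gt, [0, 1, 1, 1, 0]))  # ~0.80, 0.00
print(conventional_dice(gt, [1, 1, 0, 1, 1]), instance_dice(gt, [1, 1, 0, 1, 1]))  # ~0.67, ~0.67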

Comparing the two scenarios, the latter example has a significantly greater instance-based Dice despite having a lower conventional Dice and misclassifying more voxels in absolute terms. Furthermore, I suspect that most strokes would produce a pattern of focal ischemia in which the former prediction is more biologically plausible than the latter. Is the model prediction in the former scenario truly meant to be penalized so heavily compared to the latter? The same concern holds if the hypothetical scenario is extended into two dimensions.

Given that the number of false positives is the same in both scenarios, is the latter truly preferable from a computational and clinical perspective? Additionally, do these metrics potentially suffer from a floor effect whereby models scoring zero become incomparable to each other?

Best, Anthony

Re: Instance-based Evaluation of Dice and F1 Score  

  By: ezequieldlrosa on Aug. 26, 2024, 8:07 a.m.

Hi Anthony,

Thanks for sharing your thoughts on our evaluation. Some comments on my end:

1) The Dice score we compute is a traditional 'global' Dice. While Panoptica allows for instance-based Dice, we're sticking with the global approach. You can check our Jupyter notebook and compare the results with your own calculations.

2) The lesion-wise F1-score is based on an IoU overlap of at least 50%. This ties into how we define a True Positive, which isn't straightforward. A simple definition would consider a TP whenever there is at least one voxel of overlap between the ground truth and the prediction, but this has many limitations, both computational and clinical (we saw them in our previous ISLES'22 edition, which is why we've adopted a more refined approach). Clinically, proper instance identification helps us differentiate various stroke patterns, like embolic showers, fragmented lesions, or single massive strokes, and understand the involvement of multiple or watershed vascular territories, which is crucial for identifying, e.g., specific stroke etiologies.
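
Schematically, and simplifying away the details of the actual Panoptica implementation, the idea is that a ground-truth lesion only counts as a TP when a predicted lesion overlaps it with an IoU of at least 0.5. A rough sketch of that matching (illustration only, not our evaluation code):

import numpy as np
from scipy import ndimage

def lesionwise_f1(gt, pred, iou_thresh=0.5):
    # Rough sketch: a ground-truth lesion is a TP when some predicted lesion
    # overlaps it with IoU >= iou_thresh, and each predicted lesion can be
    # matched to at most one ground-truth lesion.
    gt_lab, n_gt = ndimage.label(np.asarray(gt))
    pred_lab, n_pred = ndimage.label(np.asarray(pred))
    matched_pred, tp = set(), 0
    for g in range(1, n_gt + 1):
        g_mask = gt_lab == g
        for p in np.unique(pred_lab[g_mask]):
            if p == 0 or p in matched_pred:
                continue
            p_mask = pred_lab == p
            iou = np.logical_and(g_mask, p_mask).sum() / np.logical_or(g_mask, p_mask).sum()
            if iou >= iou_thresh:
                tp += 1
                matched_pred.add(p)
                break
    fn, fp = n_gt - tp, n_pred - len(matched_pred)
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else float("nan")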

3) Regarding the 'mean values' of the metrics you mentioned: Just a quick note that, while GC reports them, we don't necessarily use them. Since 2015, the ISLES challenge ranking has been based on a 'rank then aggregate' method. This approach offers several advantages in terms of robustness, which is why other challenges, like BraTS, have also adopted it. For more insights into this ranking method, you might want to check out the DKFZ Data Science and Digital Oncology group in Heidelberg—they have published several papers discussing it in detail.
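
To give a flavour of 'rank then aggregate' with a toy example (just an illustration, not our actual ranking code): submissions are ranked within each case, and the per-case ranks are then aggregated, so a single catastrophic case costs at most one rank instead of dragging down a mean score:

import pandas as pd

# Toy illustration only: three hypothetical teams scored on two cases.
scores = pd.DataFrame({
    "team": ["A", "B", "C"] * 2,
    "case": ["case1"] * 3 + ["case2"] * 3,
    "dice": [0.87, 0.80, 0.75, 0.00, 0.70, 0.60],
})
scores["rank"] = scores.groupby("case")["dice"].rank(ascending=False)
print(scores.groupby("team")["rank"].mean().sort_values())
# Team A's single 0.00 case costs it one rank here, whereas a mean Dice
# would place it last outright.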

I hope this clears things up!

Best, Ezequiel

Re: Instance-based Evaluation of Dice and F1 Score  

  By: AWinder on Aug. 26, 2024, 10:22 p.m.

Thank you very much for such a detailed response!

Regarding your point 2, I now better understand the clinical rationale behind using connected-component-based metrics. Similarly, I can appreciate the rationale behind the 'rank then aggregate' method described in your point 3. I will read the papers that you mentioned, but my point in raising the mean Dice was not that I think it is a good challenge metric; rather, it suggested that something computationally unexpected was occurring during evaluation. When you say that you intend to use a traditional 'global' Dice, do you refer to the following voxel-wise calculation?

2*TP / (2*TP + FP + FN)

where a TP is a voxel that is set to 1 in both the ground truth and the prediction, an FP is a voxel set to 0 in the ground truth but 1 in the prediction, and an FN is a voxel set to 1 in the ground truth and 0 in the prediction?
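
In code, the calculation I have in mind is simply:

import numpy as np

def global_dice(gt, pred):
    # Voxel-wise 'global' Dice: 2*TP / (2*TP + FP + FN).
    gt, pred = np.asarray(gt, bool), np.asarray(pred, bool)
    tp = np.logical_and(gt, pred).sum()
    fp = np.logical_and(~gt, pred).sum()
    fn = np.logical_and(gt, ~pred).sum()
    return 2 * tp / (2 * tp + fp + fn)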

If so, then I think there might be a hiccup in the Panoptica configuration. My original concern actually came from tinkering with the notebook that you sent: my calculations do not match those generated by Panoptica, and I can make the Panoptica evaluation return a Dice of zero with a very simple manipulation of the predicted lesion that does not zero out the conventional voxel-wise Dice.

If the TP, FP, and FN used to compute the Dice are not voxel-wise but are instead counts of how many ground-truth connected components are identified at an IoU of >0.5, then perhaps the numbers are correct, but I haven't checked this by hand, and from your response it sounds like this isn't the intention.
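
(For reference, if TP, FP, and FN were instead those component counts, the same formula applied to the counts would coincide with the lesion-wise F1; hypothetical helper purely for illustration:)

def count_based_dice(tp_count, fp_count, fn_count):
    # Hypothetical variant in which TP/FP/FN are matched/unmatched components
    # (from an IoU > 0.5 matching) rather than voxels; the formula is unchanged,
    # so this quantity would coincide with the lesion-wise F1.
    return 2 * tp_count / (2 * tp_count + fp_count + fn_count)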

I'm not familiar with the GC interface, but perhaps there would be a way to message you directly with an edited copy of your Jupyter notebook that demonstrates my concerns.