Instance-based Evaluation of Dice and F1 Score
By: AWinder on Aug. 23, 2024, 7:57 p.m.
Hello,
I have a concern about the instance-based Dice and F1 metrics that appear to be taken from Panoptica. They seem to be set up so that each predicted and ground-truth segmentation is divided into connected components, and a connected component only counts toward the Dice and F1 scores under two specific conditions:
1. The ground-truth component must be at least half-covered by the predicted component.
2. The predicted component from 1. must have no more voxels outside the ground-truth component from 1. than inside it.
Otherwise, the ground-truth component is not counted as having any true positives, which easily leads to a Dice and F1 of zero.
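If I understand the rule correctly, it reduces to a small predicate over a pair of connected components. Below is a minimal Python sketch of my reading of the two conditions; this is my own reconstruction for discussion, not Panoptica's actual code, and the function name is mine:

import numpy as np

def is_match(gt_comp, pred_comp):
    # gt_comp, pred_comp: boolean masks of one ground-truth and one
    # predicted connected component. My reconstruction of the rule,
    # not Panoptica's implementation.
    inside = np.logical_and(gt_comp, pred_comp).sum()  # predicted voxels inside the GT component
    outside = pred_comp.sum() - inside                 # predicted voxels outside the GT component
    half_covered = inside >= gt_comp.sum() / 2         # condition 1
    limited_spillover = outside <= inside              # condition 2
    return bool(half_covered and limited_spillover)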
Already in the preliminary evaluation, we can see several instances where this has occurred. For example, submission ID 6bf6c8af-c10e-4da5-90d4-89edae7a40cf using the Dynunet algorithm scores a mean Dice of only 0.44 because it scored 0.87 on one image but 0.00 on the other. Scoring a lower average Dice isn't inherently a problem, since the models' scores are only compared to each other rather than interpreted in absolute terms, but I am concerned that this instance-based metric doesn't reflect what actually constitutes a clinically useful lesion segmentation.
Consider a couple of toy examples in one dimension, where 1 is a lesion area and 0 is healthy brain parenchyma:
GROUND_TRUTH = [0, 1, 0, 1, 0]
PREDICTION = [0, 1, 1, 1, 0]
Conventional Dice = 0.80; Instance-based Dice = 0.00
Here, the single predicted component has, for each ground-truth component considered independently, one voxel inside that component and two outside. Since 2 > 1, neither ground-truth component is matched, and the metric is zero.
GROUND_TRUTH = [0, 1, 0, 1, 0]
PREDICTION = [1, 1, 0, 1, 1]
Conventional Dice = 0.67; Instance-based Dice = 0.67
Here, each predicted component has one voxel inside its ground-truth component and one outside. Since 1 ≤ 1, both components are matched, and the metric equals the conventional Dice.
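For concreteness, here is a short script that reproduces both sets of numbers, using scipy.ndimage.label for connected components. Again, this is my approximation of the metric as described above, not a call into Panoptica itself:

import numpy as np
from scipy import ndimage

def dice(a, b):
    # conventional voxel-wise Dice coefficient
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum())

def instance_dice(gt, pred):
    # Mean Dice over ground-truth components; a component only scores
    # if some predicted component satisfies both matching conditions,
    # otherwise it contributes zero.
    gt_lab, n_gt = ndimage.label(gt)
    pred_lab, n_pred = ndimage.label(pred)
    scores = []
    for g in range(1, n_gt + 1):
        gt_comp = gt_lab == g
        score = 0.0
        for p in range(1, n_pred + 1):
            pred_comp = pred_lab == p
            inside = np.logical_and(gt_comp, pred_comp).sum()
            outside = pred_comp.sum() - inside
            # both matching conditions (see sketch above)
            if inside >= gt_comp.sum() / 2 and outside <= inside:
                score = dice(gt_comp, pred_comp)
        scores.append(score)
    return float(np.mean(scores))

gt = np.array([0, 1, 0, 1, 0])
pred_a = np.array([0, 1, 1, 1, 0])  # first example
pred_b = np.array([1, 1, 0, 1, 1])  # second example
print(dice(gt, pred_a), instance_dice(gt, pred_a))  # 0.80, 0.00
print(dice(gt, pred_b), instance_dice(gt, pred_b))  # 0.67, 0.67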
Comparing the two scenarios, the latter has a substantially higher instance-based Dice despite a lower conventional Dice and more misclassified voxels in absolute terms. Furthermore, I suspect that most strokes produce a pattern of focal ischemia in which the former prediction is more biologically plausible than the latter. Is the model prediction in the former scenario truly meant to be penalized so heavily compared to the latter? To extend the hypothetical scenario into two dimensions:
Given that the number of false positives is the same in both cases, is the right-hand scenario truly preferable from a computational and clinical perspective? Additionally, do these metrics suffer from a floor effect whereby models scoring zero become incomparable to one another?
Best, Anthony