Notes on empty predicted masks  

  By: tom-julius on June 4, 2025, 10:31 a.m.

Dear participants,

We would like to notify you about the following updated paragraph on the metrics page: https://trackrad2025.grand-challenge.org/metrics/


Missing output on single frames

In cases where an algorithm produces no output on a frame for which a ground truth label is available, the following default metric values will be used: DSC = 0; HD95, MASD, and CD = the image size along the largest dimension in mm; the dose for that frame set to zero; and the inference time calculated as in normal cases.
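
For illustration only, and not as the official evaluation code, the fallback could look roughly like the following sketch; image_shape and spacing_mm are hypothetical names introduced here:

def default_metrics_for_empty_frame(image_shape, spacing_mm):
    # Largest physical extent of the image in mm (voxel count * spacing per axis).
    largest_extent_mm = max(n * sp for n, sp in zip(image_shape, spacing_mm))
    return {
        "DSC": 0.0,
        "HD95": largest_extent_mm,
        "MASD": largest_extent_mm,
        "CD": largest_extent_mm,
        "dose": 0.0,  # dose for that frame set to zero
        # inference time is still measured as in normal cases
    }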

ATTENTION!

These default metric values are very poor, and it is highly recommended to prevent them from being applied. We suggest repeating the previous non-empty predicted frame, although other approaches may work as well. This can be implemented by adding:

import numpy as np

predicted_labels = ...  # assuming predicted_labels.shape == (W, H, T)

for i in range(1, predicted_labels.shape[2]):
    pass  # generate the prediction for frame i here

    # If the prediction for this frame is empty, reuse the previous prediction.
    if np.sum(predicted_labels[:, :, i]) == 0:
        predicted_labels[:, :, i] = predicted_labels[:, :, i - 1]
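
Note that the loop starts at index 1 and therefore assumes the prediction for the first frame is non-empty; because the copy happens frame by frame, every subsequent empty frame inherits the most recent non-empty prediction.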

Sincerely, Tom

Re: Notes on empty predicted masks  

  By: Cedric on June 4, 2025, 3:43 p.m.

Dear Tom,

Thank you for the update and the clarification regarding handling empty predictions.

I would like to kindly ask whether it would be possible to obtain the Dice scores per frame for the test set. This level of detail is currently not accessible through the evaluation interface.

Best regards, Cédric

Re: Notes on empty predicted masks  

  By: tom-julius on June 4, 2025, 7:18 p.m.

Hello Cédric,

Thank you for your interest in our challenge.

In general, we have assumed that per-frame metrics should not be provided for the (final) testing set, as they could allow or encourage overfitting.

For this purpose, the public labeled dataset can and should be used, for example by extending the evaluation script here.
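
As a rough illustration (not the official evaluation script), per-frame Dice scores could be computed locally along these lines, assuming ground truth and predicted masks are available as binary arrays of shape (W, H, T):

import numpy as np

def per_frame_dice(gt_labels, predicted_labels):
    # Dice score for each frame of two binary (W, H, T) label arrays.
    scores = []
    for t in range(gt_labels.shape[2]):
        gt = gt_labels[:, :, t].astype(bool)
        pred = predicted_labels[:, :, t].astype(bool)
        denom = gt.sum() + pred.sum()
        # Define Dice as 1 when both masks are empty, to avoid division by zero.
        scores.append(1.0 if denom == 0 else 2.0 * np.logical_and(gt, pred).sum() / denom)
    return np.array(scores)

Aggregates comparable to the per-case mean, std, min, and max could then be obtained with scores.mean(), scores.std(), scores.min(), and scores.max().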

For the preliminary testing set, one could perhaps argue that per-frame metrics would allow participants to debug very specific issues. I would counter that the provided per-case mean, std, min, and max values should give a sufficient indication of whether any catastrophic failure occurred, and that a labeled case can be used for detailed debugging.

But if you can provide a good enough argument, I might be able to convince the other organisers to allow per-frame metrics to be provided. While I would refrain from updating the evaluation interface, I would be able to download the predictions for any given submission and compute the per-frame metrics locally.

Sincerely, Tom

Re: Notes on empty predicted masks  

  By: Cedric on June 4, 2025, 9:14 p.m.

Dear Tom,

Thank you for your detailed response.

No problem at all—we fully understand the reasoning behind restricting access to per-frame metrics for the test set to avoid encouraging overfitting. Our request was primarily driven by curiosity, as we were interested in checking whether our model's performance tends to degrade progressively over the temporal sequence.

That said, we’ll conduct our checks locally using the labeled data as suggested. Thanks again for your support and for the well-organized challenge.
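
For example, one purely illustrative way to quantify such a temporal trend is the least-squares slope of the per-frame Dice scores against the frame index:

import numpy as np

def dice_trend(dice_per_frame):
    # Least-squares slope of Dice vs. frame index; a clearly negative value
    # would suggest that segmentation quality degrades over the sequence.
    dice_per_frame = np.asarray(dice_per_frame, dtype=float)
    frame_idx = np.arange(len(dice_per_frame))
    slope, _intercept = np.polyfit(frame_idx, dice_per_frame, deg=1)
    return slope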

Best regards, Cédric