Update on PUMA Challenge Scoring Issue

By: mschuiveling on March 13, 2025, 12:05 p.m.

Dear PUMA Challenge participants,

We've identified an issue with the way the nuclei F1 score is calculated on the Grand Challenge platform. Because the evaluation container processes samples individually, the displayed F1 score is the average of the per-sample F1 scores rather than a single F1 score calculated from the total TP, FP, and FN across all samples. While the averaged score is still a valid metric, it is not the intended method of evaluation.

To address this, we have decided on the following approach:

We will use both the currently displayed averaged F1 score and the F1 score computed using the total TP, FP, and FN across all samples to determine the top three participating teams.
The final results will be posted after the challenge deadline on Friday.
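
To illustrate how the two F1 calculations described above can differ, here is a minimal sketch. It is not the challenge's evaluation code: the per-sample counts are made up, the helper name `f1_from_counts` is hypothetical, and returning 0.0 for a sample with no detections and no ground truth is an assumed convention.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # Assumption: an empty sample scores 0.0; the real container may differ.
    return 2 * tp / denom if denom else 0.0

# Hypothetical (TP, FP, FN) counts for two samples: one dense, one sparse.
samples = [(90, 10, 10), (1, 3, 3)]

# Averaged F1: compute F1 per sample, then take the mean.
averaged_f1 = sum(f1_from_counts(*s) for s in samples) / len(samples)

# Summed F1: add up TP, FP, FN over all samples, then compute F1 once.
tp, fp, fn = (sum(counts) for counts in zip(*samples))
summed_f1 = f1_from_counts(tp, fp, fn)

print(f"averaged F1 = {averaged_f1:.3f}")  # 0.575
print(f"summed F1   = {summed_f1:.3f}")    # 0.875
```

In this toy case the sparse sample pulls the averaged score down much more than the summed score, because averaging weights every sample equally regardless of how many nuclei it contains.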

We are very sorry for the inconvenience and inconsistency this has led to. To ensure the integrity of the evaluation, we have also rechecked the rest of the evaluation process, which remains solid.

If you have any questions, please do not hesitate to contact us through a reply on this post or an email to m.schuiveling@umcutrecht.nl.

Kind regards,

On behalf of the PUMA challenge team,
Mark

Re: Update on PUMA Challenge Scoring Issue

By: wildsquirrel on March 13, 2025, 4:12 p.m.

Hi, Please can you confirm the final deadline?

In this thread you said the deadline is Friday, but on the Submit page I see "Submissions to this phase will automatically close at March 16, 2025, 11:59 a.m. (Europe/London)", which is Sunday.

On the Timeline page, it says the deadline is: March 15 23:59 (AoE), 2025.

Please can you confirm here? Thank you.

Re: Update on PUMA Challenge Scoring Issue

By: rictoo on March 13, 2025, 10:11 p.m.

I second wildsquirrel's question. On the Info page (and the forum thread) it was stated that the deadline was Saturday, March 15 23:59 (Anywhere on Earth).

Also when you say "We will use both the currently displayed averaged F1 score and the F1 score computed using the total TP, FP, and FN across all samples to determine the top three participating teams." - how exactly will you be using the averaged F1 score vs F1 computed across the samples? An average?

Re: Update on PUMA Challenge Scoring Issue

By: mschuiveling on March 14, 2025, 7:14 a.m.

Dear wildsquirrel and rictoo,

"Anywhere on Earth" (AoE) means that a deadline only counts as passed once it has passed everywhere on the planet. March 15, 23:59 AoE therefore corresponds to Sunday, March 16, 11:59 a.m. (Europe/London), ensuring that all participants have a fair chance to work on the challenge for one final day.

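The conversion can be checked with a short sketch. It assumes the usual definition of AoE as UTC-12, and it relies on Europe/London being on GMT (equal to UTC) in mid-March, before daylight saving time starts; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "Anywhere on Earth" is the fixed offset UTC-12.
AOE = timezone(timedelta(hours=-12), "AoE")

# Deadline as stated on the Timeline page: March 15 23:59 (AoE), 2025.
deadline_aoe = datetime(2025, 3, 15, 23, 59, tzinfo=AOE)

# Europe/London is on GMT (UTC+0) on this date, so UTC equals London time.
deadline_utc = deadline_aoe.astimezone(timezone.utc)

print(deadline_utc.strftime("%A %Y-%m-%d %H:%M %Z"))
# Sunday 2025-03-16 11:59 UTC
```
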
Scoring

To determine the top three teams in each track, we will use both F1 score calculations.
This means that for each track there will be two separate rankings, one per scoring method. The final results will be announced after the challenge deadline on Friday and will look like the following (for each track):

Ranking Based on Current Averaged F1 Score

Rank	Team	Avg. F1	Tissue Dice	Combined Mean Ranking of F1 and Tissue
1st	TBD	TBD	TBD	TBD
2nd	TBD	TBD	TBD	TBD
3rd	TBD	TBD	TBD	TBD

Ranking Based on Summed F1 Score

Rank	Team	Sum F1	Tissue Dice	Combined Mean Ranking of F1 and Tissue
1st	TBD	TBD	TBD	TBD
2nd	TBD	TBD	TBD	TBD
3rd	TBD	TBD	TBD	TBD
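
As a rough illustration of the "Combined Mean Ranking of F1 and Tissue" column, the sketch below ranks teams separately on each metric and averages the two ranks. The thread does not spell out the exact procedure, so this is an assumption; the team names and scores are invented, and letting tied scores share the best rank is also an assumed convention.

```python
def ranks(scores):
    """Map team -> rank (1 = best); higher score is better.

    Assumption: teams with equal scores share the best available rank.
    """
    ordered = sorted(scores.values(), reverse=True)
    return {team: ordered.index(score) + 1 for team, score in scores.items()}

# Invented scores for three hypothetical teams.
f1_scores = {"team_a": 0.80, "team_b": 0.78, "team_c": 0.75}
tissue_dice = {"team_a": 0.70, "team_b": 0.74, "team_c": 0.72}

f1_rank = ranks(f1_scores)
dice_rank = ranks(tissue_dice)

# Combined mean ranking: average of the two per-metric ranks (lower is better).
combined = {team: (f1_rank[team] + dice_rank[team]) / 2 for team in f1_scores}
leaderboard = sorted(combined, key=combined.get)

for team in leaderboard:
    print(team, combined[team])
```

Running the same procedure once with the averaged F1 and once with the summed F1 would give the two separate tables shown above.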

I am sorry if my earlier posts were unclear. Please let me know if you have further questions!

Good luck with the final days of the challenge.

Kind regards, Mark

Re: Update on PUMA Challenge Scoring Issue

By: agaldran on March 14, 2025, 4:01 p.m.

Hello Mark,

I was wondering whether the same considerations also apply to tissue segmentation. I guess GC is also computing per-image Dice scores, and I think the instructions were to use all pixels from all data to compute a macro-Dice, correct?

Cheers,

Adrian

Re: Update on PUMA Challenge Scoring Issue

By: mschuiveling on March 14, 2025, 5 p.m.

Hi Adrian,

You're absolutely right, so we double-checked to be sure. We're glad to confirm that this was accounted for during the container's creation. The mask is generated after processing all samples, meaning the tissue metric calculation remains as intended: the Dice score is computed on one large mask of 1024 × (1024 × the number of samples) pixels.
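
A small sketch of the difference between a per-image Dice average and a Dice score on one concatenated mask is given below. It is not the container's code: it uses tiny 2 × 2 binary tiles instead of 1024 × 1024 multi-class masks, the helper `dice` is hypothetical, and scoring two empty masks as 1.0 is an assumed convention.

```python
import numpy as np

def dice(pred, target):
    """Binary Dice: 2*|pred AND target| / (|pred| + |target|)."""
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    # Assumption: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0

# Two toy samples (prediction, ground truth) standing in for 1024 x 1024 tiles.
preds = [np.array([[1, 1], [1, 1]], dtype=bool),
         np.array([[1, 0], [0, 0]], dtype=bool)]
truths = [np.array([[1, 1], [1, 0]], dtype=bool),
          np.array([[0, 0], [0, 0]], dtype=bool)]

# Per-image approach: one Dice per sample, then the mean.
per_image_dice = float(np.mean([dice(p, t) for p, t in zip(preds, truths)]))

# Pooled approach: concatenate tiles side by side, giving H x (W * n_samples).
pooled_pred = np.concatenate(preds, axis=1)
pooled_truth = np.concatenate(truths, axis=1)
pooled_dice = float(dice(pooled_pred, pooled_truth))

print(pooled_pred.shape)          # (2, 4)
print(round(per_image_dice, 4))   # 0.4286
print(round(pooled_dice, 4))      # 0.75
```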

Best regards, Mark