Code for evaluation scores  

  By: lWM on Aug. 2, 2023, 9:39 a.m.

Hello,

Could you please share the source code used to evaluate the cut-off submissions (I mean: the functions you use to calculate mode, variance and skewness)?

I would like to check them before submitting the final attempt. Unfortunately, the scores I calculated offline differ significantly from the ones presented on the leaderboard (it seems I should have a lower score in P3 and a higher one in P4).
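For context, here is a minimal sketch of how these three statistics can be computed with standard numpy/scipy estimators; the challenge's actual evaluation code may differ, which is exactly what I would like to verify:

```python
# A minimal sketch, assuming standard numpy/scipy estimators; the
# challenge's actual evaluation functions may differ.
import numpy as np
from scipy import stats

def distribution_stats(scores):
    """Mode, variance and skewness of a 1-D array of per-case scores."""
    scores = np.asarray(scores, dtype=float)
    # For continuous scores the mode has to be estimated, e.g. as the
    # peak of a kernel density estimate (one choice among several).
    kde = stats.gaussian_kde(scores)
    grid = np.linspace(scores.min(), scores.max(), 512)
    mode = grid[np.argmax(kde(grid))]
    variance = np.var(scores, ddof=0)         # biased; ddof=1 for unbiased
    skewness = stats.skew(scores, bias=True)  # biased; bias=False to correct
    return mode, variance, skewness

# Hypothetical per-case Dice scores:
print(distribution_stats([0.74, 0.81, 0.79, 0.68, 0.77]))
```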

Bests,


Re: Code for evaluation scores  

  By: giansteve on Aug. 2, 2023, 11:03 a.m.

Hi,

We noticed a technical problem in the evaluation algorithm and are working on solving it. Submissions to Phase 2 are currently closed (only for a few hours).

Most importantly, we will try to update the leaderboard as soon as possible. Meanwhile, I am pasting here your results for P3 and P4, which are considered valid for Phase 2: P3: 0.5941, P4: 1082.13.

We thank you for your understanding. I am available for further questions.

Best Gian Marco

Re: Code for evaluation scores  

  By: lWM on Aug. 2, 2023, 12:49 p.m.

Thanks for your response. Indeed, the P3 I calculated offline was quite similar (just slightly lower, ~0.57; I suppose the difference is due to the biased vs. unbiased std calculation). However, the P4 was beyond 1000, and that made me wonder whether I understand the ranking method correctly.
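To illustrate the biased/unbiased difference I mean (with hypothetical numbers):

```python
import numpy as np

x = np.array([0.57, 0.61, 0.59, 0.63, 0.55])  # hypothetical per-case scores
print(np.std(x, ddof=0))  # biased estimator, divides by N
print(np.std(x, ddof=1))  # unbiased estimator, divides by N-1 (slightly larger)
```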

By the way - which attempt will be used for the final ranking? The better one or the last one?

By the way 2 - one can theoretically win by submitting a fake submission with Dice equal to 0 and a huge HD for every case (e.g. by placing one positive voxel in a corner of each prediction). This results in a standard deviation equal or close to 0 and an almost infinite score - just saying :)
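To make this concrete, assuming the score contains a 1/variance-like term (which the leaderboard values suggest), such a degenerate submission gives:

```python
import numpy as np

# Hypothetical "fake" submission: one positive voxel in a corner of
# every prediction, so each case gets Dice = 0 and the same large HD.
hd95 = np.array([173.2] * 5)   # identical per-case HD values
print(np.var(hd95))            # 0.0 -> zero spread across cases
# If the score contains a 1/variance-like term (an assumption based on
# the leaderboard numbers), this blows up:
with np.errstate(divide="ignore"):
    print(np.divide(1.0, np.var(hd95)))  # inf
```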


Re: Code for evaluation scores  

  By: lWM on Aug. 2, 2023, 1:55 p.m.

And just another point - I know it may be a bit too late to change the methodology for calculating the final ranking. However, I would suggest considering scoring each statistic (mode / variance / skewness) separately and then applying weights (0.6, 0.25, 0.15) to accumulate the final ranking from the per-statistic ranks (a sketch follows below). I quickly checked the values for the three available submissions:

1) For Dice Score (P4):
- lWM: contribution of variance: 1081.31, contribution of mode: 0.564, contribution of skewness: 0.267
- 坤坤kk: contribution of variance: 293.29, contribution of mode: 0.556, contribution of skewness: 13.14
- tpvagenas: contribution of variance: 374.21, contribution of mode: 0.546, contribution of skewness: 1.384

2) For HD95 (P3):
- lWM: contribution of variance: 0.0044, contribution of mode: 0.383, contribution of skewness: 0.205
- 坤坤kk: contribution of variance: 0.016, contribution of mode: 0.299, contribution of skewness: 0.820
- tpvagenas: contribution of variance: 0.0082, contribution of mode: 0.099, contribution of skewness: 0.368

This shows that for the Dice Score the variance basically dominates everything else (its contribution is several orders of magnitude higher), while for HD95 the contribution of the variance is essentially nonexistent and the skewness of the distribution matters most. Moreover, for the skewness it does not matter whether the distribution is right- or left-skewed (and that actually should matter, for both P3 and P4).
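A sketch of the rank-then-weight scheme I have in mind (hypothetical helper and numbers; the direction of each ranking and the handling of the skewness sign are design choices):

```python
import numpy as np
from scipy.stats import rankdata

def weighted_rank(modes, variances, skews, weights=(0.6, 0.25, 0.15)):
    """Rank the teams on each statistic separately, then combine the
    per-statistic ranks with the given weights (lower total = better).

    Assumed directions for a Dice distribution: higher mode is better,
    lower variance is better, and |skewness| closer to 0 is better
    (the sign could be taken into account instead, as argued above).
    """
    r_mode = rankdata(-np.asarray(modes))
    r_var = rankdata(np.asarray(variances))
    r_skew = rankdata(np.abs(np.asarray(skews)))
    w_mode, w_var, w_skew = weights
    return w_mode * r_mode + w_var * r_var + w_skew * r_skew

# Hypothetical per-team statistics of the Dice distributions:
print(weighted_rank(modes=[0.78, 0.74, 0.76],
                    variances=[0.0009, 0.0034, 0.0027],
                    skews=[0.27, -1.31, 0.52]))
```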

Re: Code for evaluation scores  

  By: giansteve on Aug. 2, 2023, 2:26 p.m.

Thank you for all your BTWs and the detailed descriptions :)

BTW1) The updated values reported in the NEWS section will be used for the final ranking, hence "the last ones". We will try to update the leaderboard without requiring you to submit again.

BTW2) One theoretically can, and hopefully nobody will be inspired by your message. However, we always check the results to avoid "deep fakes" ;)

BTW3) Thank you for the calculations ;) I noticed this too, but it was indeed already too late, at least for this phase of the challenge. The main idea was to merge the probability distributions of HD and DSC into two metrics, P3 and P4 (rather than six separate ones). An alternative to P3 and P4 would have been to use a statistical distance (e.g., the Kullback-Leibler divergence), but as already said, it was too late for this phase. We will consider your suggestions, take action if possible, and inform you all on the Grand-Challenge website in case of modifications to the algorithm.
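For illustration only, such a statistical distance could be computed along these lines (a rough sketch with hypothetical inputs, not the challenge's implemented algorithm):

```python
import numpy as np
from scipy.stats import entropy

def kl_to_reference(scores, reference, bins=20, rng=(0.0, 1.0)):
    """KL divergence between two score samples via shared histograms.

    A rough sketch, not the implemented metric; a small epsilon keeps
    empty bins from producing infinities.
    """
    p, _ = np.histogram(scores, bins=bins, range=rng, density=True)
    q, _ = np.histogram(reference, bins=bins, range=rng, density=True)
    eps = 1e-9
    return entropy(p + eps, q + eps)  # D_KL(P || Q)

# Hypothetical submitted vs. ideal Dice distributions:
gen = np.random.default_rng(0)
submitted = gen.normal(0.75, 0.05, 200).clip(0, 1)
ideal = gen.normal(0.85, 0.03, 200).clip(0, 1)
print(kl_to_reference(submitted, ideal))
```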

Thank you again. Best, Gian Marco

Re: Code for evaluation scores  

  By: lWM on Aug. 2, 2023, 2:43 p.m.

Thanks for the clarification!

Regarding BTW1) - by "the last one" I mean: we have two attempts, so which of them will be used as the final one in the cut-off phase? The last submitted one, or the better of the two?

Regarding BTW3) - Yep, I fully understand that it may be a bit too late to change the game rules, and I am not insisting on anything (actually, the current formulation is quite good for my P4). However, let's assume a completely realistic submission with the following scores:

Dice: [0.752, 0.762, 0.743, 0.765, 0.735]
HD95: [9.02, 10.23, 10.13, 10.62, 9.02]

Such a submission would be ranked 1st in both P3 (1.308) and P4 (1570.74), and that does not make much sense. Thus, with the current version, the final ranking will be essentially random, since in practice only the variance matters for Dice and only the skewness for HD95.
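The numbers behind this claim, computed with standard numpy/scipy estimators (which may differ slightly from the evaluation code):

```python
import numpy as np
from scipy.stats import skew

dice = np.array([0.752, 0.762, 0.743, 0.765, 0.735])
hd95 = np.array([9.02, 10.23, 10.13, 10.62, 9.02])

print(np.var(dice))  # ~1.3e-4: tiny, so any 1/variance-like term explodes
print(skew(hd95))    # negative (left-skewed), but |skewness| rewards it anyway
```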

Bests,

Re: Code for evaluation scores  

  By: giansteve on Aug. 2, 2023, 2:49 p.m.

Thanks for this. We will discuss the metrics and find a solution to these "biases". As for the ranking, as stated on the ranking page, the first 3 users on the leaderboard will pass to Phase 3 of the challenge. Therefore, the best attempts will always be ranked higher (assuming a valid metric system) and will be considered final in the cut-off phase.

Best Gian

Re: Code for evaluation scores  

  By: lWM on Aug. 2, 2023, 2:52 p.m.

Thanks for the clarification regarding the final submission - I will submit my last attempt once the ranking system has been "reconsidered" :)

Looking forward to any decisions related to the ranking system.

Bests,
