One more point: I know it may be too late to change the methodology for calculating the final ranking. However, I would suggest scoring each statistic (mode / variance / skewness) separately and then applying the weights (0.6, 0.25, 0.15) to combine the per-statistic scores into the final ranking. I quickly checked the values for the three available submissions:
1) For Dice Score (P4):
- lWM: Contribution of variance: 1081.31, Contribution of mode: 0.564, Contribution of skewness: 0.267
- 坤坤kk: Contribution of variance: 293.29, Contribution of mode: 0.556, Contribution of skewness: 13.14
- tpvagenas: Contribution of variance: 374.21, Contribution of mode: 0.546, Contribution of skewness: 1.384
2) For HD95 (P3):
- lWM: Contribution of variance: 0.0044, Contribution of mode: 0.383, Contribution of skewness: 0.205
- 坤坤kk: Contribution of variance: 0.016, Contribution of mode: 0.299, Contribution of skewness: 0.820
- tpvagenas: Contribution of variance: 0.0082, Contribution of mode: 0.099, Contribution of skewness: 0.368
This shows that for the Dice Score the variance dominates everything else (its contribution is several orders of magnitude higher), while for HD95 the contribution of variance is practically nonexistent and the skewness of the distribution matters most. Moreover, the direction of the skewness (right or left) currently makes no difference, although it should matter for both P3 and P4.
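To make the suggestion concrete, here is a rough sketch of what I mean by scoring the statistics separately. It ranks the submissions on each statistic on its own and only then applies the weights, so the raw magnitude of one statistic can no longer swamp the others. Several things here are my assumptions and not the organizers' method: the mapping of the weights to mode = 0.6, variance = 0.25, skewness = 0.15, the direction of each statistic for Dice (higher mode is better, lower variance and lower skewness are better), and the helper names `rank_by` and `weighted_rank`. As input I simply reuse the Dice Score contributions listed above, since a rank within one statistic does not change under a positive constant factor.

```python
# Assumed mapping of the weights (0.6, 0.25, 0.15) to the statistics.
WEIGHTS = {"mode": 0.6, "variance": 0.25, "skewness": 0.15}

# Assumed directions for the Dice Score; these would differ for HD95.
HIGHER_IS_BETTER = {"mode": True, "variance": False, "skewness": False}


def rank_by(values, higher_is_better):
    """Return 1-based ranks (1 = best); tied values share the best rank."""
    ordered = sorted(values.values(), reverse=higher_is_better)
    return {team: ordered.index(v) + 1 for team, v in values.items()}


def weighted_rank(stats):
    """Rank each statistic separately, then combine the ranks with WEIGHTS."""
    teams = list(next(iter(stats.values())))
    per_stat = {s: rank_by(stats[s], HIGHER_IS_BETTER[s]) for s in WEIGHTS}
    return {t: sum(WEIGHTS[s] * per_stat[s][t] for s in WEIGHTS) for t in teams}


# Dice Score (P4) contributions from the list above.
dice = {
    "mode": {"lWM": 0.564, "坤坤kk": 0.556, "tpvagenas": 0.546},
    "variance": {"lWM": 1081.31, "坤坤kk": 293.29, "tpvagenas": 374.21},
    "skewness": {"lWM": 0.267, "坤坤kk": 13.14, "tpvagenas": 1.384},
}

scores = weighted_rank(dice)  # lower combined score = better
for team, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{team}: {score:.2f}")
```

With this scheme every statistic contributes at most its weight times the number of submissions, regardless of its units. Treating lower skewness as better is only a placeholder; a sign-aware rule for skewness would replace that entry in `HIGHER_IS_BETTER` with whatever direction is appropriate for P3 and P4.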