[Cat. 2] Discrepancies Between Validation and Preliminary Test Scores
By: y-prudent on Sept. 5, 2024, 10 a.m.
Hello everyone,
I am currently working on the category 2 sub-challenge and have encountered some unexpected results that I would appreciate some insight on.
In our setup, we train on 265 videos and validate on 15 videos. We’ve made three submissions so far, and I’ve observed a significant discrepancy between the F1 scores on our validation set and the F1 scores from the preliminary test leaderboard. Specifically, our first submission, which performed the worst according to our validation set, achieved the best score on the preliminary test leaderboard (F1=0.739). Conversely, our third submission, which was clearly the best on our validation set, performed the worst on the preliminary leaderboard (F1=0.673).
To verify that the model was behaving as expected, we used the "try out algorithm" feature on one of our validation videos, and the results matched our expectations. Has anyone else experienced similar discrepancies between validation and test evaluations? Could there be major hidden differences between the test and train videos (besides the 1 FPS difference)?
Additionally, I noticed that for every one of our submissions, the Recall and Accuracy scores reported on the preliminary test set were exactly equal to each other. Could this imply that the classes are perfectly balanced in the test set?
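For context, my reasoning is that accuracy is the class-frequency-weighted mean of per-class recall, so it coincides with macro-averaged recall only when the classes are (roughly) balanced, whereas micro-averaged recall equals accuracy by construction regardless of balance. The minimal sketch below illustrates this with scikit-learn on hypothetical frame-level labels; which metrics and averaging the leaderboard actually uses is only my assumption.

```python
# Minimal sketch (assumes scikit-learn and frame-level phase labels;
# the averaging mode used by the leaderboard is my assumption).
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical frame-level ground truth and predictions for one video.
y_true = np.array([0, 0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 2, 0, 1, 0, 2])

acc = accuracy_score(y_true, y_pred)
# Micro-averaged recall equals accuracy in single-label multi-class problems,
# independently of the class distribution.
recall_micro = recall_score(y_true, y_pred, average="micro")
# Macro-averaged recall matches accuracy only when classes are balanced;
# here the classes are slightly imbalanced, so the two differ.
recall_macro = recall_score(y_true, y_pred, average="macro")
f1_macro = f1_score(y_true, y_pred, average="macro")

print(f"accuracy       = {acc:.3f}")
print(f"recall (micro) = {recall_micro:.3f}")  # equals accuracy by definition
print(f"recall (macro) = {recall_macro:.3f}")
print(f"F1 (macro)     = {f1_macro:.3f}")
```

So if the leaderboard happens to report a micro-averaged recall, the equality would tell us nothing about the test-set class distribution; if it is macro-averaged, it would indeed hint at balanced classes.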
Thank you in advance for your feedback!
Best Regards, Yannick
PS: I’d like to clarify that the intention behind this topic is not to try to manipulate the metrics, but to understand how to select the best approach for the final test set. We currently face a dilemma: we have developed an algorithm that performs well in the operational context of full videos (end-to-end surgeries), which in our opinion would be the most useful setting. However, despite performing better on full, held-out videos (our validation set), this algorithm scores worse than our baseline on the preliminary test set.