[Cat. 2] Discrepancies Between Validation and Preliminary Test Scores  

  By: y-prudent on Sept. 5, 2024, 10 a.m.

Hello everyone,

I am currently working on the category 2 sub-challenge and have encountered some unexpected results that I would appreciate some insight on.

In our setup, we train on 265 videos and validate on 15 videos. We’ve made three submissions so far, and I’ve observed a significant discrepancy between the F1 scores on our validation set and the F1 scores from the preliminary test leaderboard. Specifically, our first submission, which performed the worst according to our validation set, achieved the best score on the preliminary test leaderboard (F1=0.739). Conversely, our third submission, which was clearly the best on our validation set, performed the worst on the preliminary leaderboard (F1=0.673).

To ensure the model was behaving as expected, we used the "try out algorithm" feature on one of our validation videos, and the results were consistent with our expectations. Has anyone else experienced similar discrepancies between validation and test evaluations? Could there be major hidden differences between the test and training videos (besides the 1 FPS difference)?

Additionally, I noticed that the Recall and Accuracy scores from the preliminary test were exactly the same across all our submissions. Could this imply that the classes are perfectly balanced in the test set?
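
(To spell out the reasoning behind that question: if the leaderboard's recall were macro-averaged, which is an assumption on my part, it would coincide with accuracy exactly when every class has the same support in the test set. Below is a minimal scikit-learn sketch with made-up generic labels A/B/C, not the official evaluation code.)

    from sklearn.metrics import accuracy_score, recall_score

    # Hypothetical, perfectly balanced ground truth: 3 samples per class.
    y_true = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
    # Arbitrary predictions, purely for illustration.
    y_pred = ["A", "A", "B", "B", "B", "A", "C", "C", "C"]

    acc = accuracy_score(y_true, y_pred)                        # 7/9 correct overall
    macro_rec = recall_score(y_true, y_pred, average="macro")   # mean of 2/3, 2/3, 3/3

    # With equal class supports, the unweighted mean of per-class recalls
    # reduces to the overall fraction of correct predictions, i.e. accuracy.
    print(f"accuracy={acc:.4f}, macro recall={macro_rec:.4f}")  # both 0.7778

(Of course, identical values would only be consistent with a balanced test set, not proof of it, so I may be over-reading the leaderboard numbers.)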

Thank you in advance for your feedback!

Best Regards, Yannick


PS: I’d like to clarify that the intention behind this topic is not to manipulate the metrics, but to understand how to select the best approach for the final test set. We currently face a dilemma: we have developed an algorithm that performs well in the operational context of full videos (end-to-end surgeries). In our opinion, this would be the most useful algorithm, but despite performing better on full, held-out videos (our validation set), it performs worse than our baseline in the preliminary tests.


Re: [Cat. 2] Discrepancies Between Validation and Preliminary Test Scores  

  By: aneeqzia_isi on Sept. 5, 2024, 6:53 p.m.

Hi Yannick,

Thanks for reaching out and posting your question here.

Please note that the prelim testing phases are mainly intended to ensure that teams have a working algorithm container, so that the final submission can be made without any issues. Performance on the prelim testing phases will most likely not be indicative of performance on the final testing phase, since the prelim set contains <5% of the videos in the final testing phase. It is up to each team to decide which algorithm to submit in the final phase, since a lower-performing algorithm may perform better in the final testing phase due to the difference in data size. Hope this gives some clarity, but please feel free to reach out with further questions.

Best, SurgVU team

Re: [Cat. 2] Discrepancies Between Validation and Preliminary Test Scores  

  By: y-prudent on Sept. 9, 2024, 11:36 a.m.

Hi,

Thank you for the clarification; this helps put things into perspective. It’s good to know that the preliminary test is focused more on verifying the container's functionality than on being indicative of final performance. The fact that the preliminary test only includes a small portion of the final dataset also explains some of the discrepancies we’re seeing.

We’ll keep this in mind as we decide which algorithm to submit for the final phase.

Best regards, Yannick