Dear organizers,

Are there any differences in the evaluation methods, i.e., how the metrics are calculated, between Phase 1 and Phase 2? I understand that the data used in Phase 2 is identical to that used in Phase 1, which consists of 20 validation cases. To check my Phase 2 results, I ran my Docker container locally and generated the predictions. I then compared my Phase 1 and Phase 2 results locally and confirmed that they were identical. However, the results reported by the submission system for Phase 1 and Phase 2 were very different. Can you think of anything that might have caused this?