Submission Failure: Error Analysis and Possibility of Modifying requirements.txt

Submission Failure: Error Analysis and Possibility of Modifying requirements.txt  

  By: chpark on April 14, 2025, 4:24 a.m.

I have built an initial version of the training pipeline and successfully registered an algorithm for the submission sanity check. However, the submission failed with the message: "The algorithm failed on one or more cases." (Prior to submission, I verified that the algorithm runs successfully with do_test_run.sh in the provided environment, without any issues.)

I suspect the most likely cause is that, when building the Docker image, I modified the provided requirements.txt file to match the library versions used in my local training environment, as shown below:

--extra-index-url https://download.pytorch.org/whl/cu118
torch==2.4.1
torchvision==0.19.1
numpy==2.2.4
scipy==1.15.2
pandas==2.2.3
SimpleITK==2.4.1
tqdm

In this regard, I would like to ask the following:

1) Is it possible to obtain a more detailed error log or analysis to identify the exact cause of the failure?

2) Am I allowed to modify the pinned versions in the provided requirements.txt file, i.e., is there any flexibility to adjust specific library versions if necessary?

I would appreciate your guidance on this matter. Please let me know if you need any additional information from my side.

Thank you.


Re: Submission Failure: Error Analysis and Possibility of Modifying requirements.txt  

  By: drepeeters on April 14, 2025, 6:41 a.m.

Hi Chpark,

You can get a more detailed error log on the results page of your algorithm. You can find the results of your latest submission here.

The latest submission fails with the following error: RuntimeError: Attempting to deserialize object on CUDA device 4 but torch.cuda.device_count() is 1. This happens when a model checkpoint was saved on a machine with multiple GPUs, while in our submission environment you only have access to one GPU (cuda:0). In this scenario, PyTorch cannot automatically map the checkpoint's tensors to an available device.

You could modify the line where the checkpoint is loaded to pass map_location, for example:
# Remap the checkpoint's tensors onto the single GPU available in the submission environment
ckpt = torch.load(os.path.join(self.model_root, self.model_name, "model.pth"), map_location="cuda:0")
Please check the PyTorch documentation for specific information on the use of map_location here.
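As a general note, a minimal sketch of the same idea that also runs on a CPU-only machine could look like the following (model_path here is just a placeholder; substitute the actual checkpoint path used in your container):

import torch

# Hypothetical checkpoint path; replace with the actual path used in your algorithm.
model_path = "model.pth"

# Use the first visible GPU if one is available, otherwise fall back to CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# map_location remaps every tensor in the checkpoint (saved on, e.g., cuda:4) onto `device`,
# so loading no longer depends on the GPU layout of the training machine.
ckpt = torch.load(model_path, map_location=device)

This way the same loading code works both locally with do_test_run.sh and in the submission environment.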

Regarding the changes in requirements.txt, I do not see an issue with adjusting specific library versions. The provided requirements.txt just ensures compatibility with our baseline algorithms.

Hopefully this answers your questions. Feel free to reach out again if necessary.

Kind regards,
Dre Peeters

Re: Submission Failure: Error Analysis and Possibility of Modifying requirements.txt  

  By: chpark on April 14, 2025, 7:22 a.m.

Thank you very much for your kind and detailed response.

I would like to let you know that I am unable to access the first link ("here") in your reply, i.e. the link for checking the detailed error log on the results page. It seems I do not have the necessary permissions to view that page.

To handle similar issues more efficiently in future submissions, could you let me know whether there is any way for me to directly access or download the detailed error logs for my submissions?

Additionally, I appreciate your guidance regarding the use of map_location when loading the checkpoint. I will proceed based on your suggestions.

Thank you for your support.

Best regards, Changhyun

Re: Submission Failure: Error Analysis and Possibility of Modifying requirements.txt  

  By: chpark on April 14, 2025, 9:01 a.m.

I just learned how to use the Algorithms and Results pages, and I was able to check the details regarding the issue I asked about. Thank you for your support.