some question about baseline

  By: Hikaryy on Aug. 30, 2022, 1:26 a.m.

I downloaded the baseline on Linux and selected one of the networks to run. During training, the loss becomes NaN. Has anyone else run into this? I'm still working on it. Setting the learning rate to 0 doesn't solve the problem.

Re: some question about baseline  

  By: saqibali on Aug. 30, 2022, 4:07 a.m.

Hey,

There are a couple of things to consider when the loss goes NaN in the baseline.

  • Learning rate is too high. Try using 5e-5.
  • Network loss is fluctuating quite a bit.
  • Try using gradient clipping.

Hope that helps.
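
A minimal sketch of the learning-rate and gradient-clipping suggestions, assuming a plain PyTorch training step. The model, data, and the max_grad_norm value are hypothetical stand-ins, not the baseline's actual code:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in model and data; the baseline's network and loader would go here.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lower learning rate
loss_fn = nn.MSELoss()

inputs = torch.randn(4, 8)
targets = torch.randn(4, 1)

max_grad_norm = 1.0  # assumed threshold, tune for your setup

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()

# Rescale gradients in place so their global L2 norm is at most max_grad_norm.
# The return value is the norm measured before clipping, which is handy for logging.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
```

Note that clipping only bounds finite gradients. If the inputs already contain NaN, the loss will still be NaN.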

Re: some question about baseline  

  By: Hikaryy on Aug. 30, 2022, 6:46 a.m.

My learning rate is already set to 0 and the problem still exists, so a learning rate that is too high should not be the cause. I have found that in the train.py file, NaN values appear in the input images of some batches after preprocessing. If you encounter the same problem, this may be a useful place to look.
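
A check along these lines could be sketched as follows, assuming the batch is a PyTorch tensor with the sample dimension first. The helper name find_nan_samples is hypothetical and not part of the baseline:

```python
from typing import List

import torch


def find_nan_samples(images: torch.Tensor) -> List[int]:
    """Return indices of samples in a batch (N, ...) that contain NaN or Inf."""
    flat = images.reshape(images.shape[0], -1)
    bad = ~torch.isfinite(flat).all(dim=1)
    return torch.nonzero(bad).flatten().tolist()


# Toy batch with one corrupted sample, standing in for a real training batch.
batch = torch.rand(4, 3, 8, 8)
batch[2, 0, 1, 1] = float("nan")

bad_indices = find_nan_samples(batch)
print(bad_indices)  # [2]
```

Calling this on each batch before the forward pass shows whether the NaN comes from the data rather than from the optimizer.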

 Last edited by: Hikaryy on Aug. 15, 2023, 12:57 p.m., edited 1 time in total.

Re: some question about baseline  

  By: junma on Aug. 30, 2022, 6:39 p.m.

My guess is that some images were not saved correctly when you ran the preprocessing.

Re: some question about baseline  

  By: gciano on Aug. 31, 2022, 7:47 a.m.

We had the same problem and solved it by changing the PyTorch version. In particular, we got NaN values with PyTorch 1.12.1 (Stable), whereas with version 1.8.2 (LTS) training was successful. Even in the prediction phase we got different results with different PyTorch versions. I therefore suggest checking whether changing the version fixes the problem for you. I hope this helps.

Re: some question about baseline  

  By: junma on Sept. 2, 2022, 8:18 p.m.

Thanks for sharing:)

Re: some question about baseline  

  By: lmais on Sept. 13, 2022, 7:24 a.m.

Hi, I had the same problem. Unfortunately, changing the PyTorch version didn't work for me, but thank you for sharing. I got rid of the NaN values by removing RandHistogramShiftd from the data augmentation part. Maybe this helps someone else who encounters the issue.
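
One way to find out which augmentation step introduces the NaN values is to apply the transforms one at a time and check the output after each. This is only a sketch: the two transforms below are hypothetical stand-ins, not MONAI transforms, and the division-by-zero failure shown is an illustration, not a claim about why RandHistogramShiftd fails.

```python
import torch


def scale_intensity(img):
    # Hypothetical transform that is always safe.
    return img * 2.0


def min_max_normalize(img):
    # Hypothetical transform: produces NaN (0/0) when the image is constant.
    return (img - img.min()) / (img.max() - img.min())


def first_nan_transform(img, transforms):
    """Apply transforms in order and return the name of the first one
    whose output contains non-finite values, or None if all are fine."""
    for t in transforms:
        img = t(img)
        if not torch.isfinite(img).all():
            return t.__name__
    return None


pipeline = [scale_intensity, min_max_normalize]
constant_image = torch.full((1, 8, 8), 0.5)

culprit = first_nan_transform(constant_image, pipeline)
print(culprit)  # min_max_normalize
```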

Re: some question about baseline  

  By: trinhvg on Sept. 14, 2022, 8:40 p.m.

I got this message during preprocessing: "OME series is BinaryOnly, not an OME-TIFF master file." With the default settings, the baseline's loss becomes NaN after ~20 epochs. I don't know whether this message is the reason.

Re: some question about baseline  

  By: liuyanice098 on Sept. 16, 2022, 6:38 a.m.

I ran into the same issue: the loss became NaN during training. One quick fix I tried was to set squared_pred = False in DiceCELoss. Another was to pass a larger value such as smooth_dr = 1e-5 to prevent division by zero. Neither fix worked.
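
For context, here is a sketch of the division by zero that a smoothing term in the denominator guards against. The soft_dice_loss function is written by hand for illustration and is not MONAI's implementation; only the parameter names smooth_nr and smooth_dr are borrowed from it.

```python
import torch


def soft_dice_loss(pred, target, smooth_nr=0.0, smooth_dr=0.0):
    """Hand-written soft Dice loss for illustration only."""
    intersection = (pred * target).sum()
    denominator = pred.sum() + target.sum()
    return 1.0 - (2.0 * intersection + smooth_nr) / (denominator + smooth_dr)


# Empty prediction and empty target: numerator and denominator are both zero.
empty_pred = torch.zeros(1, 1, 4, 4)
empty_target = torch.zeros(1, 1, 4, 4)

unsmoothed = soft_dice_loss(empty_pred, empty_target)  # 0/0 gives NaN
smoothed = soft_dice_loss(empty_pred, empty_target, smooth_nr=1e-5, smooth_dr=1e-5)

print(torch.isnan(unsmoothed).item())  # True
print(torch.isfinite(smoothed).item())  # True
```

Smoothing only helps when the NaN originates in the loss. If the inputs already contain NaN, it changes nothing.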

Re: some question about baseline  

  By: liuyanice098 on Sept. 16, 2022, 7:10 a.m.

@lmais

Hi, your method works for me. Thank you.

Re: some question about baseline  

  By: lmais on Sept. 19, 2022, 8:49 p.m.

If you use MONAI version 0.9.0, it should also work with RandHistogramShift.