Available resources - RAM and multiprocessing  

  By: lWM on Feb. 22, 2024, 6:58 p.m.

Hello,

I would like to ask about the 64 GB of RAM mentioned for the medium/large-gpu environments.

Every time I try to run a job with jobman I get the following error: "RuntimeError: DataLoader worker is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit", even though the process should not consume more than ~10 GB of RAM.

Is multiprocessing blocked within jobman jobs? I worked around the error by setting num_workers to 0 so that only the main process loads the data, but that approach is incredibly slow.
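
For reference, this is roughly what the workaround looks like (a minimal sketch assuming a standard PyTorch DataLoader; the dataset and batch size are placeholders, not my actual setup):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder dataset; in the real job this is the actual training set.
    dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 2, (64,)))

    # num_workers=0 keeps all data loading in the main process, so no
    # worker-to-main transfer through shared memory (/dev/shm) is needed,
    # but the GPU ends up waiting on the loader.
    loader = DataLoader(dataset, batch_size=8, num_workers=0)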

Bests,

Re: Available resources - RAM and multiprocessing  

  By: alvaroparicio on Feb. 27, 2024, 7:31 a.m.

Hi, sorry for the late reply. There is no explicitly configured limit on the amount of memory in this case. What we see in this link is that "shared memory" probably refers to the /dev/shm partition, which in Docker containers is limited to 64 MB by default. We can try to increase it, depending on the other containers running on the Kubernetes node where the job is scheduled, but that adds some complexity. The other solution is to disable the dataloader's use of shared memory in your algorithm (dataloader._use_shared_memory = False, see an example).
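
A related option on the PyTorch side (a sketch of a documented alternative, not necessarily the exact flag mentioned above) is to switch the sharing strategy so worker processes pass tensors through files instead of /dev/shm:

    import torch.multiprocessing as mp

    # Tell PyTorch worker processes to share tensors through temporary
    # files on disk ("file_system") instead of /dev/shm, sidestepping the
    # small default shared-memory size inside the container.
    mp.set_sharing_strategy('file_system')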

On the other hand, you say that setting num_workers to 0 is incredibly slow. We see you are requesting the large-gpu flavor but using the default image "ubuntu-python:latest", which does not include the CUDA libraries, so you may not actually be using the GPU. You should include the argument "-i ubuntu-python:latest-cuda" in the jobman command line. Hope this helps!
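
To confirm whether the GPU is actually visible from inside the job, a quick check (a sketch, assuming the job runs PyTorch) is:

    import torch

    # With the non-CUDA image this prints False and everything runs on the CPU.
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))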

Re: Available resources - RAM and multiprocessing  

  By: lWM on Feb. 28, 2024, 1:26 p.m.

Hi,

It seems the problem is caused by some GIL-related blocking on the platform. I checked, and even though I use the "ubuntu-python:latest" image, my code correctly uses the GPU and the CUDA libraries.

In the end I managed to train the networks (they are lightweight); however, being able to load data with multiple worker processes would probably speed up the training significantly.

Bests,

Re: Available resources - RAM and multiprocessing  

  By: agaldran on March 3, 2024, 1:26 a.m.

Hello,

I believe I might be having the same problem, see screenshot. Any clue? Thanks!

Re: Available resources - RAM and multiprocessing  

  By: lWM on March 4, 2024, 7:55 p.m.

This is exactly the problem I encountered; the only fix that worked for me was setting the number of workers to 0.

Re: Available resources - RAM and multiprocessing  

  By: alvaroparicio on March 5, 2024, 7:41 a.m.

Hi, we have now increased the shared memory limit (i.e. /dev/shm) to approximately half of the RAM in each case: 14 GB for small-gpu; 30 GB for medium-gpu, large-gpu and no-gpu; and 3 GB for desktops. You can test again whether the dataloader workers are still killed.
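
To verify the new limit from inside a running job, one simple check (a sketch, assuming Python is available in the image) is:

    import shutil

    # Size of the shared-memory partition the dataloader workers will use;
    # after this change it should report roughly 14 GB or 30 GB depending on the flavor.
    total, used, free = shutil.disk_usage('/dev/shm')
    print(f"/dev/shm: {total / 2**30:.1f} GiB total, {free / 2**30:.1f} GiB free")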

Re: Available resources - RAM and multiprocessing  

  By: lWM on March 5, 2024, 9:18 a.m.

Thank you. It works now and training is noticeably faster.