Hey @LuukBoulogne,
first of all, thank you for your time and effort in helping us to identify this issue.
We tried to reproduce this memory leak on multiple machines, sadly without any success.
Running the container locally on the training set resulted in a RAM usage of < 16 GB (~8-11 GB) for our TensorFlow model.
As @miriamelia already said, could you give us some more information, e.g. the exact error message or after how many samples the error occurs?
What we have tried so far:
- We observed that TensorFlow keeps piling up data after each predict() call
- To work around this, we now build the TF stack for each prediction in a separate process (see the first sketch below)
- This of course costs prediction time, but lets us clean up all leftover variables in RAM (garbage collection on TF graphs or clearing the session is sadly still not fixed by TF)
- Furthermore, we manually reduced the number of threads to 5 (instead of the TF default of grabbing every core thread it can get) to save more RAM and stay within the --pids-limit (see the second sketch below)
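For reference, here is a minimal sketch of the per-prediction process isolation, assuming a Keras model saved at `MODEL_PATH` and a preprocessed NumPy input. `MODEL_PATH`, `_predict_in_child`, and `predict_isolated` are hypothetical names for illustration, not our actual container code:

```python
import multiprocessing as mp

import numpy as np

MODEL_PATH = "/opt/algorithm/model"  # hypothetical path to the saved model


def _predict_in_child(sample, queue):
    """Runs entirely inside the child process: import TF, load the model, predict."""
    import tensorflow as tf  # imported here so the parent process never loads TF

    model = tf.keras.models.load_model(MODEL_PATH)
    queue.put(model.predict(sample[np.newaxis, ...]))


def predict_isolated(sample):
    """Spawn a fresh process for a single prediction; all TF state dies with it."""
    ctx = mp.get_context("spawn")  # "spawn" avoids inheriting any TF state
    queue = ctx.Queue()
    proc = ctx.Process(target=_predict_in_child, args=(sample, queue))
    proc.start()
    result = queue.get()  # fetch the result before join() so the queue never blocks
    proc.join()
    return result


if __name__ == "__main__":
    dummy = np.zeros((128, 128, 3), dtype=np.float32)  # dummy input, shape for illustration only
    print(predict_isolated(dummy).shape)
```

Since the child exits after every sample, the OS reclaims whatever TensorFlow has accumulated, at the cost of re-loading the model each time.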
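And the thread reduction, sketched here for the TF 2.x API (in TF 1.x the equivalent would be the intra/inter op options of tf.ConfigProto). These calls have to run before any TensorFlow op executes, e.g. at the top of the per-prediction child process:

```python
import tensorflow as tf

# Cap both TF thread pools at 5 instead of the default of one thread per available core.
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
```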
Also, thank you for the great organization of this challenge!
Cheers,
Dominik