Too many open files error when training in loop

Question

I have a pretty simple setup like this:

while True:
   model.fit(mySeqGen(xx), ..., use_multiprocessing=True, workers=8)
   <stuff>
   model.save_weights(fname)
   gc.collect()

This runs for a long time, but if I leave it overnight I find it raising OSError: [Errno 24] Too many open files on every loop iteration. The full stack trace is on another machine, but it has multiple references to multiprocessing.

This is surely not related to actual files, but rather a byproduct of threads being created under the hood and not cleaned up properly. Is there a simple way to make this loop stable over the long run, so that it cleans up after itself on each pass?

Answer 1

It may be a system limitation.

Enter the following command:

$ ulimit -n

1024

The result is 1024, which means that each process started from this shell can have at most 1024 file descriptors open at the same time.
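The same limit can also be read from inside the training script, which shows what the Python process itself is subject to. A minimal sketch using the standard-library resource module (UNIX only); the variable names are only illustrative:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors,
# i.e. the value that `ulimit -n` reports for the shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit={soft} hard limit={hard}")
```

The soft limit is the one that triggers Errno 24; the hard limit is the ceiling up to which an unprivileged process may raise it.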

How to fix it:

1. Reduce the number of workers so that the process stays under this limit (one option).
2. Increase the limit.

a. Run ulimit -n 2048 (this change is temporary: it applies only to the current shell session, and the original setting is restored after you exit).

b. Edit the following file:

sudo vim /etc/security/limits.conf

* soft nofile 2048

* hard nofile 2048

Restart (or log out and back in) after saving.
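The temporary variant can also be applied from within the script before training starts. A minimal sketch, assuming a UNIX system; raise_nofile_soft_limit is a hypothetical helper name, and 2048 simply mirrors the value used above:

```python
import resource


def raise_nofile_soft_limit(target=2048):
    """Raise the soft open-file limit toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot exceed the hard limit.
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print(raise_nofile_soft_limit(2048))
```

Child processes inherit the new limit. If descriptors leak on every iteration, though, a higher limit only postpones the error rather than removing its cause.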

Setting other limits to unlimited:

Data segment size: ulimit -d unlimited

Maximum resident memory size: ulimit -m unlimited

Stack size: ulimit -s unlimited

Answer 2

You can monitor the number of threads, connections and files opened by the process:

import gc

import psutil

p = psutil.Process()
while True:
    model.fit(mySeqGen(xx), ..., use_multiprocessing=True, workers=8)
    <stuff>
    model.save_weights(fname)
    gc.collect()
    # Double quotes outside so the nested 'tcp' literal parses on Python < 3.12
    print(f"files={len(p.open_files())} conn={len(p.connections(kind='tcp'))} threads={p.num_threads()}")

and at least you would know what problem to solve.
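One caveat: psutil's open_files() lists only regular files, so pipes and sockets created by multiprocessing workers would not show up in that count, although they do count against the ulimit -n limit. A minimal sketch of counting every descriptor instead, assuming a UNIX system; count_fds is a hypothetical helper, and the /proc/self/fd fallback is Linux-specific:

```python
import os

try:
    import psutil  # third-party; optional in this sketch
except ImportError:
    psutil = None


def count_fds():
    """Return the number of file descriptors open in this process."""
    if psutil is not None:
        # num_fds() counts all descriptors (files, pipes, sockets); UNIX only.
        return psutil.Process().num_fds()
    # Linux fallback: one entry per open descriptor.
    return len(os.listdir("/proc/self/fd"))


# A pipe consumes two descriptors but is not a regular file.
before = count_fds()
read_end, write_end = os.pipe()
during = count_fds()
os.close(read_end)
os.close(write_end)
print(f"before={before} with pipe={during} after close={count_fds()}")
```

Printing count_fds() once per loop iteration should show whether the number keeps growing between calls to model.fit.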
