python multiprocessing set memory per process
I am using Python to do some processing on text files and am running into MemoryErrors. Sometimes the file being processed is quite large, which means that a multiprocessing worker is using too much RAM.

Here is a snippet of my code:
    import multiprocessing as mp
    import os

    def preprocess_file(file_path):
        with open(file_path, "r+") as f:
            file_contents = f.read()
            # modify the file_contents
            # ...
            # overwrite file
            f.seek(0)
            f.write(file_contents)
            f.truncate()

    if __name__ == "__main__":
        with mp.Pool(mp.cpu_count()) as pool:
            pool_processes = []
            # for all files in dir
            for root, dirs, files in os.walk(some_path):
                for f in files:
                    pool_processes.append(os.path.join(root, f))
            # start the processes
            pool.map(preprocess_file, pool_processes)
I tried using the resource package to set a limit on how much RAM each process can use, as shown below, but this did not fix the problem and I still get MemoryErrors, which leads me to believe it is pool.map that is causing the issue. I would like each process to handle the exception individually so that the file can be skipped rather than crashing the whole program.
    import resource

    def preprocess_file(file_path):
        try:
            hard = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")  # total bytes of RAM in machine
            soft = (hard - 512 * 1024 * 1024) // mp.cpu_count()  # split between each cpu and save 512MB for the system
            resource.setrlimit(resource.RLIMIT_AS, (soft, hard))  # apply limit
            with open(file_path, "r+") as f:
                # ...
        except Exception as e:  # bad practice - should be more specific but just a placeholder
            # ...
How can I let a single process run out of memory while the other processes continue unaffected? Ideally I would like to catch the exception inside preprocess_file so that I can log exactly which file caused the error.

Edit: The preprocess_file function does not share data with any other process, so shared memory is not needed. The function also needs to read the entire file at once, because the file is reformatted in a way that cannot be done line by line.
Edit 2: The traceback of the program is below. As you can see, the error does not actually point to the file being processed, but comes from the package's own files.
Process ForkPoolWorker-2:
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 125, in worker
File "/usr/lib64/python3.6/multiprocessing/queues.py", line 341, in put
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 39, in __init__
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 130, in worker
File "/usr/lib64/python3.6/multiprocessing/queues.py", line 341, in put
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 39, in __init__
MemoryError
If a MemoryError is raised, the worker process may or may not be able to recover from the situation. If it can, then, as @Thomas suggested, catch the MemoryError somewhere.
    import multiprocessing as mp
    from time import sleep

    def initializer():
        # Probably set the memory limit here
        pass

    def worker(i):
        sleep(1)
        try:
            if i % 2 == 0:
                raise MemoryError
        except MemoryError as ex:
            return str(ex)
        return i

    if __name__ == '__main__':
        with mp.Pool(2, initializer=initializer) as pool:
            tasks = range(10)
            results = pool.map(worker, tasks)
            print(results)
If the worker cannot recover, the whole pool is unlikely to keep working. For example, change worker to force an exit:
    def worker(i):
        sleep(1)
        try:
            if i % 2 == 0:
                raise MemoryError
            elif i == 5:
                import sys
                sys.exit()
        except MemoryError as ex:
            return str(ex)
        return i
Then Pool.map never returns and blocks forever.
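Putting the answer's two pieces together for the question's preprocess_file, a sketch could look like the following (assumptions of mine: RLIMIT_AS is available, i.e. a POSIX system, and the 1 GiB soft limit and the throwaway demo file are placeholders, not from the original post). The worker catches MemoryError itself and returns (file_path, error), so the parent can log exactly which file failed, as the question asked:

```python
import multiprocessing as mp
import os
import resource
import tempfile

def initializer():
    # Placeholder budget: cap each worker's address space at 1 GiB.
    # RLIMIT_AS is POSIX-specific; this will not work on Windows.
    soft = 1024 * 1024 * 1024
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

def preprocess_file(file_path):
    """Return (file_path, error) so the parent can log failures."""
    try:
        with open(file_path, "r+") as f:
            file_contents = f.read()
            # ... reformat file_contents here ...
            f.seek(0)
            f.write(file_contents)
            f.truncate()
        return (file_path, None)
    except MemoryError:
        # Caught inside the worker: this file is skipped, the pool survives.
        return (file_path, "MemoryError")

if __name__ == "__main__":
    # Demo with one throwaway file; in the question this list comes from os.walk.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("some text\n")
    with mp.Pool(2, initializer=initializer) as pool:
        for path, error in pool.map(preprocess_file, [tmp.name]):
            print(path, "skipped:" if error else "ok", error or "")
    os.unlink(tmp.name)
```

Returning the error instead of re-raising it keeps the result picklable and small, which also sidesteps the queue/reduction path where the traceback above shows the MemoryError being raised.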