Multiprocessing write to large file out of memory
I have the following script, which works well for writing smaller datasets out to a file but eventually runs out of memory when processing and writing larger datasets. Some file sizes will be 60 GB+.
import json
import time
from multiprocessing import Pool, cpu_count

# layer (an OGR layer), out_dir and fc_name are defined at module level (omitted here)

def do_work(index):
    ref_feature = layer.GetFeature(index)
    if ref_feature:
        try:
            return ref_feature.ExportToJson(as_object=True)
        except Exception as e:
            pass
    return None

def run_mp():
    # empty file contents
    open(f"{out_dir}/{fc_name}.geojsonseq", "w", encoding='utf8').close()
    # initiate multiprocessing
    pool = Pool(cpu_count())
    fc = layer.GetFeatureCount()
    resultset = pool.imap_unordered(do_work, range(fc), chunksize=1000)
    # this part is done after all results are ready, resulting in huge
    # memory usage until the results are written
    with open(f"{out_dir}/{fc_name}.geojsonseq", 'a') as file:
        for obj in resultset:
            file.write(f"\x1e{json.dumps(obj)}\n")

if __name__ == '__main__':
    seg_start = time.time()
    run_mp()
    print(f' completed in {time.time() - seg_start}')
Question:
Is there a way to stream the results directly out to a file, without building them all up in memory and dumping them to the file at the end?
Since imap_unordered doesn't apply any back pressure to the worker processes, I suspect the results are backing up in the internal results queue of IMapUnorderedIterator. If that's the case, you have three options:

First, return f"\x1e{json.dumps(obj)}\n" from your workers rather than dumping in the main process, so the parent loop does nothing but write.
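A minimal sketch of that change, assuming the question's module-level layer, out_dir and fc_name:

import json
from multiprocessing import Pool, cpu_count

# Option 1 as a sketch: serialize in the worker so the parent only writes strings.
# layer, out_dir and fc_name are assumed to be the question's module-level globals.

def do_work(index):
    ref_feature = layer.GetFeature(index)
    if ref_feature:
        try:
            obj = ref_feature.ExportToJson(as_object=True)
            # build the RS-delimited GeoJSON line here, in the worker
            return f"\x1e{json.dumps(obj)}\n"
        except Exception:
            pass
    return None

def run_mp():
    with Pool(cpu_count()) as pool, \
            open(f"{out_dir}/{fc_name}.geojsonseq", "w", encoding="utf8") as file:
        lines = pool.imap_unordered(do_work, range(layer.GetFeatureCount()), chunksize=1000)
        for line in lines:
            if line is not None:  # skip features that failed to export
                file.write(line)

Everything heavy now happens in the pool; the parent just drains strings to disk.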
If that doesn't work: json.dump directly into a file object in the workers, rather than returning results at all. Alternatively you could guard worker writes to the same file with a multiprocessing.Lock.
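A minimal sketch of the Lock variant, again assuming the question's globals; the lock is handed to each worker through the Pool initializer, since it can't be sent through the task queue:

import json
from multiprocessing import Pool, Lock, cpu_count

# Option 2 as a sketch: each worker writes its own result, serialized by a shared
# multiprocessing.Lock. layer, out_dir and fc_name are the question's globals.

def init_worker(lock):
    global write_lock
    write_lock = lock

def do_work(index):
    ref_feature = layer.GetFeature(index)
    if not ref_feature:
        return
    try:
        obj = ref_feature.ExportToJson(as_object=True)
    except Exception:
        return
    with write_lock:
        # append mode plus the lock keeps concurrent writers from interleaving
        with open(f"{out_dir}/{fc_name}.geojsonseq", "a", encoding="utf8") as file:
            file.write(f"\x1e{json.dumps(obj)}\n")

def run_mp():
    open(f"{out_dir}/{fc_name}.geojsonseq", "w", encoding="utf8").close()  # truncate
    lock = Lock()
    with Pool(cpu_count(), initializer=init_worker, initargs=(lock,)) as pool:
        # the iterator now yields only None, so nothing piles up in the parent
        for _ in pool.imap_unordered(do_work, range(layer.GetFeatureCount()), chunksize=1000):
            pass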
If the extra writes are too time consuming: use Pool.apply_async or ProcessPoolExecutor.submit to start cpu_count jobs and only submit additional work after writing a result to disk.
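A minimal sketch of that pattern with ProcessPoolExecutor (the same idea works with Pool.apply_async); do_work here is the question's original version that returns the exported object, and workers are assumed to inherit layer via fork, as the question's code does:

import itertools
import json
from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait
from multiprocessing import cpu_count

# Option 3 as a sketch: manual back pressure. Keep at most cpu_count() jobs in
# flight, and write each finished result before submitting the next index.

def run_mp():
    indices = iter(range(layer.GetFeatureCount()))
    with ProcessPoolExecutor(cpu_count()) as pool, \
            open(f"{out_dir}/{fc_name}.geojsonseq", "w", encoding="utf8") as file:
        # prime the executor with one job per core
        pending = {pool.submit(do_work, i) for i in itertools.islice(indices, cpu_count())}
        while pending:
            # block until at least one job finishes
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                obj = future.result()
                if obj is not None:
                    file.write(f"\x1e{json.dumps(obj)}\n")
                # only now submit a replacement job, so only a bounded number of
                # unwritten results ever exists -- the back pressure imap_unordered lacks
                for i in itertools.islice(indices, 1):
                    pending.add(pool.submit(do_work, i))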
作业,并且仅在将结果写入磁盘后提交额外的工作。 It's less automatic than Pool.imap_unordered
but that's the kind of thing you have to deal with when you're data starts getting big!Pool.imap_unordered
那样自动化,但是当你的数据开始变大时,这是你必须处理的事情!