
Multiprocessing write to large file out of memory

I have the following script which works well for writing smaller datasets out to a file, but eventually runs out of memory when processing and writing larger datasets. Some file sizes will be 60 GB+.

import json
import time
from multiprocessing import Pool, cpu_count

# layer, out_dir and fc_name are module-level globals (not shown in the question)
def do_work(index):
    ref_feature = layer.GetFeature(index)
    if ref_feature:
        try:
            return ref_feature.ExportToJson(as_object=True)
        except Exception:
            pass
    return None


def run_mp():
    # empty file contents
    open(f"{out_dir}/{fc_name}.geojsonseq", "w", encoding='utf8').close()

    # initiate multiprocessing
    pool = Pool(cpu_count())
    fc = layer.GetFeatureCount()
    resultset = pool.imap_unordered(do_work, range(fc), chunksize=1000)

    # this part is done after all results are ready, resulting in huge memory storage until results are written
    with open(f"{out_dir}/{fc_name}.geojsonseq", 'a') as file:
        for obj in resultset:
            file.write(f"\x1e{json.dumps(obj)}\n")


if __name__ == '__main__':
    seg_start = time.time()
    run_mp()
    print(f' completed in {time.time() - seg_start}')

Question:

Is there a way to stream the results directly out to a file without building them up in memory and dumping them out to a file at the end?

Since imap_unordered doesn't apply any back pressure to the worker processes, I suspect the results are backing up in the internal results queue of IMapUnorderedIterator. If that's the case, you have three options:

  • Write the results faster in the main process. Try returning the string f"\x1e{json.dumps(obj)}\n" from your workers rather than dumping in the main process, so the main loop only has to write (see the first sketch after this list). If that doesn't work:
  • Write temporary files in the workers and concatenate them in a second pass in the main process. Workers will interfere with each other's writes if you try to have them all append to the final file simultaneously. You should be able to do this using minimal extra disk space. Note that you can json.dump directly into a file object. Alternatively, you could guard worker writes to the same file with a multiprocessing.Lock. If the extra writes are too time consuming:
  • Manage back pressure yourself. Use Pool.apply_async or ProcessPoolExecutor.submit to start cpu_count jobs and only submit additional work after writing a result to disk (see the second sketch after this list). It's less automatic than Pool.imap_unordered, but that's the kind of thing you have to deal with when your data starts getting big!
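
A minimal sketch of the first option, keeping the question's globals and overall structure; only do_work changes so that serialization happens in the worker, and None results are skipped when writing:

def do_work(index):
    ref_feature = layer.GetFeature(index)
    if ref_feature:
        try:
            obj = ref_feature.ExportToJson(as_object=True)
            # serialize in the worker so the main process only writes strings
            return f"\x1e{json.dumps(obj)}\n"
        except Exception:
            pass
    return None

# the write loop in run_mp() then becomes:
#     for line in resultset:
#         if line is not None:
#             file.write(line)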
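
And a minimal sketch of the third option, assuming the same module-level layer, out_dir, fc_name and the original do_work() from the question; run_backpressure and max_in_flight are illustrative names, not part of the original script. It keeps only a bounded number of indexes in flight and writes each result as soon as it completes:

import json
from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait
from multiprocessing import cpu_count

def run_backpressure(max_in_flight=2 * cpu_count()):
    fc = layer.GetFeatureCount()
    indexes = iter(range(fc))
    pending = set()
    with ProcessPoolExecutor(max_workers=cpu_count()) as pool, \
            open(f"{out_dir}/{fc_name}.geojsonseq", "w", encoding="utf8") as file:
        # prime the pipeline with a bounded number of jobs
        for _ in range(max_in_flight):
            try:
                pending.add(pool.submit(do_work, next(indexes)))
            except StopIteration:
                break
        while pending:
            # block until at least one job finishes, then write those results
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                obj = future.result()
                if obj is not None:
                    file.write(f"\x1e{json.dumps(obj)}\n")
                # only submit a new index after a finished slot has been drained
                try:
                    pending.add(pool.submit(do_work, next(indexes)))
                except StopIteration:
                    pass

As with the question's Pool-based version, this relies on the worker processes being able to see the module-level layer (e.g. via the fork start method on Linux).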
