
Fastest method of large file processing using concurrent.futures (Python 3.5)

I am trying to grasp multithreading/multiprocessing using concurrent.futures.

I have tried the following sets of code. I understand that I will always be limited by disk I/O, but I want to maximize my RAM and CPU usage as much as possible.

What is the most common/best method for large-scale processing?

How do you use concurrent.futures to process large datasets?

Is there a better method than the ones below?

Method 1:

for folders in os.listdir(path):  # os.path.isdir() returns a bool, not an iterable
    p = multiprocessing.Process(target=process_largeFiles, args=(folders,))
    jobs.append(p)
    p.start()

Method 2:

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    for folders in os.listdir(path):
        executor.submit(process_largeFiles, folders)

Method 3:

with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
    for folders in os.listdir(path):
        executor.submit(process_largeFiles, folders)

Should I attempt to use a process pool and a thread pool together?

Method (thought):

with concurrent.futures.ProcessPoolExecutor(max_workers=10) as process:
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as thread:
        for folders in os.listdir(path):
            process.submit(thread.submit(process_largeFiles, folders))

What is the most efficient method to maximize my RAM and CPU usage in the broadest use case?

I am aware that starting processes takes a bit of time, but is that cost outweighed by the size of the files being processed?

Use ThreadPoolExecutor to open and read the files, then use ProcessPoolExecutor to process the data.

import concurrent.futures
from collections import deque

TPExecutor = concurrent.futures.ThreadPoolExecutor
PPExecutor = concurrent.futures.ProcessPoolExecutor
def get_file(path):
    # I/O-bound step: read the whole file in a worker thread
    with open(path) as f:
        data = f.read()
    return data

def process_large_file(s):
    # CPU-bound stand-in for the real per-file processing
    return sum(ord(c) for c in s)

# replace with the real paths of the files to process
files = [filename1, filename2, filename3, filename4, filename5,
         filename6, filename7, filename8, filename9, filename0]

results = []
completed_futures = deque()

def callback(future, completed=completed_futures):
    completed.append(future)

# read the files concurrently in threads
with TPExecutor(max_workers=4) as thread_pool_executor:
    data_futures = [thread_pool_executor.submit(get_file, path) for path in files]
# process the file contents in separate processes as the reads complete
with PPExecutor() as process_pool_executor:
    for data_future in concurrent.futures.as_completed(data_futures):
        future = process_pool_executor.submit(process_large_file, data_future.result())
        future.add_done_callback(callback)
        # collect any that have finished
        while completed_futures:
            results.append(completed_futures.pop().result())
# pick up anything that finished after the last submission
while completed_futures:
    results.append(completed_futures.pop().result())

I used a done callback so the loop doesn't have to wait on completed futures. I have no idea how that affects efficiency; I used it mainly to simplify the logic/code in the as_completed loop.
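
For comparison, a minimal sketch of the same pipeline without the callback: keep the process-pool futures in a list and drain them with a second as_completed loop (it reuses get_file, process_large_file, files, TPExecutor, and PPExecutor from the example above).

import concurrent.futures

# read the files in threads, exactly as before
with TPExecutor(max_workers=4) as thread_pool_executor:
    data_futures = [thread_pool_executor.submit(get_file, path) for path in files]

# submit processing work as each read finishes, keeping the futures in a list,
# then gather all results in completion order once everything has been submitted
with PPExecutor() as process_pool_executor:
    work_futures = [
        process_pool_executor.submit(process_large_file, data_future.result())
        for data_future in concurrent.futures.as_completed(data_futures)
    ]
    results = [f.result() for f in concurrent.futures.as_completed(work_futures)]

The trade-off is that nothing is collected until every file has been submitted to the process pool, which is exactly the waiting the callback version avoids.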

If you need to throttle file or data submissions because of memory constraints, the code would need to be refactored; one possible sketch is shown below. Depending on file read time and processing time, it is hard to say how much data will be in memory at any given moment. I think gathering results inside the as_completed loop should help mitigate that. data_futures may start completing while the ProcessPoolExecutor is still being set up, so that sequencing may need to be optimized.
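
A rough sketch of such a throttled version, under assumptions that are not part of the original answer (the MAX_IN_FLIGHT cap and the throttled_run helper are made-up names; it reuses TPExecutor, PPExecutor, get_file, and process_large_file from above): keep only a bounded window of files in flight, refilling it as reads and processing jobs finish.

import concurrent.futures

MAX_IN_FLIGHT = 4  # assumed cap on how many files are held in memory at once

def throttled_run(files):
    results = []
    with TPExecutor(max_workers=4) as tp, PPExecutor() as pp:
        path_iter = iter(files)
        read_futures = set()   # files currently being read
        work_futures = set()   # file contents currently being processed
        while True:
            # keep at most MAX_IN_FLIGHT files in memory (being read or processed)
            while len(read_futures) + len(work_futures) < MAX_IN_FLIGHT:
                path = next(path_iter, None)
                if path is None:
                    break
                read_futures.add(tp.submit(get_file, path))
            if not read_futures and not work_futures:
                break  # nothing left to read or process
            done, _ = concurrent.futures.wait(
                read_futures | work_futures,
                return_when=concurrent.futures.FIRST_COMPLETED)
            for fut in done:
                if fut in read_futures:
                    read_futures.remove(fut)
                    # hand the file contents straight to the process pool
                    work_futures.add(pp.submit(process_large_file, fut.result()))
                else:
                    work_futures.remove(fut)
                    results.append(fut.result())
    return results

This bounds resident file data at roughly MAX_IN_FLIGHT files, at the cost of some idle time when read speed and processing speed are badly mismatched.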


 