
Memory usage with concurrent.futures.ThreadPoolExecutor in Python3

I am building a script to download and parse benefits information for health insurance plans on the Obamacare exchanges. Part of this requires downloading and parsing the plan benefit JSON files from each individual insurance company. To do this, I am using concurrent.futures.ThreadPoolExecutor with 6 workers to download each file (with urllib), parse and loop through the JSON, and extract the relevant info (which is stored in a nested dictionary within the script).

(running Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32)

The problem is that when I do this concurrently, the script does not seem to release the memory after it has downloaded, parsed, and looped through a JSON file, and after a while it crashes, with malloc raising a memory error.

When I do it serially, with a simple for loop, however, the program does not crash, nor does it take an extreme amount of memory.

def load_json_url(url, timeout):
    # NB: the timeout argument is accepted but never used below
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    resp = urllib.request.urlopen(req).read().decode('utf8')
    return json.loads(resp)



with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_json_url, url, 60): url for url in formulary_urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            # The below timeout isn't raising the TimeoutError.
            data = future.result(timeout=0.01)
            for item in data:
                if item['rxnorm_id'] == drugid:
                    for row in item['plans']:
                        print(row['drug_tier'])
                        plansid_dict[row['plan_id']]['drug_tier'] = row['drug_tier']
                        plansid_dict[row['plan_id']]['prior_authorization'] = row['prior_authorization']
                        plansid_dict[row['plan_id']]['step_therapy'] = row['step_therapy']
                        plansid_dict[row['plan_id']]['quantity_limit'] = row['quantity_limit']
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            downloaded_plans = downloaded_plans + 1

It's not your fault. as_completed() doesn't release its futures until it completes. There's an issue logged already: https://bugs.python.org/issue27144

For now, I think the most common approach is to wrap as_completed() inside another loop that chunkifies the work into a sane number of futures, depending on how much RAM you want to spend and how big your results will be. Each chunk blocks until all of its jobs are done before moving on to the next chunk, so it can be slower or potentially get stuck in the middle for a long time, but I see no other way for now, though I will keep this answer updated when there's a smarter way.
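A minimal sketch of that chunked approach, reusing load_json_url and formulary_urls from the question (the CHUNK_SIZE value is an illustrative assumption, not something the answer specifies):

CHUNK_SIZE = 100  # illustrative; tune to your RAM budget and result size

with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    for i in range(0, len(formulary_urls), CHUNK_SIZE):
        chunk = formulary_urls[i:i + CHUNK_SIZE]
        future_to_url = {executor.submit(load_json_url, url, 60): url for url in chunk}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                # ... process data exactly as in the question ...
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
        # Rebinding future_to_url on the next pass drops this chunk's futures,
        # so their results can be garbage-collected.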

As an alternative solution, you can call add_done_callback on your futures and not use as_completed at all. The key is NOT keeping references to the futures, so the future_to_url dict in the original question is a bad idea.

What I've done is basically:

def do_stuff(future):
    res = future.result()  # handle exceptions here if you need to

f = executor.submit(...)
f.add_done_callback(do_stuff)  # don't store the future anywhere after this
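Applied to the question's setup, a sketch might look like the following; the counter, its lock, and the error handling are illustrative additions, not part of the original answer:

import concurrent.futures
import threading

downloaded_plans = 0
count_lock = threading.Lock()  # callbacks run on worker threads, so guard shared state

def do_stuff(future):
    global downloaded_plans
    try:
        data = future.result()
    except Exception as exc:
        print('a download generated an exception: %s' % exc)
        return
    # ... parse `data` into plansid_dict here ...
    with count_lock:
        downloaded_plans += 1

with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    for url in formulary_urls:
        # Chain the callback without binding the future to a name,
        # so nothing outlives the callback and memory can be reclaimed.
        executor.submit(load_json_url, url, 60).add_done_callback(do_stuff)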

If you use the standard module concurrent.futures and want to process several million items concurrently, the executor's queue of pending work will take up all the free memory.

You can use bounded-pool-executor: https://github.com/mowshon/bounded_pool_executor

pip install bounded-pool-executor

Example:

from bounded_pool_executor import BoundedProcessPoolExecutor
from time import sleep
from random import randint

def do_job(num):
    sleep_sec = randint(1, 10)
    print('value: %d, sleep: %d sec.' % (num, sleep_sec))
    sleep(sleep_sec)

with BoundedProcessPoolExecutor(max_workers=5) as worker:
    for num in range(10000):
        print('#%d Worker initialization' % num)
        worker.submit(do_job, num)
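The question uses threads rather than processes; the same package also ships a BoundedThreadPoolExecutor that should drop in the same way. This sketch assumes the package behaves as its README describes, with submit() blocking once the bounded queue is full:

from bounded_pool_executor import BoundedThreadPoolExecutor

with BoundedThreadPoolExecutor(max_workers=6) as executor:
    for url in formulary_urls:
        # submit() blocks instead of letting the pending queue grow without bound
        executor.submit(load_json_url, url, 60)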

dodysw has correctly pointed out that the common solution is to chunkify the inputs and submit chunks of tasks to the executor. He has also correctly pointed out that you lose some performance by waiting for each chunk to be processed completely before starting to process the next chunk.

I suggest a better solution that feeds a continuous stream of tasks to the executor while enforcing an upper bound on the maximum number of parallel tasks, in order to keep the memory footprint low.

The trick is to use concurrent.futures.wait to keep track of the futures that have completed and those that are still pending:

def load_json_url(url):
    try:
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        resp = urllib.request.urlopen(req).read().decode('utf8')
        return json.loads(resp), None
    except Exception as e:
        return url, e

MAX_WORKERS = 6
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures_done = set()
    futures_notdone = set()
    for url in formulary_urls:
        futures_notdone.add(executor.submit(load_json_url, url))

        if len(futures_notdone) >= MAX_WORKERS:
            done, futures_notdone = concurrent.futures.wait(futures_notdone, return_when=concurrent.futures.FIRST_COMPLETED)
            futures_done.update(done)

    # Collect the futures still in flight once every URL has been submitted;
    # without this, the last few results would be silently dropped.
    done, _ = concurrent.futures.wait(futures_notdone)
    futures_done.update(done)

# Process results.
downloaded_plans = 0
for future in futures_done:
    data, exc = future.result()  # on failure, `data` holds the URL (see load_json_url)
    if exc:
        print('%r generated an exception: %s' % (data, exc))
    else:
        downloaded_plans += 1
        for item in data:
            if item['rxnorm_id'] == drugid:
                for row in item['plans']:
                    print(row['drug_tier'])
                    plansid_dict[row['plan_id']]['drug_tier'] = row['drug_tier']
                    plansid_dict[row['plan_id']]['prior_authorization'] = row['prior_authorization']
                    plansid_dict[row['plan_id']]['step_therapy'] = row['step_therapy']
                    plansid_dict[row['plan_id']]['quantity_limit'] = row['quantity_limit']

Of course, you could also process the results inside the loop at regular intervals in order to empty futures_done from time to time, for example whenever the number of items in futures_done exceeds 1000 (or any other amount that fits your needs). This might come in handy if your dataset is very large and the results alone would otherwise use a lot of memory.
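A sketch of that periodic draining; the MAX_DONE threshold and the process_future helper (which would contain the per-result logic shown above) are hypothetical names introduced here for illustration:

MAX_DONE = 1000  # illustrative threshold; tune to your memory budget

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures_done = set()
    futures_notdone = set()
    for url in formulary_urls:
        futures_notdone.add(executor.submit(load_json_url, url))
        if len(futures_notdone) >= MAX_WORKERS:
            done, futures_notdone = concurrent.futures.wait(
                futures_notdone, return_when=concurrent.futures.FIRST_COMPLETED)
            futures_done.update(done)
        if len(futures_done) >= MAX_DONE:
            for future in futures_done:
                process_future(future)  # hypothetical helper: the per-result logic above
            futures_done.clear()
    # Drain and process whatever is still in flight at the end.
    done, _ = concurrent.futures.wait(futures_notdone)
    for future in done | futures_done:
        process_future(future)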

