
Synchronous reading faster than asynchronous reading using thread pools on moderate-sized JSON files

The answers to the earlier question "asynchronous slower than synchronous" don't cover the scenario I'm dealing with, hence this question.

I'm using Python 3.6.0 on Windows 10 to read 11 identical JSON files named k80.json to k90.json, 18.1 MB each.

First, I tried reading all 11 files synchronously and sequentially. This took 5.07s to complete.

from json import load
from os.path import join
from time import time


def read_config(fname):
    # Parse the file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    print('Starting sequential run.')
    start_time1 = time()

    for fname in in_files:
        print(f'Reading file: {fname}')
        print(f'The JSON file size is {read_config(fname)}')

    read_duration1 = round(time() - start_time1, 2)

    print('Ending sequential run.')
    print(f'Synchronous reading took {read_duration1}s')
    print('\n' * 3)

Results

Starting sequential run.
Reading file: C:\Users\userA\Documents\k80.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k81.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k82.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k83.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k84.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k85.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k86.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k87.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k88.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k89.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k90.json
The JSON file size is 5
Ending sequential run.
Synchronous reading took 5.07s

Next, I tried running this with a ThreadPoolExecutor using the map function call, with 12 threads. This took 5.69s.

from concurrent.futures import ThreadPoolExecutor
from json import load
from os.path import join
from time import time


def read_config(fname):
    # Parse the file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    NUM_THREADS = 12
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    th_pool = ThreadPoolExecutor(max_workers=NUM_THREADS)
    print(f'Starting mapped pre-emptive threaded pool run with {NUM_THREADS} threads.')
    start_time2 = time()

    with th_pool:
        map_iter = th_pool.map(read_config, in_files, timeout=10)

    read_duration2 = round(time() - start_time2, 2)

    map_results = list(map_iter)
    for map_res in map_results:
        print(f'The JSON file size is {map_res}')

    print('Ending mapped pre-emptive threaded pool run.')
    print(f'Mapped asynchronous pre-emptive threaded pool reading took {read_duration2}s')
    print('\n' * 3)

Results

Starting mapped pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending mapped pre-emptive threaded pool run.
Mapped asynchronous pre-emptive threaded pool reading took 5.69s

Finally, I tried running this with a ThreadPoolExecutor using the submit function call, with 12 threads. This took 5.73s.

from concurrent.futures import ThreadPoolExecutor
from json import load
from os.path import join
from time import time


def read_config(fname):
    # Parse the file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    NUM_THREADS = 12
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    th_pool = ThreadPoolExecutor(max_workers=NUM_THREADS)
    results = []
    print(f'Starting submitted pre-emptive threaded pool run with {NUM_THREADS} threads.')
    start_time3 = time()

    with th_pool:
        for fname in in_files:
            results.append(th_pool.submit(read_config, fname))

    read_duration3 = round(time() - start_time3, 2)

    for result in results:
        print(f'The JSON file size is {result.result(timeout=10)}')

    print('Ending submitted pre-emptive threaded pool run.')
    print(f'Submitted asynchronous pre-emptive threaded pool reading took {read_duration3}s')

Results

Starting submitted pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending submitted pre-emptive threaded pool run.
Submitted asynchronous pre-emptive threaded pool reading took 5.73s

Questions

  1. Why does synchronous reading perform faster than threading when reading fairly large JSON files like these? Given the file size and the number of files being read, I expected threading to be faster.

  2. Would the JSON files need to be much larger than this before threading performs better than synchronous reading? If not, what other factors should be considered?

Thanks in advance for your time and help.

Postscript

Thanks to the answers below, I changed the read_config method slightly to introduce a 3s sleep delay (simulating an I/O-wait operation), and now the threaded versions really shine (38.81s vs 9.36s and 9.39s).

from time import sleep


def read_config(fname):
    # Parse the file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    sleep(3)  # Simulate an activity that waits on I/O.

    return len(json_data)

Results

Starting sequential run.
Reading file: C:\Users\userA\Documents\k80.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k81.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k82.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k83.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k84.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k85.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k86.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k87.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k88.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k89.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k90.json
The JSON file size is 5
Ending sequential run.
Synchronous reading took 38.81s




Starting mapped pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending mapped pre-emptive threaded pool run.
Mapped asynchronous pre-emptive threaded pool reading took 9.36s




Starting submitted pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending submitted pre-emptive threaded pool run.
Submitted asynchronous pre-emptive threaded pool reading took 9.39s

I'm not an expert, but in general, threading is useful for speeding up programs that have to wait on I/O. Threading doesn't give you access to parallel CPU threads; it only lets operations run concurrently, sharing the same CPU time and the same Python interpreter (if you want access to more CPUs, you should look at ProcessPoolExecutor).
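
As a minimal sketch of that swap, assuming the same read_config logic (the bare file names below are placeholders for the question's full paths): ProcessPoolExecutor exposes the same map/submit API as ThreadPoolExecutor, but its workers are separate processes, so on Windows the pool must be created under the __main__ guard.

from concurrent.futures import ProcessPoolExecutor
from json import load


def read_config(fname):
    # Same worker as in the question; each process has its own interpreter,
    # so the JSON parsing can run on a separate CPU core.
    with open(fname) as json_fp:
        return len(load(json_fp))


if __name__ == '__main__':
    # Placeholder file names standing in for the question's full paths.
    in_files = [f'k{idx}.json' for idx in range(80, 91)]

    with ProcessPoolExecutor(max_workers=4) as proc_pool:
        for size in proc_pool.map(read_config, in_files):
            print(f'The JSON file size is {size}')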

For example, if you were reading from several remote databases rather than local files, your program would spend a lot of time waiting on I/O while using few local resources. In that case, threading could help, because you could do the waiting in parallel, or process one item while waiting on another. But since all of your data comes from local files, you are probably already maxing out your local disk I/O: you cannot read multiple files at once (or at least not faster than reading them sequentially). Your machine still has to complete all of the same work with the same resources, and neither variant has any "downtime", which is why they take almost the same amount of time.

  1. In this case, your task is processor-bound rather than I/O-bound: your CPU can only chew through the data at a fixed rate. If you split the task across multiple threads, it still takes the same amount of time, because your processor just works through each chunk incrementally, one at a time (across the different threads). The only time you get a speedup is when the task at hand is I/O-bound, e.g. if you are trying to fetch data from a website that takes a long time to deliver it (relative to the rate at which your CPU could process that data if it already had it). See the first sketch after this list for a small demonstration.

    For more explanation, see my SO answer to another question.

    In an ideal world, multithreading would, in your case, take the same time as doing the computation serially. In practice, however, actually splitting up each task, allocating it to a thread, waiting on each result, and finally stitching the results back together to hand you the final output takes resources and time. All of that adds up to the roughly 0.6 seconds of extra runtime you see in your parallelized output.

  2. Larger JSON files would not speed up multithreading. For that, your JSON files would have to be hosted on a website. Let's break down what happens in the serial and parallel cases (see the second sketch after this list):

    Serial case

    Since the website's data-transfer rate is slow relative to the rate at which your CPU can process data, your CPU sits idle waiting for data to become available. It looks like this:

    • CPU requests JSON_file1.json from the website
    • Website sends JSON_file1.json
    • CPU processes/finishes processing JSON_file1.json
    • CPU requests JSON_file2.json from the website ...
    • and so on, repeating until every file has been processed

    Parallel case

    Since the website's data-transfer rate is slow relative to your CPU's processing rate, your CPU would otherwise sit idle waiting for data. So if you distribute the tasks across threads, you can kick off each JSON_file request almost simultaneously (across multiple threads).

    • Threads 1-4 make requests to the website for JSON_file1.json, JSON_file2.json, JSON_file3.json, JSON_file4.json
    • The website starts sending each of the four requested JSON_files
    • As each JSON_file is received, the CPU processes it
    • When a task completes, its thread closes and returns any computed result
    • If there are more files to process, another thread starts and repeats the above until all files have been processed
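
To make the point in 1. concrete, here is a small self-contained timing sketch (the sum_squares workload and task counts are invented for illustration): under CPython's GIL, a pure-CPU task takes roughly the same wall-clock time across threads as it does sequentially, plus a little pool overhead.

from concurrent.futures import ThreadPoolExecutor
from time import time


def sum_squares(n):
    # Pure-CPU work with no I/O, so threads cannot overlap it under the GIL.
    return sum(i * i for i in range(n))


if __name__ == '__main__':
    tasks = [2_000_000] * 8  # invented workload sizes

    start = time()
    seq_results = [sum_squares(n) for n in tasks]
    print(f'Sequential run took {round(time() - start, 2)}s')

    start = time()
    with ThreadPoolExecutor(max_workers=8) as th_pool:
        thr_results = list(th_pool.map(sum_squares, tasks))
    print(f'Threaded run took {round(time() - start, 2)}s')  # expect no speedup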
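
And here is a sketch of the parallel case described in 2., using only the standard library; the example.com URLs are hypothetical stand-ins for JSON files hosted on a website.

from concurrent.futures import ThreadPoolExecutor
from json import loads
from time import time
from urllib.request import urlopen


def fetch_json(url):
    # The thread spends most of its time blocked on the network here,
    # which is exactly when the GIL is released and other threads can run.
    with urlopen(url, timeout=10) as resp:
        return len(loads(resp.read()))


if __name__ == '__main__':
    # Hypothetical URLs standing in for JSON files hosted on a website.
    urls = [f'https://example.com/k{idx}.json' for idx in range(80, 91)]

    start = time()
    with ThreadPoolExecutor(max_workers=4) as th_pool:
        for size in th_pool.map(fetch_json, urls):
            print(f'The JSON file size is {size}')
    print(f'Threaded fetching took {round(time() - start, 2)}s')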
