
Synchronous reading faster than asynchronous reading using thread pools on moderate-sized JSON files

The existing answers on why asynchronous code can be slower than synchronous code don't cover the scenario I'm dealing with, hence this question.

I'm using Python 3.6.0 on Windows 10 to read 11 identical JSON files, named k80.json through k90.json, 18.1 MB each.

First, I tried reading all 11 files synchronously and sequentially. This completed in 5.07s.

from json import load
from os.path import join
from time import time


def read_config(fname):
    # Parse the JSON file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    print('Starting sequential run.')
    start_time1 = time()

    for fname in in_files:
        print(f'Reading file: {fname}')
        print(f'The JSON file size is {read_config(fname)}')

    read_duration1 = round(time() - start_time1, 2)

    print('Ending sequential run.')
    print(f'Synchronous reading took {read_duration1}s')
    print('\n' * 3)

Results

Starting sequential run.
Reading file: C:\Users\userA\Documents\k80.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k81.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k82.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k83.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k84.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k85.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k86.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k87.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k88.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k89.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k90.json
The JSON file size is 5
Ending sequential run.
Synchronous reading took 5.07s

Next, I tried running it with a ThreadPoolExecutor using a map function call, with 12 threads. This took 5.69s.

from concurrent.futures import ThreadPoolExecutor
from json import load
from os.path import join
from time import time


def read_config(fname):
    # Parse the JSON file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    NUM_THREADS = 12
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    th_pool = ThreadPoolExecutor(max_workers=NUM_THREADS)
    print(f'Starting mapped pre-emptive threaded pool run with {NUM_THREADS} threads.')
    start_time2 = time()

    with th_pool:
        map_iter = th_pool.map(read_config, in_files, timeout=10)

    read_duration2 = round(time() - start_time2, 2)

    map_results = list(map_iter)
    for map_res in map_results:
        print(f'The JSON file size is {map_res}')

    print('Ending mapped pre-emptive threaded pool run.')
    print(f'Mapped asynchronous pre-emptive threaded pool reading took {read_duration2}s')
    print('\n' * 3)

Results

Starting mapped pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending mapped pre-emptive threaded pool run.
Mapped asynchronous pre-emptive threaded pool reading took 5.69s

Finally, I tried running it with a ThreadPoolExecutor using submit function calls, with 12 threads. This took 5.73s.

from concurrent.futures import ThreadPoolExecutor
from json import load
from os.path import join
from time import time


def read_config(fname):
    # Parse the JSON file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    NUM_THREADS = 12
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    th_pool = ThreadPoolExecutor(max_workers=NUM_THREADS)
    results = []
    print(f'Starting submitted pre-emptive threaded pool run with {NUM_THREADS} threads.')
    start_time3 = time()

    with th_pool:
        for fname in in_files:
            results.append(th_pool.submit(read_config, fname))

    read_duration3 = round(time() - start_time3, 2)

    for result in results:
        print(f'The JSON file size is {result.result(timeout=10)}')

    print('Ending submitted pre-emptive threaded pool run.')
    print(f'Submitted asynchronous pre-emptive threaded pool reading took {read_duration3}s')

Results

Starting submitted pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending submitted pre-emptive threaded pool run.
Submitted asynchronous pre-emptive threaded pool reading took 5.73s

Questions

  1. Why does synchronous reading perform faster than threaded reading for fairly large JSON files like these? Given the file size and the number of files being read, I expected threading to be faster.

  2. Are the JSON files much smaller than they would need to be for threading to outperform synchronous reading? If not, what other factors should be considered?

Thank you in advance for your time and help.

Postscript

Thanks to the answer below, I changed the read_config method slightly to introduce a 3s sleep delay (simulating an I/O-bound wait), and now the threaded versions really shine (38.81s vs 9.36s/9.39s).

from time import sleep  # this import is also needed


def read_config(fname):
    # Parse the JSON file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    sleep(3)  # Simulate an activity that waits on I/O.

    return len(json_data)

Results

Starting sequential run.
Reading file: C:\Users\userA\Documents\k80.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k81.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k82.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k83.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k84.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k85.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k86.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k87.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k88.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k89.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k90.json
The JSON file size is 5
Ending sequential run.
Synchronous reading took 38.81s




Starting mapped pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending mapped pre-emptive threaded pool run.
Mapped asynchronous pre-emptive threaded pool reading took 9.36s




Starting submitted pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending submitted pre-emptive threaded pool run.
Submitted asynchronous pre-emptive threaded pool reading took 9.39s

I'm no expert, but in general, threads are useful for speeding up programs that need to wait on I/O. Threading doesn't give you access to parallel CPU cores; it only lets operations run concurrently, sharing the same CPU time and the same Python interpreter (if you want to use more CPUs, you should look at ProcessPoolExecutor).
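As a sketch of that ProcessPoolExecutor alternative (illustrative only, not the asker's code; the payloads here are small in-memory JSON strings standing in for CPU-bound parsing work):

```python
from concurrent.futures import ProcessPoolExecutor
import json


def parse_json_text(text):
    # CPU-bound work: parse a JSON document already held in memory
    # and return the number of top-level entries.
    return len(json.loads(text))


if __name__ == '__main__':
    payloads = ['{"a": 1, "b": 2, "c": 3}'] * 4

    # Unlike threads, each worker is a separate process with its own
    # interpreter (and its own GIL), so CPU-bound parsing can run on
    # separate cores.
    with ProcessPoolExecutor(max_workers=4) as pool:
        sizes = list(pool.map(parse_json_text, payloads))

    print(sizes)  # each payload has 3 top-level keys
```

Note the `if __name__ == '__main__':` guard, which is required on Windows because worker processes re-import the main module.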

For example, if you were reading from several remote databases instead of local files, your program would spend a lot of time waiting on I/O without using local resources. In that case threading can help, because you can wait in parallel, or process one item while waiting on another. However, since all of your data comes from local files, you are probably already saturating your local disk I/O, and you can't read multiple files at once (or at least not faster than reading them sequentially). Your machine still has to do all the same work with the same resources, and neither variant has any "downtime", which is why both take almost the same time.
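A minimal demo of this point (not from the question): a pure-Python CPU-bound loop takes roughly the same wall time whether run serially or across threads, because the GIL lets only one thread execute Python bytecode at a time.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter


def cpu_work(n):
    # Pure-Python CPU-bound loop; the GIL is held while it runs.
    total = 0
    for i in range(n):
        total += i
    return total


N = 1_000_000

start = perf_counter()
serial = [cpu_work(N) for _ in range(4)]
serial_time = perf_counter() - start

start = perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(cpu_work, [N] * 4))
threaded_time = perf_counter() - start

# Both runs execute on one core at a time, so the threaded version
# is not meaningfully faster (often slightly slower, due to
# thread-management overhead).
print(f'serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s')
```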

  1. In this case, your task is processor-bound rather than I/O-bound: your CPU can churn through the data at a fixed rate. If you split the task across multiple threads, it still takes the same time, because your processor just works through each chunk incrementally, one at a time (across the different threads). The only time you get a speed-up is when the task at hand is I/O-bound, e.g. if you are trying to fetch data from a website and that takes a long time relative to the rate at which your CPU could process the data if it already had it.

    For more explanation, see my SO answer to another question.

    In an ideal world, multithreading your case would take the same time as doing the computation serially. In practice, though, actually splitting up each task, handing it to a thread, waiting on each result and then stitching the results back together into the final output takes resources and time. All of that adds up to the roughly 0.6 seconds of extra runtime you see in your parallelised output.

  2. Larger JSON files would not speed up the multithreaded version. For that, your JSON files would have to be hosted on a website. Let's break down what happens in the serial and parallel cases:

    Serial case

    Because the website's data-transfer rate is slow relative to your CPU's data-processing rate, your CPU sits idle waiting for data to become available. It looks like this:

    • The CPU requests JSON_file1.json from the website
    • The website sends JSON_file1.json
    • The CPU processes / finishes processing JSON_file1.json
    • The CPU requests JSON_file2.json from the website...
    • and so on, until every file has been processed

    Parallel case

    Again, the website's data-transfer rate is slow relative to your CPU's data-processing rate, so your CPU would sit idle waiting for data. So if you spread the tasks across threads, you can kick off almost all of the JSON_file requests at (nearly) the same time, across multiple threads.

    • Threads 1-4 make requests to the website for JSON_file1.json, JSON_file2.json, JSON_file3.json, JSON_file4.json
    • The website starts sending each of the four requested JSON_files
    • As each JSON_file is received, the CPU processes it
    • When a task finishes, its thread closes and returns whatever it computed
    • If there are more files to process, another thread is started and the above repeats until all files are processed
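The steps above can be sketched with a thread pool; in this illustration the slow website transfer is simulated with time.sleep, and the file names are placeholders rather than real URLs:

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep, time


def fetch_and_parse(name):
    # Stand-in for requesting a file from a website: the thread
    # spends its time waiting, not computing, so other threads
    # can wait at the same time.
    sleep(0.2)
    return f'{name} processed'


files = [f'JSON_file{i}.json' for i in range(1, 5)]

start = time()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_and_parse, files))
elapsed = time() - start

# The four 0.2s waits overlap, so the whole batch finishes in
# roughly 0.2s instead of the ~0.8s a serial loop would take.
print(results)
print(f'elapsed: {elapsed:.2f}s')
```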

