Fastest way to call a function millions of times in Python

Question

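I have a function, readFile, that I need to call 8.5 million times (essentially I am stress-testing a logger to make sure the logs rotate correctly). I don't care about the function's output or result, only that I run it N times as fast as possible.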

My current solution looks like this:

from threading import Thread
import subprocess

def readFile(filename):
    args = ["/usr/bin/ls", filename]
    subprocess.run(args)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)
    
    # Wait for all the reads to finish
    while len(threads):
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)

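readFile has been simplified, but the concept is the same. I need to run readFile 8.5 million times and wait for all of the reads to finish. By my mental math, this spawns about 60 threads per second, which means it would take about 40 hours to finish. Ideally, it would finish in 1-8 hours.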
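The mental math above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper names `hours_at_rate` and `rate_for_hours` are made up for illustration, and it assumes a steady call rate.

```python
TOTAL_CALLS = 8_500_000  # number of readFile calls from the question


def hours_at_rate(calls_per_second, total=TOTAL_CALLS):
    """Hours needed to finish `total` calls at a steady rate."""
    return total / calls_per_second / 3600


def rate_for_hours(hours, total=TOTAL_CALLS):
    """Steady calls per second needed to finish `total` calls in `hours`."""
    return total / (hours * 3600)


if __name__ == "__main__":
    print(f"at 60 calls/s: {hours_at_rate(60):.1f} hours")
    print(f"to finish in 8 hours: {rate_for_hours(8):.0f} calls/s")
    print(f"to finish in 1 hour: {rate_for_hours(1):.0f} calls/s")
```

So the 1-8 hour target corresponds to a sustained rate of roughly 300 to 2400 calls per second.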

Is this possible? Or is the number of iterations simply too high to finish in a reasonable amount of time?

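Oddly, when I wrote a test script, I was able to spawn a thread roughly every ~0.0005 seconds, which should equate to about 2000 threads per second, but that is not what happens here.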

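I considered iterating 8500000 / 10 times and spawning a thread on each iteration that runs the readFile function 10 times, which should cut the time by ~90%. However, that caused some problems with blocked resources, and I think passing a lock around would be a bit complicated if the function has to stay usable from code that doesn't involve threads.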

Any tips?

Answer 1

Based on @blarg's comment, and on scripts I have used with multiprocessing, you could consider the following.

It simply reads the same file once for every entry in the list. Here I'm looking at 1M reads.

With 1 core it takes about 50 seconds. With 8 cores it comes down to around 22 seconds. This is on a Windows PC, but I also use these scripts on Linux EC2 (AWS) instances.

Just put this in a Python file and run it:

import os
import time
from multiprocessing import Pool

def readfile(fn):
    # Open the file and close it again right away; the contents are not needed
    with open(fn, "r"):
        pass

def _multiprocess(mylist, num_proc):
    # Spread the calls over num_proc worker processes and wait for all of them
    with Pool(num_proc) as pool:
        r = pool.map(readfile, mylist)
    return r

if __name__ == "__main__":
    # Workaround, presumably for runners (such as some IDE consoles) where
    # multiprocessing complains that __main__ has no __spec__
    __spec__ = None

    # Use the system CPU count, or set the number of processes explicitly
    num_proc = os.cpu_count()
    num_proc = 1  # remove this line to use all cores

    start = time.time()
    # You'll want 8.5M here, but first test that it works with a smaller number.
    # Note that multiprocessing is slow for a low number of reads: 8 cores are
    # slower than 1 core until you reach a certain point, after which
    # multiprocessing is worth it.
    mylist = ["test.txt"] * 1000000
    rs = _multiprocess(mylist, num_proc=num_proc)
    print('total seconds,', time.time() - start)
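The note about multiprocessing being slow for a small number of reads is largely about per-task overhead: each call has to be shipped to a worker process. One possible way to reduce that overhead is to pass a `chunksize` so tasks travel in batches. The following is a minimal sketch, not a benchmark; `touch_file`, `run_batched`, and the chunk size of 1000 are illustrative assumptions, and a generator is used so the full list of names never has to be built.

```python
import os
import tempfile
from multiprocessing import Pool


def touch_file(path):
    # Hypothetical stand-in for the real work: open and close the file.
    with open(path, "r"):
        pass
    return 1


def run_batched(path, n_calls, num_proc, chunksize):
    """Call touch_file(path) n_calls times; return how many calls completed."""
    # A generator avoids materialising n_calls copies of the path up front.
    tasks = (path for _ in range(n_calls))
    with Pool(num_proc) as pool:
        # chunksize groups tasks so each trip to a worker carries many calls.
        return sum(pool.imap_unordered(touch_file, tasks, chunksize=chunksize))


if __name__ == "__main__":
    # Create a throwaway file so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("hello\n")
    try:
        done = run_batched(tmp.name, 100_000, os.cpu_count(), chunksize=1000)
        print("calls completed:", done)
    finally:
        os.remove(tmp.name)
```

Whether this helps, and which chunk size works best, depends on the machine and on how expensive the real function is, so it is worth timing a few values.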

Answer 2

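I think you should reconsider using subprocess here. If all you want to do is run the ls command, I think os.system is the better choice, since it should reduce the resources consumed while your code holds the GIL.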

You should also add a short delay with time.sleep() while waiting for the threads to finish, to reduce resource consumption.

from threading import Thread
import os
import time

def readFile(filename):
    os.system("/usr/bin/ls "+filename)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)
    
    # Wait for all the reads to finish
    while len(threads):
        time.sleep(0.1) # put this delay to reduce resource consumption while waiting
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
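As an alternative to starting one thread per call and polling is_alive(), the work could be split across a small, fixed number of worker threads, each running its share of the calls in a plain loop. This is a different technique from the code above (it uses concurrent.futures.ThreadPoolExecutor), offered only as a sketch: `run_batch`, `split_evenly`, `run_many`, and the default of 8 workers are illustrative assumptions. No lock is needed because the workers share no mutable state.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_batch(cmd, repeats):
    """Run `cmd` `repeats` times in sequence; return the number of runs."""
    for _ in range(repeats):
        # Output is discarded because only the side effect matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return repeats


def split_evenly(total, parts):
    """Split `total` into `parts` integer shares that differ by at most 1."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def run_many(cmd, total, max_workers=8):
    """Run `cmd` `total` times using at most `max_workers` threads."""
    shares = split_evenly(total, max_workers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Leaving the with-block waits for every worker, so no polling loop
        # or time.sleep() is needed.
        return sum(pool.map(lambda n: run_batch(cmd, n), shares))


if __name__ == "__main__":
    # Small count for a first try; the question's target is 8500000.
    print(run_many(["/usr/bin/ls", "test.log"], total=1000))
```

Each call still starts a new process, so process creation is likely to remain the dominant cost; the sketch only removes the overhead of creating and tracking millions of threads.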
