Fastest way to call a function millions of times in Python

Question

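I have a function, readFile, that I need to call 8.5 million times (essentially stress-testing a logger to make sure the logs rotate correctly). I don't care about the function's output or result, only that I run it N times as quickly as possible.

My current solution is this: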

from threading import Thread
import subprocess

def readFile(filename):
    args = ["/usr/bin/ls", filename]
    subprocess.run(args)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)
    
    # Wait for all the reads to finish
    while len(threads):
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)

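readFile has been simplified, but the concept is the same. I need to run readFile 8.5 million times, and I need to wait for all the reads to finish. By my mental math, this spawns about 60 threads per second, which means it will take about 40 hours to finish. Ideally, this would finish in 1-8 hours.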
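The estimate above can be checked with a few lines of arithmetic. This is only a sketch; the helper names `hours_at` and `rate_for` are made up for illustration, and the rates are the ones quoted in the question.

```python
TOTAL_CALLS = 8_500_000

def hours_at(rate_per_second):
    """Hours needed to make TOTAL_CALLS calls at a steady rate."""
    return TOTAL_CALLS / rate_per_second / 3600

def rate_for(hours):
    """Calls per second needed to finish TOTAL_CALLS within the given hours."""
    return TOTAL_CALLS / (hours * 3600)

print(f"{hours_at(60):.1f} h at 60 calls/s")   # about 39.4 h
print(f"{rate_for(8):.0f} calls/s for 8 h")    # about 295
print(f"{rate_for(1):.0f} calls/s for 1 h")    # about 2361
```

So the 1-8 hour target corresponds to roughly 300 to 2400 calls per second.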

Is this possible? Is the number of iterations simply too high for this to finish in a reasonable amount of time?

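Oddly, when I wrote a test script, I was able to spawn a thread roughly every 0.0005 seconds, which should equate to about 2000 threads per second, but that is not the case here.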
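One plausible explanation for the gap is that the test script measured thread creation alone, while each thread in the real run also launches a child process. A minimal sketch for measuring the bare thread rate, assuming a no-op target (`noop` and `spawn_rate` are hypothetical names, not taken from the original test script):

```python
import time
from threading import Thread

def noop():
    """Stand-in target that does no work at all."""
    pass

def spawn_rate(n=2000):
    """Start and join n no-op threads; return threads per second."""
    start = time.perf_counter()
    threads = []
    for _ in range(n):
        t = Thread(target=noop)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return n / elapsed

if __name__ == "__main__":
    print(f"{spawn_rate():.0f} threads/s with a no-op target")
```

Swapping `noop` for the real readFile would show how much of the time goes to the subprocess call rather than to the thread itself.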

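I considered iterating 8500000 / 10 times and spawning a thread that then runs the readFile function 10 times, which should reduce the time by ~90%. However, it caused some issues with blocked resources, and I think passing in a lock would get complicated if the function is to stay usable by code that does not involve threads.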
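The batching idea above could be sketched with a fixed-size thread pool, so that the function itself needs no lock and stays callable from non-threaded code. This is an illustration under assumptions, not the asker's code: `run_batch`, `run_batched`, and the default of 32 workers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(func, arg, count):
    """Call func(arg) count times in the current thread."""
    for _ in range(count):
        func(arg)

def run_batched(func, arg, total, batch_size=10, max_workers=32):
    """Call func(arg) `total` times using a fixed pool of worker threads."""
    full, rest = divmod(total, batch_size)
    sizes = [batch_size] * full + ([rest] if rest else [])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any worker exception is re-raised here
        list(pool.map(lambda n: run_batch(func, arg, n), sizes))
    # leaving the with-block waits for every batch to finish
```

With the function from the question this would be called as `run_batched(readFile, "test.log", 8_500_000)`. The pool reuses a small number of threads instead of creating one per call.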

Any tips?

Answer 1

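Based on @blarg's comment and a script I have used with multiprocessing, consider the following.

It simply reads the same file once for every entry in the list. Here I am looking at 1M reads.

With 1 core it takes about 50 seconds. With 8 cores it comes down to around 22 seconds. This is on a Windows PC, but I also use these scripts on Linux EC2 (AWS) instances.

Just put this in a Python file and run it: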

import os
import time
from multiprocessing import Pool
from itertools import repeat

def readfile(fn):
    # open the file and close it again right away; the contents are not needed
    with open(fn, "r"):
        pass

def _multiprocess(mylist, num_proc):
    with Pool(num_proc) as pool:
        r = pool.starmap(readfile, zip(mylist))
        pool.close()
        pool.join()
    return r

if __name__ == "__main__":
    __spec__=None

    # use the system cpus or change explicitly
    num_proc = os.cpu_count()
    num_proc = 1  # overrides the line above; remove this line to use all cores

    start = time.time()
    # You'll want 8.5M here, but first test that it works with a smaller number.
    # Note: multiprocessing is slow for a low number of reads, meaning 8 cores
    # are slower than 1 core until you reach a certain point; beyond that,
    # multiprocessing is worth it.
    mylist = ["test.txt"] * 1000000
    rs = _multiprocess(mylist, num_proc=num_proc)
    print('total seconds,', time.time()-start )
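A variation on the script above, offered only as a sketch: since the results are discarded anyway, the inputs can be generated lazily and fed to `Pool.imap_unordered` with an explicit `chunksize`, which avoids building an 8.5M-element list of arguments and a matching list of results. The names `touch_file` and `run_reads` and the chunk size of 10,000 are assumptions for illustration.

```python
from multiprocessing import Pool

def touch_file(path):
    """Open and immediately close the file; the contents are ignored."""
    with open(path, "r"):
        pass

def run_reads(path, total, num_proc=None, chunksize=10_000):
    """Run touch_file(path) `total` times across num_proc worker processes."""
    paths = (path for _ in range(total))  # lazy, so no huge list is built
    with Pool(num_proc) as pool:  # num_proc=None means os.cpu_count() workers
        # imap_unordered defaults to chunksize=1, so a larger value is passed
        # explicitly to keep traffic between processes low
        for _ in pool.imap_unordered(touch_file, paths, chunksize=chunksize):
            pass  # results are not needed
```

As in the script above, the call `run_reads("test.txt", 1_000_000)` should sit under an `if __name__ == "__main__":` guard, which is required on Windows.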

Answer 2

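I think you should reconsider the use of subprocess here. If you only want to execute the ls command, I think it is better to use os.system, since it should reduce resource consumption under the GIL.

You should also add a small delay with time.sleep() while waiting for the threads to finish, to reduce resource consumption.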

from threading import Thread
import os
import time

def readFile(filename):
    os.system("/usr/bin/ls "+filename)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)
    
    # Wait for all the reads to finish
    while len(threads):
        time.sleep(0.1) # put this delay to reduce resource consumption while waiting
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
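As an alternative to polling with a delay, the waiting loop can be sketched with `Thread.join()`, which blocks until a thread has finished. The helper name `run_and_wait` is hypothetical, and the sketch assumes a count small enough that all threads can exist at once; 8.5 million calls would have to be processed in batches.

```python
from threading import Thread

def run_and_wait(target, args, count):
    """Start `count` threads running target(*args) and block until all finish."""
    threads = [Thread(target=target, args=args) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks without spinning; returns at once if already done
```

For example, `run_and_wait(readFile, ("test.log",), 1000)` runs one batch of 1000 reads and returns once every thread in it is done.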

Solution 1
0 2022-03-27 02:48:47

Solution 2
0 2022-06-04 13:09:47
