Fastest way to call a function millions of times in Python

I have a function readFile that I need to call 8.5 million times (essentially stress-testing a logger to ensure the log rotates correctly). I don't care about the output/result of the function, only that I run it N times as quickly as possible.

My current solution is this:

from threading import Thread
import subprocess

def readFile(filename):
    args = ["/usr/bin/ls", filename]
    subprocess.run(args)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)
    
    # Wait for all the reads to finish
    while len(threads):
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)

readFile has been simplified, but the concept is the same. I need to run readFile 8.5 million times, and I need to wait for all the reads to finish. Based on my mental math, this spawns ~60 threads per second, which means it will take ~40 hours to finish (8,500,000 / 60 ≈ 141,700 seconds ≈ 39.4 hours). Ideally, this would finish within 1-8 hours.

Is this possible? Is the number of iterations simply too high for this to be done in a reasonable span of time?

Oddly enough, when I wrote a test script, I was able to generate a thread about every ~0.0005 seconds, which should equate to ~2000 threads per second, but this is not the case here.
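
For reference, a spawn-rate micro-benchmark along those lines (a sketch, not the original test script; the no-op target is a stand-in) might look like:

import time
from threading import Thread

def noop():
    pass

# Time how long it takes to spawn a batch of short-lived threads
n = 10000
start = time.time()
for _ in range(n):
    Thread(target=noop).start()
elapsed = time.time() - start
print("seconds per thread:", elapsed / n)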

I considered iterating 8500000 / 10 times, and spawning a thread which then runs the readFile function 10 times (sketched below), which should decrease the amount of time by ~90%, but it caused some issues with blocking resources, and I think passing a lock around would be a bit complicated insofar as keeping the function usable by callers that don't incorporate threading.
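
A minimal sketch of that batching idea, for illustration (the readFileBatch helper is hypothetical, just to show the shape):

from threading import Thread
import subprocess

def readFileBatch(filename, n):
    # Hypothetical helper: run the simplified readFile body n times in one thread
    for _ in range(n):
        subprocess.run(["/usr/bin/ls", filename])

def main():
    filename = "test.log"
    batch = 10
    threads = set()
    for i in range(8500000 // batch):
        thread = Thread(target=readFileBatch, args=(filename, batch))
        thread.start()
        threads.add(thread)
    # Wait for all the batches to finish
    for thread in threads:
        thread.join()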

Any tips?

Based on @blarg's comment, and scripts I've written using multiprocessing, the following can be considered.

It simply reads the same file repeatedly, as many times as the list is long. Here I'm looking at 1M reads.

With 1 core it takes around 50 seconds. With 8 cores it's down to around 22 seconds. This is on a Windows PC, but I use these scripts on Linux EC2 (AWS) instances as well.

Just put this in a Python file and run it:

import os
import time
from multiprocessing import Pool

def readfile(fn):
    # Open and close the file; the contents don't matter for this stress test
    with open(fn, "r"):
        pass

def _multiprocess(mylist, num_proc):
    with Pool(num_proc) as pool:
        # zip() wraps each filename in a 1-tuple so starmap can unpack it
        r = pool.starmap(readfile, zip(mylist))
        pool.close()
        pool.join()
    return r

if __name__ == "__main__":
    __spec__ = None  # workaround for multiprocessing under some IDEs (e.g. Spyder)

    # Use all system CPUs, or override explicitly (1 here for the single-core baseline)
    num_proc = os.cpu_count()
    num_proc = 1

    start = time.time()
    # Here you'll want 8.5M, but first test that it works with a smaller number.
    # Note that multiprocessing is slow with a low number of reads: 8 cores is
    # slower than 1 core until you reach a certain point, then it becomes worth it.
    mylist = ["test.txt"] * 1000000
    rs = _multiprocess(mylist, num_proc=num_proc)
    print('total seconds,', time.time() - start)
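
One note on the overhead mentioned in the comment above: Pool.map and Pool.starmap both accept a chunksize argument that hands each worker a batch of tasks per dispatch, which can reduce the inter-process communication cost that dominates with jobs this small. A minimal sketch (the chunksize value is an assumption to tune, not part of the original answer):

import os
from multiprocessing import Pool

def readfile(fn):
    # Same open-and-close body as above
    with open(fn, "r"):
        pass

if __name__ == "__main__":
    mylist = ["test.txt"] * 1000000
    with Pool(os.cpu_count()) as pool:
        # chunksize=1000 sends 1000 filenames per worker dispatch,
        # cutting the per-task IPC overhead for tiny tasks like this
        pool.map(readfile, mylist, chunksize=1000)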

I think you should reconsider using subprocess here. If you just want to execute an ls command, I think it's better to use os.system, since it carries less overhead than subprocess.run and reduces resource consumption.

Also, you should add a small delay with time.sleep() while waiting for the threads to finish, to reduce resource consumption:

from threading import Thread
import os
import time

def readFile(filename):
    os.system("/usr/bin/ls "+filename)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)
    
    # Wait for all the reads to finish
    while len(threads):
        time.sleep(0.1) # put this delay to reduce resource consumption while waiting
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
