简体   繁体   中英

Python 3.8 Convert for loop to multiprocessing/multithreading

I am new to multiprocessing I would really appreciate it if someone can guide/help me here. I have the following for loop which gets some data from the two functions. The code looks like this

    for a in accounts:
        dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
        group_users[a['Email']] = get_group_users(a['Id'], adConn)

    print(f"Users part of DL - {dl_users}")
    print(f"Users part of groups - {group_users}")
    adConn.unbind()

This works fine and gets all the results but recently I have noticed it takes a lot of time to get the list of users ie dl_users and group_users. It takes almost 14-15 mins to complete. I am looking for ways where I can speed up the function and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users makes calls for LDAP, so I am not 100% sure if I should be converting this to multiprocessing or multithreading. Any suggestion would be of big help

As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending http requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data, making calculations...). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and -threading introduce overhead that might not be worth adding.

That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).

Multiprocessing

from multiprocessing import Pool


# Pick the amount of processes that works best for you
processes = 4

with Pool(processes) as pool:
    processed = pool.map(your_func, your_data)

Where your_func is a function to apply to each element of your_data , which is an iterable. If you need to provide some other parameters to the callable, you can use a lambda function:

processed = pool.map(lambda item: your_func(item, some_kwarg="some value"), your_data)

Multithreading

The API for multithreading is very similar:

from concurrent.futures import ThreadPoolExecutor


# Pick the amount of workers that works best for you.
# Most likely equal to the amount of threads of your machine.
workers = 4

with ThreadPoolExecutor(workers) as pool:
    processed = pool.map(your_func, your_data)

If you want to avoid having to store your_data in memory if you need some attribute of the items instead of the items itself, you can use a generator:

processed = pool.map(your_func, (account["Email"] for account in accounts))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM