简体   繁体   中英

Multiprocessing and Threading in Python

i'm trying to handle multiprocessing in python, however, i think i might did not understand it properly.

To start with, i have dataframe, which contains texts as string, on which i want to perform some regex. The code looks as follows:

import multiprocess 
from threading import Thread

def clean_qa():
    for index, row in data.iterrows():
        data["qa"].loc[index] = re.sub("(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]", "",  str(data["qa"].loc[index]))

if __name__ == '__main__':
    threads = []
    
    for i in range(os.cpu_count()):
        threads.append(Thread(target=test_qa))
        
    for thread in threads:
        thread.start()
        
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    processes = []

    for i in range(os.cpu_count()):
        processes.append(multiprocess.Process(target=test_qa))
        
    for process in processes:
        process.start()
        
    for process in processes:
        process.join()
    

When i run the function "clean_qa" not as function but simply by executing the for loop, everything works fine and it takes about 3 minutes.

However, when i use multiprocessing or threading, first of all, the execution takes about 10 minutes, and the text is not cleaned, so the dataframe is as before.

Therefore my question, what did i do wrong, why does it take longer and why does nothing happen to the dataframe?

Thank you very much!

This is slightly beside the point (though my comments in the original post do address the actual points), but since you're working with a Pandas dataframe, you really never want to loop over it by hand.

Looks like all you actually want here is just:

r = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

def clean_qa():
    data["qa"] = data["qa"].str.replace(r, "")

to let Pandas deal with the looping and parallelization.

Answering about Threading, in answer to this question there's a python 3.9 example:

#example from the page below by Xiddoc
from threading import Thread
from time import sleep

# Here is a function that uses the sleep() function. If you called this directly, it would stop the main Python execution
def my_independent_function():
    print("Starting to sleep...")
    sleep(10)
    print("Finished sleeping.")

# Make a new thread which will run this function
t = Thread(target=my_independent_function)
# Start it in parallel
t.start()

# You can see that we can still execute other code, while other function is running
for i in range(5):
    print(i)
    sleep(1)

(Taken from this question: Can I run a coroutine in python independently from all other code? )

And you probably shouldn't try using Threading and multiprocessing simultaneously.

If you'd like to read more general information about multiprocessing\threading in python, you can see this post: How can I use threading in Python?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM