Multiprocessing and Threading in Python

Question

I'm trying to use multiprocessing in Python, but I think I may not have understood it properly.

To start with, I have a dataframe that contains texts as strings, on which I want to run a regex substitution. The code looks as follows:

import os
import re

import multiprocess
from threading import Thread

def clean_qa():
    for index, row in data.iterrows():
        data["qa"].loc[index] = re.sub(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]", "", str(data["qa"].loc[index]))

if __name__ == '__main__':
    threads = []
    
    for i in range(os.cpu_count()):
        threads.append(Thread(target=clean_qa))
        
    for thread in threads:
        thread.start()
        
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    processes = []

    for i in range(os.cpu_count()):
        processes.append(multiprocess.Process(target=clean_qa))
        
    for process in processes:
        process.start()
        
    for process in processes:
        process.join()

When I run the body of "clean_qa" not as a function but simply by executing the for loop directly, everything works fine and it takes about 3 minutes.

However, when I use multiprocessing or threading, the execution takes about 10 minutes and the text is not cleaned, so the dataframe is the same as before.

So my question is: what did I do wrong, why does it take longer, and why does nothing happen to the dataframe?

Thank you very much!

Answer 1

This is slightly beside the point (though my comments in the original post do address the actual points), but since you're working with a Pandas dataframe, you really never want to loop over it by hand.

Looks like all you actually want here is just:

import re

r = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

def clean_qa():
    # pandas 2.0+ defaults to regex=False, so pass regex=True explicitly
    data["qa"] = data["qa"].str.replace(r, "", regex=True)

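to let Pandas deal with the looping. Note that Pandas does not parallelize this; it still processes the rows one by one internally, but without the per-row indexing and assignment overhead of the manual loop.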
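
For completeness, here is a minimal sketch of how the work could be split across processes if a single process is still too slow. Each worker process operates on its own copy of the data, so changes made inside a worker never reach the parent's dataframe; the cleaned strings have to be returned instead. The names `clean_text` and `clean_parallel` and the sample strings are made up for illustration, and the sketch uses the standard-library `multiprocessing` module rather than the `multiprocess` package from the question.

```python
import re
from multiprocessing import Pool

# Same pattern as in the question, compiled once at import time.
PATTERN = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

def clean_text(text):
    # Pure function: takes one value, returns the cleaned string.
    return PATTERN.sub("", str(text))

def clean_parallel(texts, processes=None):
    # Hypothetical helper: distributes the items over a pool of worker
    # processes and collects the returned strings in the original order.
    with Pool(processes=processes) as pool:
        return pool.map(clean_text, texts, chunksize=1000)

if __name__ == "__main__":
    sample = ["a [note] b!", "----- header ----- Hello, world!"]
    print(clean_parallel(sample))
```

With a dataframe, the result could be assigned back with something like `data["qa"] = clean_parallel(data["qa"].tolist())`. Whether this actually beats the single-process version depends on the data size, since starting processes and pickling the strings has its own cost.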

Answer 2

Regarding threading: an answer to the question linked below has a Python 3.9 example:

# Example from the page linked below, by Xiddoc
from threading import Thread
from time import sleep

# Here is a function that uses the sleep() function. If you called this directly, it would block the main Python execution
def my_independent_function():
    print("Starting to sleep...")
    sleep(10)
    print("Finished sleeping.")

# Make a new thread which will run this function
t = Thread(target=my_independent_function)
# Start it in parallel
t.start()

# You can see that we can still execute other code while the other function is running
for i in range(5):
    print(i)
    sleep(1)

(Taken from this question: Can I run a coroutine in python independently from all other code?)
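
Applied to the code in the question: there, every thread was given the same function that loops over the whole dataframe, so the work was repeated once per thread rather than divided. Below is a minimal sketch of dividing it, assuming the rows are available as a plain list of strings. The names `clean_slice` and `clean_with_threads` and the sample strings are hypothetical, not from the original post.

```python
import re
from threading import Thread

# Same pattern as in the question, compiled once.
PATTERN = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

def clean_slice(texts, results, start, stop):
    # Each thread writes only to its own index range, so no lock is needed.
    for i in range(start, stop):
        results[i] = PATTERN.sub("", texts[i])

def clean_with_threads(texts, n_threads=4):
    results = [None] * len(texts)
    # Ceiling division, so that at most n_threads slices are created.
    step = max(1, -(-len(texts) // n_threads))
    threads = []
    for start in range(0, len(texts), step):
        stop = min(start + step, len(texts))
        t = Thread(target=clean_slice, args=(texts, results, start, stop))
        threads.append(t)
        t.start()
    # Wait for every slice to finish before returning the results.
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(clean_with_threads(["a [note] b!", "----- header ----- Hello, world!"]))
```

In standard CPython builds the GIL lets only one thread run Python bytecode at a time, so for CPU-bound work like regex substitution this will likely not be faster than a plain loop; the sketch only shows how to divide the work instead of repeating it.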

And you probably shouldn't try using threading and multiprocessing simultaneously.

If you'd like to read more general information about multiprocessing/threading in Python, you can see this post: How can I use threading in Python?
