Multiprocessing and threading in Python

Question

I'm trying to use multiprocessing in Python, but I think I may not have understood it properly.

To start with, I have a dataframe that contains texts as strings, and I want to apply a regex to them. The code looks as follows:

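import os
import re

import multiprocess  # third-party fork of the standard multiprocessing module
from threading import Thread

# data is the pandas DataFrame described above; it is defined elsewhere.
def clean_qa():
    for index, row in data.iterrows():
        data["qa"].loc[index] = re.sub(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]", "",  str(data["qa"].loc[index]))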

if __name__ == '__main__':
    threads = []
    
    for i in range(os.cpu_count()):
        threads.append(Thread(target=clean_qa))
        
    for thread in threads:
        thread.start()
        
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    processes = []

    for i in range(os.cpu_count()):
        processes.append(multiprocess.Process(target=clean_qa))
        
    for process in processes:
        process.start()
        
    for process in processes:
        process.join()

When I run the body of "clean_qa" directly, not as a function but simply by executing the for loop, everything works fine and it takes about 3 minutes.

However, when I use multiprocessing or threading, the execution takes about 10 minutes and the text is not cleaned, so the dataframe is unchanged.

So my question is: what did I do wrong, why does it take longer, and why does nothing happen to the dataframe?

Thank you very much!

Answer 1

This is slightly beside the point (though my comments on the original post do address the actual points), but since you're working with a Pandas dataframe, you really never want to loop over it by hand.

Looks like all you actually want here is just:

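import re

r = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

def clean_qa():
    # regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
    data["qa"] = data["qa"].str.replace(r, "", regex=True)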

to let Pandas deal with the looping. This still runs in a single process, but it avoids the slow row-by-row iterrows() loop and the per-row .loc assignment.
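If a single process is still too slow for a large column, the work can be split across processes explicitly. The following is a minimal sketch, not the asker's code: the helper names clean_text, clean_chunk, split_into_chunks and clean_parallel are hypothetical, and it uses the standard-library multiprocessing module on a plain list of strings instead of the dataframe. It also points at the likely cause of both symptoms in the question: every process started there ran the whole loop on its own copy of data, so the work was repeated once per process and none of the changes reached the parent. In the sketch, each worker gets a different chunk and returns its result to the parent.

```python
import re
from multiprocessing import Pool

# Same pattern as in the question, compiled once.
PATTERN = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")


def clean_text(text):
    """Apply the regex to a single value."""
    return PATTERN.sub("", str(text))


def clean_chunk(chunk):
    """Clean every string in one chunk; this runs inside a worker process."""
    return [clean_text(text) for text in chunk]


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def clean_parallel(texts, workers=4):
    """Clean texts in worker processes and return the results in input order."""
    chunks = split_into_chunks(texts, workers)
    with Pool(processes=workers) as pool:
        cleaned_chunks = pool.map(clean_chunk, chunks)
    # Flatten the per-chunk results back into a single list.
    return [text for chunk in cleaned_chunks for text in chunk]
```

With the dataframe from the question, this would be called under an if __name__ == "__main__": guard, for example as data["qa"] = clean_parallel(data["qa"].tolist()). Whether it beats the vectorized version depends on the data size, since sending the strings to the workers and back has its own cost.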

Answer 2

As for threading, an answer to the question linked below has a Python 3.9 example:

#example from the page below by Xiddoc
from threading import Thread
from time import sleep

# Here is a function that uses the sleep() function. If you called this directly, it would stop the main Python execution
def my_independent_function():
    print("Starting to sleep...")
    sleep(10)
    print("Finished sleeping.")

# Make a new thread which will run this function
t = Thread(target=my_independent_function)
# Start it in parallel
t.start()

# You can see that we can still execute other code, while other function is running
for i in range(5):
    print(i)
    sleep(1)

(Taken from this question: Can I run a coroutine in python independently from all other code?)

And you probably shouldn't try using threading and multiprocessing simultaneously.

If you'd like to read more general information about multiprocessing/threading in Python, you can see this post: How can I use threading in Python?
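To make the threading side concrete, here is a minimal sketch with hypothetical helper names (clean_slice, clean_with_threads) and a plain list instead of the dataframe. Each thread handles its own slice of the data, rather than all threads running the same full loop as in the question. Threads share memory, so they can all write into one results list. On a standard CPython build, however, the GIL lets only one thread execute Python code at a time, so for CPU-bound work such as regex substitution this is not expected to be faster than a plain loop; threads mainly pay off for I/O-bound tasks.

```python
import re
from threading import Thread

PATTERN = re.compile(r"[^\w\s]")  # simplified pattern, for illustration only


def clean_slice(texts, results, start, stop):
    """Clean texts[start:stop] and store each result at the same index."""
    for i in range(start, stop):
        results[i] = PATTERN.sub("", texts[i])


def clean_with_threads(texts, n_threads=4):
    """Split the index range across threads; each thread writes to its own slots."""
    results = [None] * len(texts)
    step = max(1, -(-len(texts) // n_threads))  # ceiling division
    threads = [
        Thread(
            target=clean_slice,
            args=(texts, results, start, min(start + step, len(texts))),
        )
        for start in range(0, len(texts), step)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every slice is done
    return results


print(clean_with_threads(["a, b!", "c?", "d."], n_threads=2))
```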

Answer 1
2 2022-01-14 12:00:43

Answer 2
0 2022-01-14 12:13:53
