Python Multithreading Execute Threads In Order with Pandas Dataframe
Is there a way to execute threads in a specific order? I am familiar with the methods ".wait()" and ".notifyAll()", but they do not seem to work when all the threads target a single function. The code below should append to the csv file in this order: df1, df2, df3, df4.
```
import threading
import pandas as pd

df1 = pd.DataFrame(columns=["col1", "col2", "col3"])
df2 = pd.DataFrame(columns=["col1", "col2", "col3"])
df3 = pd.DataFrame(columns=["col1", "col2", "col3"])
df4 = pd.DataFrame(columns=["col1", "col2", "col3"])

def function(df):
    # webscraping, compile web data to dataframe
    df.to_csv('output.csv', mode='a')

if __name__ == '__main__':
    t1 = threading.Thread(target=function, args=(df1,))
    t2 = threading.Thread(target=function, args=(df2,))
    t3 = threading.Thread(target=function, args=(df3,))
    t4 = threading.Thread(target=function, args=(df4,))
    t1.start()
    t2.start()
    t3.start()
    t4.start()
```
I want all dataframes to wait inside "function()" until they can execute in order. With multithreading, threads tend to "race each other" and can finish out of order. Although multithreading is a good performance-enhancing tool, its downfall comes into play when order matters.
A simple example: if thread 4 finishes compiling its dataframe first, it needs to wait until the first 3 threads have compiled their corresponding dataframes and uploaded them to the csv file before thread 4 can upload. As always, thanks in advance!!
To solve your problem in a clean way, you probably want to be using concurrent.futures instead of threading; hopefully you're on Python 3.2+.

To do so, create a list of your arguments to the function in the order you need them written, arglist = [df1, df2, ...], and then do something like:
```
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=len(arglist)) as ex:
    results = ex.map(function, arglist)
    for res in results:
        res.to_csv(..., mode='a')
```
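A complete sketch of this approach (assuming `function` is changed to return its dataframe instead of writing it, and using dummy frames in place of the real scraped data): `ex.map` yields results in the order of `arglist`, no matter which thread finishes first, so the main thread can write them in order.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def function(df):
    # ...webscraping would fill df here...
    return df  # return the frame instead of writing, so the caller controls order

# dummy frames standing in for df1..df4
arglist = [pd.DataFrame({"col1": [i], "col2": [i], "col3": [i]}) for i in range(4)]

if os.path.exists("output.csv"):
    os.remove("output.csv")  # start fresh for this demo

with ThreadPoolExecutor(max_workers=len(arglist)) as ex:
    # map() yields results in arglist order, regardless of completion order
    for i, res in enumerate(ex.map(function, arglist)):
        res.to_csv("output.csv", mode="a", header=(i == 0), index=False)
```
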
To be honest, you should really try to use concurrent.futures for everything related to threading or multiprocessing.

It appears I read the question wrong the first time. I'll leave my previous answer here for people who find it via Google.
You can use a lock (see https://docs.python.org/3/library/threading.html#lock-objects ): call lock.acquire() before writing to the csv and lock.release() afterwards. This will do exactly what you want.
In my opinion, though, this is not ideal; instead I would suggest returning the dataframes from each thread and just writing them all at the end.
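A minimal sketch of that alternative with plain threading (dummy frames and a hypothetical `worker` standing in for the real scraping function): each thread stores its result in its own slot of a shared list, and the main thread writes everything after join(), so the output order matches df1..df4 regardless of which thread finished first.

```python
import threading

import pandas as pd

# dummy frames standing in for df1..df4
frames = [pd.DataFrame({"col1": [i]}) for i in range(4)]
results = [None] * len(frames)

def worker(i, df):
    # ...webscraping would fill df here...
    results[i] = df  # each thread writes only its own slot, so no lock is needed

threads = [threading.Thread(target=worker, args=(i, f)) for i, f in enumerate(frames)]
for t in threads:
    t.start()
for t in threads:
    t.join()

combined = pd.concat(results)  # already in df1..df4 order
combined.to_csv("combined.csv", index=False)
```
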
Your code would simply look like:

```
lock = threading.Lock()

def function(args):
    # web stuff
    with lock:
        df.to_csv(...)
```
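A runnable version of that snippet, with dummy frames and a hypothetical filename. Note the caveat above: the lock serializes the file writes so they cannot interleave, but it does not by itself guarantee the df1..df4 ordering.

```python
import os
import threading

import pandas as pd

lock = threading.Lock()

def function(df):
    # ...webscraping would fill df here...
    with lock:  # only one thread may append to the file at a time
        df.to_csv("locked_output.csv", mode="a", header=False, index=False)

if os.path.exists("locked_output.csv"):
    os.remove("locked_output.csv")  # start fresh for this demo

# dummy frames standing in for df1..df4
frames = [pd.DataFrame({"col1": [i]}) for i in range(4)]
threads = [threading.Thread(target=function, args=(f,)) for f in frames]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
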