
Python Multithreading: Execute Threads In Order with Pandas DataFrame

Is there a way to execute threads in a specific order? I am familiar with the commands ".wait()" and ".notifyAll()", but they do not seem to work when all the threads target a single function. The code below should write to the csv file in this order: df1, df2, df3, df4.


import threading
import pandas as pd


df1 = pd.DataFrame(columns=["col1","col2","col3"])
df2 = pd.DataFrame(columns=["col1","col2","col3"])
df3 = pd.DataFrame(columns=["col1","col2","col3"])
df4 = pd.DataFrame(columns=["col1","col2","col3"])


def function(df):
    # webscraping: compile web data into the dataframe
    df.to_csv('output.csv', mode='a')


if __name__ == '__main__':
    t1 = threading.Thread(target=function, args=(df1,))
    t2 = threading.Thread(target=function, args=(df2,))
    t3 = threading.Thread(target=function, args=(df3,))
    t4 = threading.Thread(target=function, args=(df4,))
    t1.start()
    t2.start()
    t3.start()
    t4.start()

I want all dataframes to wait inside "function()" until they can execute in order. With multithreading, threads "race each other" and can finish out of order. Although multithreading is a good performance-enhancing tool, its downside shows when order matters.

Simple example: if thread 4 finishes compiling its dataframe, it must wait until the first three threads have compiled their corresponding dataframes and uploaded them to the csv file before thread 4 can upload its own.

As always, thanks in advance!

To solve your problem in a clean way, you probably want to use concurrent.futures instead of threading; hopefully you're on Python 3.2+.

To do so, create a list of the arguments to the function in the order you need them written, arglist = [df1, df2, ...], and then do something like:

from concurrent.futures import ThreadPoolExecutor

# note: for this to work, function() must return the dataframe
# rather than writing it to the csv itself
with ThreadPoolExecutor(max_workers=len(arglist)) as ex:
    results = ex.map(function, arglist)
for res in results:
    res.to_csv(..., mode='a')
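A complete, runnable sketch of this approach (with the web-scraping step replaced by a placeholder `build_frame` that constructs each dataframe from a seed value; names and inputs here are illustrative, not from the original code) might look like:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def build_frame(seed):
    # placeholder for the web-scraping step: build and return a dataframe
    return pd.DataFrame({"col1": [seed], "col2": [seed * 2], "col3": [seed * 3]})


arglist = [1, 2, 3, 4]  # hypothetical inputs, one per dataframe

with ThreadPoolExecutor(max_workers=len(arglist)) as ex:
    # map() yields results in submission order, regardless of which
    # thread finishes first, so the writes below happen in order
    results = list(ex.map(build_frame, arglist))

for res in results:
    res.to_csv("output.csv", mode="a", index=False)
```

The key point is that `Executor.map` preserves input order in its output, so the scraping runs in parallel while the serial write loop stays deterministic.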

To be honest, you should really try to use concurrent.futures for everything related to threading or multiprocessing.

It appears I read the question wrong the first time. I'll leave my previous answer here for people who find this through Google.

You can use a lock (see https://docs.python.org/3/library/threading.html#lock-objects ), calling lock.acquire() before writing to the csv and lock.release() afterwards. This will do exactly what you want.

In my opinion, though, this is not ideal; I would instead suggest returning the dataframes from each thread and just writing them all at the end.
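With plain threading, that idea can be sketched as follows (the scraping step is again a placeholder, and the slot-per-thread list is an illustrative assumption): each thread fills its own slot in a shared list, the main thread joins them all, and then writes the results serially in order.

```python
import threading

import pandas as pd


def build_frame(seed, results, idx):
    # placeholder for the scraping step; store the frame in its ordered slot
    results[idx] = pd.DataFrame({"col1": [seed], "col2": [seed], "col3": [seed]})


seeds = [1, 2, 3, 4]
results = [None] * len(seeds)
threads = [threading.Thread(target=build_frame, args=(s, results, i))
           for i, s in enumerate(seeds)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# all scraping ran in parallel; the writes happen serially, in order
for df in results:
    df.to_csv("output.csv", mode="a", index=False)
```

Because each thread writes only to its own list index, no lock is needed, and the final loop controls the csv order completely.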

Your code would simply look like:

lock = threading.Lock()

def function(df):
    # web stuff
    with lock:
        df.to_csv(...)
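Note that a bare lock only makes the writes safe, not ordered. If the writes really must land in a fixed order while the scraping runs in parallel, one option (a sketch, not from the original answer) is to chain threading.Event objects so each thread waits for its predecessor before writing:

```python
import threading

import pandas as pd


def function(df, my_turn, next_turn):
    # ... scraping would happen here, in parallel ...
    my_turn.wait()           # block until the previous thread has written
    df.to_csv("output.csv", mode="a")
    next_turn.set()          # allow the next thread to write


dfs = [pd.DataFrame(columns=["col1", "col2", "col3"]) for _ in range(4)]
events = [threading.Event() for _ in range(len(dfs) + 1)]
events[0].set()              # the first thread may write immediately

threads = [threading.Thread(target=function, args=(df, events[i], events[i + 1]))
           for i, df in enumerate(dfs)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This keeps the scraping concurrent but serializes only the csv writes, in thread-creation order, which is essentially the ".wait()"-style behavior the question asked about.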
