
How to integrate tqdm with python multiprocessing

I am creating a new Python class in which I am trying to integrate multiprocessing with tqdm to show progress. I am going down this path because I am opening very large (>1GB) time series data files, loading them into pandas, doing a groupby, and then saving them in parquet format. Each data file can take minutes to process and save. Multiprocessing speeds up the process immensely. However, I currently have no visibility into the progress, so I am trying to integrate tqdm.

The code below illustrates a simple example. In this code tqdm just shows how long it takes for the tasks to be submitted to the pool; it does not update as the actual work completes.

```python
import time
import multiprocessing
from tqdm import tqdm


class test_multiprocessing(object):

    def __init__(self, *args, **kwargs):
        self.list_of_results = []
        self.items = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    def run_test(self):
        print('Starting test')

        for i in range(1, 5):
            print(f'working on var1: {i}')

            p = multiprocessing.Pool()

            # tqdm wraps the submission loop, and apply_async returns immediately,
            # so the bar only measures how quickly the tasks are queued.
            for j in tqdm(self.items, desc='Items', unit='items', disable=False):
                variable3 = 3.14159
                p.apply_async(self.worker, [i, j, variable3], callback=self.update)

            p.close()
            p.join()
            print(f'completed i = {i}')
            print('')

    def worker(self, var1, var2, var3):
        result = var1 * var2 * var3
        time.sleep(2)
        return result

    def update(self, result_to_save):
        # Runs in the parent process each time a task finishes
        self.list_of_results.append(result_to_save)


if __name__ == '__main__':
    test1 = test_multiprocessing()
    test1.run_test()
```

In this example the progress bar shows the work as complete almost immediately, but in reality it takes several seconds.
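One way to make the bar follow the real work while staying with `multiprocessing` might be to create the bar by hand and advance it from the `apply_async` callback, since callbacks fire in the parent process as each task finishes. The sketch below is only an illustration under that assumption; `slow_product`, `run_with_progress`, and the `pool_factory` parameter are hypothetical names that do not appear in the code above.

```python
import multiprocessing
import time

from tqdm import tqdm


def slow_product(var1, var2, var3):
    """Hypothetical stand-in for the real per-file work."""
    time.sleep(0.1)
    return var1 * var2 * var3


def run_with_progress(func, arg_tuples, pool_factory=multiprocessing.Pool):
    """Submit every task, advancing the bar as each one finishes.

    pool_factory is an assumption of this sketch: any callable returning an
    object with the Pool API (apply_async, close, join).
    """
    results = []
    with tqdm(total=len(arg_tuples), desc='Items', unit='items') as pbar:

        def on_done(result):
            # Called in the parent process once per successfully finished task.
            results.append(result)
            pbar.update(1)

        pool = pool_factory()
        for args in arg_tuples:
            pool.apply_async(func, args, callback=on_done)
        pool.close()
        pool.join()  # returns after all tasks and their callbacks have run
    return results
```

Called as `run_with_progress(slow_product, [(2, j, 3.14159) for j in range(11)])` under an `if __name__ == '__main__':` guard, the bar should advance once per finished task. Results arrive in completion order, not submission order, and a task that raises never triggers `callback`, so passing an `error_callback` as well would be needed to keep the count accurate in that case.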

I found a great solution to the problem by using concurrent.futures instead of multiprocessing. Dan Shiebler wrote a good blog post on this with a good example: http://danshiebler.com/2016-09-14-parallel-progress-bar/

An implementation of this strategy is shown below; it solves the problem I posed earlier.

```python
import time

from tqdm import tqdm
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


class test_multiprocessing(object):

    def __init__(self, *args, **kwargs):
        self.list_of_results = []
        self.items = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    def run_test(self):
        print('Starting test')

        for i in range(1, 5):
            print(f'working on var1: {i}')

            # Bundle the arguments for each task into a single list
            variable_list = []
            for j in self.items:
                variable3 = 3.14159
                variables = [i, j, variable3]
                variable_list.append(variables)

            with ThreadPoolExecutor(max_workers=1000) as pool:   # with ProcessPoolExecutor(max_workers=n_jobs) as pool:
                futures = [pool.submit(self.worker, a) for a in variable_list]
                kwargs = {
                    'total': len(futures),
                    'unit': 'it',
                    'unit_scale': True,
                    'leave': True
                }

                # Print out the progress as tasks complete
                for f in tqdm(as_completed(futures), **kwargs):
                    pass

            # Get the results from the futures, in submission order.
            # Every future has finished by now, so no second bar is needed.
            for future in futures:
                try:
                    self.update(future.result())
                except Exception as e:
                    print(f'We have an error: {e}')

    def worker(self, variables):
        result = variables[0] * variables[1] * variables[2]
        time.sleep(2)
        return result

    def update(self, result_to_save):
        self.list_of_results.append(result_to_save)


if __name__ == '__main__':
    test1 = test_multiprocessing()
    test1.run_test()
```
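The code above runs on threads, which suits the `time.sleep` stand-in. For the CPU-heavy pandas work described in the question, the commented-out `ProcessPoolExecutor` is probably the better fit. A minimal sketch of that variant follows, assuming the worker can be a module-level function (which avoids pickling the whole class instance for each task). The names `scaled_product`, `parallel_progress`, and `executor_cls` are hypothetical, and the sketch collects results in the same loop that drives the bar.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed

from tqdm import tqdm


def scaled_product(variables):
    """Hypothetical module-level worker, so child processes can import it."""
    var1, var2, var3 = variables
    time.sleep(0.1)
    return var1 * var2 * var3


def parallel_progress(func, arg_list, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run func over arg_list with a progress bar; results keep input order."""
    results = [None] * len(arg_list)
    with executor_cls(max_workers=max_workers) as pool:
        # Remember which input position each future belongs to.
        future_to_index = {pool.submit(func, a): n for n, a in enumerate(arg_list)}
        # as_completed yields each future as it finishes, so the bar tracks real work.
        for future in tqdm(as_completed(future_to_index), total=len(arg_list), unit='it'):
            n = future_to_index[future]
            try:
                results[n] = future.result()
            except Exception as exc:
                # A failed task leaves None in its slot.
                print(f'Task {n} failed: {exc}')
    return results
```

Calling `parallel_progress(scaled_product, [[2, j, 3.14159] for j in range(11)])` under an `if __name__ == '__main__':` guard would use one process per CPU by default; passing `executor_cls=ThreadPoolExecutor` keeps the thread-based behaviour of the code above.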
