
Python - Running function concurrently (multiple instance)

I built a little function that will gather some data using a 3rd party API. Call it MyFunc(Symbol, Field); it will return some info based on the symbol given.

The idea was to fill a Pandas df with the returned value using something like:

df['MyNewField'] = df.apply(lambda x: MyFunc(x, 'FieldName'))

All this works, BUT each query takes around 100 ms to run. This seems fast until you realize you may have 30,000 or more to do (3,000 symbols with 10 fields each, for starters).

I was wondering if there is a way to run this concurrently, since each request is independent? I am not looking for multiprocessor-style libraries, but rather a way to issue multiple queries to the 3rd party at the same time to reduce the time taken to gather all the data. (Also, I suppose this will change the initial structure used to store all the received data - I do not mind dropping apply and my dataframe at first, and instead saving the data as it arrives in a text or dictionary-type structure.)

NOTE: While I wish I could change MyFunc to request multiple symbols/fields at once, this cannot be done in all cases (meaning some fields do not allow it, and a single request is the only way to go). This is why I am looking at concurrent execution rather than at changing MyFunc.

Thanks!

There are many libraries to parallelize a pandas dataframe. However, I prefer the native multiprocessing Pool for this. I also use tqdm alongside it to track progress.

import pandas as pd
from multiprocessing import cpu_count, Pool

cores = cpu_count()  # number of CPU cores on your system
partitions = cores   # define as many partitions as you want

def partition(data, num_partitions):
    partition_len = int(len(data)/num_partitions)
    partitions = []

    num_rows = 0
    for i in range(num_partitions-1):
        partition = data.iloc[i*partition_len:i*partition_len+partition_len]
        num_rows = num_rows + partition_len
        partitions.append(partition)

    partitions.append(data.iloc[num_rows:len(data)])
    return partitions
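As a side note, the hand-rolled splitting above can also be done with numpy's np.array_split, which also accepts a pandas DataFrame; a minimal sketch on a plain array:

```python
import numpy as np

data = np.arange(10)
# array_split tolerates uneven splits (unlike np.split) and spreads the
# leftover elements across the first chunks: sizes 3, 3, 2, 2 here
parts = np.array_split(data, 4)
```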

def parallelize(data, func):
    data_split = partition(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

# MyFunc takes a single symbol, while Pool.map hands func a whole
# partition, so wrap MyFunc to apply it across one partition at a time
def apply_myfunc(symbols):
    return symbols.apply(lambda s: MyFunc(s, 'FieldName'))

df['MyNewField'] = parallelize(df['Symbol'], apply_myfunc)
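Since each query is network-bound rather than CPU-bound, a thread pool is often a better fit than processes here: threads share memory, avoid pickling, and can keep many HTTP requests in flight at once. A minimal sketch, with fetch_field as a hypothetical stand-in for the real MyFunc:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_field(symbol, field):
    # stand-in for MyFunc: the real version would call the 3rd-party API
    return f"{symbol}:{field}"

symbols = ["AAPL", "MSFT", "GOOG"]

# executor.map preserves input order, so results line up with `symbols`
with ThreadPoolExecutor(max_workers=20) as pool:
    values = list(pool.map(lambda s: fetch_field(s, "FieldName"), symbols))

# the ordered list can then be assigned straight back to the dataframe:
# df['MyNewField'] = values
```

With 100 ms per call and 20 workers, 30,000 requests drop from roughly 50 minutes to a few minutes, assuming the 3rd party tolerates that level of concurrency (check its rate limits first).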
