
Parallel processing for apply function

I have a dataframe of 100,000 records, so I tried parallel processing using the joblib library, which works fine with my code below. But my question is: can I do the same with 'apply' and a 'lambda' function, which would stay very close to my original code with minimal changes, instead of using the for loop as in my code? Please help.

Original code - without parallel processing:

df['b1'] = df.text1.apply(lambda x: removeNumbers(x))

With parallel processing:

To apply joblib's parallel processing, I converted it to the for loop below:

df['b1'] = Parallel(n_jobs = -1)(delayed(removeNumbers)(x) for x in df.text1)
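One possible middle ground (a sketch, not the poster's code): keep using apply, but hand joblib whole chunks of the Series rather than one row per task, so the per-task dispatch overhead is paid once per chunk instead of 100,000 times. The removeNumbers stub, the chunk count, and prefer="threads" below are illustrative assumptions:

```python
import re
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def removeNumbers(text):
    # Illustrative stand-in for the poster's function: strip all digits.
    return re.sub(r"\d+", "", text)

def apply_chunk(chunk):
    # Each worker runs the ordinary pandas apply on its own chunk.
    return chunk.apply(removeNumbers)

df = pd.DataFrame({"text1": ["abc123", "42def", "no digits"] * 4})

# Split the Series into a few chunks, apply them in parallel, and
# concatenate the per-chunk results back together (index order is kept).
chunks = np.array_split(df.text1, 4)
df["b1"] = pd.concat(Parallel(n_jobs=2, prefer="threads")(
    delayed(apply_chunk)(c) for c in chunks
))
print(df["b1"].tolist())
```

prefer="threads" keeps this tiny sketch cheap to run; for a genuinely CPU-bound function, dropping it to use joblib's default process-based backend is usually the better choice.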

I have the following code, which I use when I have a large dataframe and want to use parallel computing:

import numpy as np
import pandas as pd
import time
from multiprocessing import Pool, cpu_count
from functools import partial

# Wrapper to time functions (not needed for parallel computing but to show that it works...)
def time_function(func):
    def decorated_func(*args, **kwargs):
        start = time.perf_counter_ns()
        ret = func(*args, **kwargs)
        stop = time.perf_counter_ns()
        temp = []
        temp += [type(a) for a in args]
        f = lambda x: f"{x}={type(kwargs[x])}"
        temp += list(map(f, kwargs))
        print(f"Function {func.__name__}{*temp,}: time elapsed: {(stop - start)*1e-6:.3f} [ms]")
        return ret
    return decorated_func

# This function splits the data and calls the functions.
def parallelize(data, func, num_of_processes=cpu_count()):
    data_split = np.array_split(data, num_of_processes)
    p = Pool(num_of_processes)
    data = pd.concat(p.map(func, data_split))
    p.close()
    p.join()
    return data

# This function is only used for pandas (otherwise the parallelize function was enough)
def run_on_subset(func, data_subset):
    return data_subset.apply(func, axis=1)

# This function is maybe redundant, but it keeps the code readable.
def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)

def sum_two_columns(row):
    time.sleep(0.1) # Make it a time consuming function
    return row[0] + row[1]

@time_function
def ordinary_apply(df):
    return df.apply(sum_two_columns, axis=1)

@time_function
def parallel_apply(df):
    return parallelize_on_rows(df, sum_two_columns)

if __name__ == '__main__':
    array = np.ones((100, 3))
    df = pd.DataFrame(array)
    print(f"cpu_count: {cpu_count()}")
    ordinary_apply(df)
    parallel_apply(df)
    print('done')

>>> cpu_count: 12
>>> Function ordinary_apply(<class 'pandas.core.frame.DataFrame'>,): time elapsed: 10860.275 [ms]
>>> Function parallel_apply(<class 'pandas.core.frame.DataFrame'>,): time elapsed: 4520.432 [ms]
>>> done

EDIT:

When many rows contain the same values, you can also cache the output of your function. If it is an expensive function, this is another way to reduce the time it takes to process your dataframe.

https://docs.python.org/3/library/functools.html#functools.lru_cache
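A minimal sketch of that caching idea (the expensive_clean function is illustrative, not from the original post):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_clean(text):
    # Imagine heavy processing here; repeated inputs hit the cache
    # instead of being recomputed.
    return text.strip().lower()

values = ["Hello ", "Hello ", "World", "Hello "]
results = [expensive_clean(v) for v in values]
print(results)                            # ['hello', 'hello', 'world', 'hello']
print(expensive_clean.cache_info().hits)  # 2 hits for the repeated inputs
```

Note that lru_cache requires hashable arguments, and with multiprocessing each worker process keeps its own separate cache, so the benefit shows up per worker.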
