Run a function with positional and optional arguments in parallel in Python

I'm trying to compute various metrics on a Pandas DataFrame using the apply method. Since the DataFrame I'm working with is quite big (1 million rows x 20 columns), I decided to parallelize the computation.

In order to reproduce the issue I'm having, I'm going to use the iris dataset. Here are the steps:

# Step 1: Import all required modules + load iris dataset to Pandas DataFrame

import pandas as pd
import numpy as np
import seaborn as sns
from multiprocessing import Pool

iris = pd.DataFrame(sns.load_dataset('iris'))

# Step 2: Define function that adds some metric to initial iris DataFrame

def add_metrics(data):
    data['x_1'] = data['species'].apply(lambda x: len(x))
    return data

# Step 3: Define parallelization function

num_partitions = 10 # number of partitions to split dataframe
num_cores = 4 

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

# Step 4: Add metrics to initial iris DataFrame using parallelization function

iris = parallelize_dataframe(iris, add_metrics)

The above process works perfectly well as it is, but I want to be able to pass additional positional and/or optional arguments to my add_metrics function. For example, my add_metrics function might look like the following:

def add_metrics(data, num, keep = False):
    data['x_1'] = data['species'].apply(lambda x: len(x))
    data['x_2'] = data['sepal_length'].apply(lambda x: x * num)
    if keep == True:
        data['x_3'] = data['sepal_width'].apply(lambda x: x * num)
    return data

Now, no matter how I try to call the parallelize_dataframe function, I get an error. For example:

iris = parallelize_dataframe(iris, add_metrics(iris, 2, keep = True)) throws TypeError: 'DataFrame' object is not callable.

I'm fairly new to Python, so I don't know what is going wrong here or how to fix it. I know the example I chose does not require parallel processing, as the iris dataset only contains 150 observations. I used it to make the problem easy to reproduce.

Any help would be appreciated.

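The error occurs because add_metrics(iris, 2, keep = True) is called immediately and returns a DataFrame, so pool.map receives a DataFrame where it expects a function. You can use functools.partial to fix the extra arguments of your function before passing it to map.

import functools

def add(x, y):
    return x + y

a = [1, 2, 3]
# partial fixes y=2, so map only has to supply x; list() consumes the lazy map object
list(map(functools.partial(add, y=2), a))  # [3, 4, 5]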
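Applied to the question's code, the same idea might look like the sketch below. It is not a definitive implementation and rests on a few assumptions: the small DataFrame is made-up stand-in data with the iris column names (so seaborn is not needed), num_partitions and num_cores are turned into parameters, and the chunks are built by slicing row positions with iloc rather than by calling np.array_split on the DataFrame directly.

```python
import functools
from multiprocessing import Pool

import numpy as np
import pandas as pd


def add_metrics(data, num, keep=False):
    # Work on a copy so the chunk handed over by the pool is not modified in place.
    data = data.copy()
    data['x_1'] = data['species'].apply(lambda x: len(x))
    data['x_2'] = data['sepal_length'].apply(lambda x: x * num)
    if keep:
        data['x_3'] = data['sepal_width'].apply(lambda x: x * num)
    return data


def parallelize_dataframe(df, func, num_partitions=10, num_cores=4):
    # Assumption: split the row positions, then slice each chunk with iloc.
    positions = np.array_split(np.arange(len(df)), num_partitions)
    chunks = [df.iloc[idx] for idx in positions]
    with Pool(num_cores) as pool:
        # func must accept exactly one argument: a chunk of the DataFrame.
        return pd.concat(pool.map(func, chunks))


if __name__ == '__main__':
    # Made-up stand-in for the iris data (same column names, arbitrary values).
    df = pd.DataFrame({
        'sepal_length': [5.1, 4.9, 7.0, 6.3],
        'sepal_width': [3.5, 3.0, 3.2, 3.3],
        'species': ['setosa', 'setosa', 'versicolor', 'virginica'],
    })
    # Pass the function itself with num and keep pre-filled,
    # not the result of calling it.
    func = functools.partial(add_metrics, num=2, keep=True)
    result = parallelize_dataframe(df, func, num_partitions=2, num_cores=2)
    print(result)
```

A lambda such as lambda chunk: add_metrics(chunk, 2, keep=True) would not work here, because Pool.map has to pickle the function to send it to the worker processes and the standard pickle module cannot pickle lambdas. A partial of a module-level function can be pickled, which is why it suits this case.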


 