
Dask: applying custom function to DataFrame gets error

I'd like to speed up my DataFrame manipulations and decided to use the dask library for this, but I can't get it to work. I have made a test example that shows my problem:

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get

def testfunc(good):
  return good*good

df = pd.DataFrame({'a' : [1,2,3], 'b' : [4,5,6], 'c' : [7,8,9]})
ddata = dd.from_pandas(df, npartitions=2)

df1 = ddata.map_partitions(lambda df: df.apply((lambda row: testfunc(*row)), axis=1)).compute(get=get)

But when I run this code I get an error: TypeError: testfunc() takes 1 positional argument but 3 were given. Could you explain what is wrong with my code?

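This will work with a minor change. You're currently unpacking the row object with the asterisk, so the three column values (a, b, c) reach testfunc as three separate positional arguments, while testfunc accepts only one. You probably want to pass the row directly, as is.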

import numpy as np
import pandas as pd
import dask.dataframe as dd

def testfunc(good):
    return good*good

df = pd.DataFrame({'a' : [1,2,3], 'b' : [4,5,6], 'c' : [7,8,9]})
ddata = dd.from_pandas(df, npartitions=2)

# Pass each row to testfunc as a single object (no asterisk, so no unpacking)
df1 = ddata.map_partitions(lambda df: df.apply((lambda row: testfunc(row)), axis=1)).compute()
print(df1)
   a   b   c
0  1  16  49
1  4  25  64
2  9  36  81

For more information, you might want to check out the Expressions section of the Python docs, which covers unpacking arguments with the asterisk in function calls.

