简体   繁体   中英

apply a lambda function to a dask dataframe

I am looking to apply a lambda function to a dask dataframe to change the lables in a column if its less than a certain percentage. The method that I am using works well for a pandas dataframe but the same code does not work for dask a dataframe. The code is below.

df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog','roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

df:

output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   cherry cat   roo
3   bee    cat   emu
4   ant    peach emu
ddf.compute()

output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   cherry cat   roo
3   bee    cat   emu
4   ant    peach emu
list_ = ['B','C']
df.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x)

output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   other  cat   roo
3   other  cat   emu
4   ant    peach emu

Do the same for dask dataframe:

ddf.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x,axis=1).compute()

output(gives warning and not the output required):

/home/michael/env/lib/python3.5/site-packages/dask/dataframe/core.py:3107: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
      A       B       C
0   other   other   other
1   other   other   other
2   other   other   other
3   other   other   other
4   other   other   other

Could someone be able to help me out to get the required output for the dask dataframe instance.

Thanks

Michael

You are not performing the same thing in the pandas and dask cases: for the latter you have axis=1 , so you end up replacing any value which occurs less than twice in a given row , which is all of them.

If you change to axis=0 , you will see that you get an exception. This is because to compute, say, the first partition, you would need the whole dataframe also to be passed to the lambda function - else how could you get the value_counts?

The solution to your problem would be to get the value counts separately. You could explicitly compute this (the result is small) or pass it to the lambda. Note furthermore that going this path means you can avoid using apply in favour of map and making things more explicit. Here I am exclusively picking the one column, you could loop.

vc = ddf.A.value_counts().compute()
vc /= vc.sum()  # because dask's value_count doesn't normalise

def simple_map(df):
    df['A'] = df['A'].map(lambda x: x if vc[x] > 0.5 else 'other')
    return df

ddf.map_partitions(simple_map, meta=df[:0]).compute()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM