apply a lambda function to a dask dataframe

Question

I want to apply a lambda function to a dask dataframe to change the labels in a column when a label's share of that column is below a certain percentage. The method I am using works well for a pandas dataframe, but the same code does not work for a dask dataframe. The code is below.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog','roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

df:

output:

        A      B    C
0     ant    cat  dog
1     ant  peach  dog
2  cherry    cat  roo
3     bee    cat  emu
4     ant  peach  emu

ddf.compute()

output:

        A      B    C
0     ant    cat  dog
1     ant  peach  dog
2  cherry    cat  roo
3     bee    cat  emu
4     ant  peach  emu

list_ = ['B','C']
df.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x)

output:

       A      B    C
0    ant    cat  dog
1    ant  peach  dog
2  other    cat  roo
3  other    cat  emu
4    ant  peach  emu

Do the same for dask dataframe:

ddf.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x,axis=1).compute()

output (gives a warning and not the required output):

/home/michael/env/lib/python3.5/site-packages/dask/dataframe/core.py:3107: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
      A       B       C
0   other   other   other
1   other   other   other
2   other   other   other
3   other   other   other
4   other   other   other

Could someone help me get the required output for the dask dataframe?

Thanks

Michael

Answer 1

You are not doing the same thing in the pandas and dask cases: in the latter you pass axis=1, so the function is applied to each row rather than each column. You end up replacing any value which occurs fewer than twice in a given row, which is all of them.
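To see the effect of axis=1 without dask involved, here is a small pandas-only sketch using the data and lambda from the question. The names by_column and by_row are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
list_ = ['B', 'C']

func = lambda x: (x.mask(x.map(x.value_counts(normalize=True)) < .5, 'other')
                  if x.name not in list_ else x)

# Default axis=0: x is a column, x.name is the column label ('A', 'B', 'C').
by_column = df.apply(func)

# axis=1: x is a row, x.name is the row label (0..4), so the list_ check
# never matches and every value has a share of 1/3 within its own row.
by_row = df.apply(func, axis=1)

print(by_column)
print(by_row)
```

In the row-wise case every row holds three distinct values, so each one falls below the 0.5 threshold and is replaced.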

If you change to axis=0, you will see that you get an exception. This is because to compute, say, the first partition, the lambda function would also need to be given the whole dataframe; otherwise how could you get the value_counts?
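As a rough illustration of why a single partition is not enough, the sketch below mimics two partitions with plain pandas slices. The split after row 2 is an assumption made for the example, not necessarily how dask would divide the data, and relabel is a hypothetical helper.

```python
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})

def relabel(col):
    # Same logic as the lambda in the question, for a single column.
    freq = col.map(col.value_counts(normalize=True))
    return col.mask(freq < .5, 'other')

# Frequencies taken over the whole column.
global_result = relabel(df['A'])

# Frequencies taken separately inside each (assumed) partition.
parts = [df.iloc[:3], df.iloc[3:]]
per_partition = pd.concat([relabel(p['A']) for p in parts])

print(global_result.tolist())
print(per_partition.tolist())
```

In the second slice 'bee' makes up half of the rows, so it survives there even though it is rare in the full column. The two results disagree, which is why the counts have to come from the whole dataframe.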

The solution to your problem is to get the value counts separately. You could compute them explicitly (the result is small) or pass them to the lambda. Note also that going down this path means you can avoid apply in favour of map, which makes things more explicit. Here I pick just the one column; you could loop over several.

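vc = ddf.A.value_counts().compute()
vc /= vc.sum()  # normalise by hand (older dask versions of value_counts have no normalize option)

def simple_map(df):
    # keep labels whose share is at least 0.5, matching the pandas mask (< .5 is replaced)
    df = df.copy()  # avoid modifying the partition in place
    df['A'] = df['A'].map(lambda x: x if vc[x] >= 0.5 else 'other')
    return df

ddf.map_partitions(simple_map, meta=df[:0]).compute()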

solution1
2 ACCEPTED 2019-03-02 17:42:08
