I am looking to apply a lambda function to a dask dataframe to change the labels in a column if they occur less than a certain percentage of the time. The method I am using works well for a pandas dataframe, but the same code does not work for a dask dataframe. The code is below.
df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog','roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)
df:
output:
A B C
0 ant cat dog
1 ant peach dog
2 cherry cat roo
3 bee cat emu
4 ant peach emu
ddf.compute()
output:
A B C
0 ant cat dog
1 ant peach dog
2 cherry cat roo
3 bee cat emu
4 ant peach emu
list_ = ['B','C']
df.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x)
output:
A B C
0 ant cat dog
1 ant peach dog
2 other cat roo
3 other cat emu
4 ant peach emu
Doing the same for the dask dataframe:
ddf.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x,axis=1).compute()
output (gives a warning and not the required output):
/home/michael/env/lib/python3.5/site-packages/dask/dataframe/core.py:3107: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
A B C
0 other other other
1 other other other
2 other other other
3 other other other
4 other other other
Could someone help me out to get the required output for the dask dataframe instance?
Thanks,
Michael
You are not performing the same thing in the pandas and dask cases: for the latter you have axis=1, so you end up replacing any value which occurs fewer than twice in a given row, which is all of them.
If you change to axis=0, you will see that you get an exception. This is because to compute, say, the first partition, the whole dataframe would also need to be passed to the lambda function - else how could you get the value_counts?
The solution to your problem is to get the value counts separately. You could explicitly compute this (the result is small) or pass it to the lambda. Note furthermore that going this route means you can avoid using apply in favour of map, making things more explicit. Here I am exclusively picking the one column; you could loop.
vc = ddf.A.value_counts().compute()
vc /= vc.sum()  # because dask's value_counts doesn't normalise

def simple_map(df):
    # df here is a pandas DataFrame: one partition of ddf
    df['A'] = df['A'].map(lambda x: x if vc[x] > 0.5 else 'other')
    return df

ddf.map_partitions(simple_map, meta=df[:0]).compute()