Apply a lambda function to a dask dataframe
I am looking to apply a lambda function to a dask dataframe to change the labels in a column when a label accounts for less than a certain percentage of the rows. The method that I am using works well for a pandas dataframe, but the same code does not work for a dask dataframe. The code is below.
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog','roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)
df:
output:
A B C
0 ant cat dog
1 ant peach dog
2 cherry cat roo
3 bee cat emu
4 ant peach emu
ddf.compute()
output:
A B C
0 ant cat dog
1 ant peach dog
2 cherry cat roo
3 bee cat emu
4 ant peach emu
list_ = ['B','C']
df.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x)
output:
A B C
0 ant cat dog
1 ant peach dog
2 other cat roo
3 other cat emu
4 ant peach emu
Do the same for the dask dataframe:
ddf.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x,axis=1).compute()
output (gives a warning and not the required output):
/home/michael/env/lib/python3.5/site-packages/dask/dataframe/core.py:3107: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
A B C
0 other other other
1 other other other
2 other other other
3 other other other
4 other other other
Could someone help me get the required output for the dask dataframe?
Thanks
Michael
You are not performing the same operation in the pandas and dask cases: for the latter you have axis=1, so you end up replacing any value which occurs less than twice in a given row, which is all of them.
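The difference between the two axes can be reproduced in plain pandas. The sketch below is only an illustration: the helper name `relabel_rare` and its `threshold` parameter are made up here, not taken from the question. It applies the same masking logic once down column A and once across each row.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
    'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
    'C': ['dog', 'dog', 'roo', 'emu', 'emu'],
})

def relabel_rare(s, threshold=0.5):
    """Hypothetical helper: replace values whose share within `s` is below `threshold`."""
    share = s.map(s.value_counts(normalize=True))
    return s.mask(share < threshold, 'other')

# Column-wise: shares are computed down column A, as in the pandas call.
by_column = relabel_rare(df['A'])

# Row-wise (axis=1): every row holds three distinct values, so each share is 1/3
# and everything falls below the threshold.
by_row = df.apply(relabel_rare, axis=1)

print(by_column.tolist())
print(by_row)
```

With this data every row contains three different labels, which is why the row-wise call relabels the whole frame.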
If you change to axis=0, you will see that you get an exception. This is because to compute, say, the first partition, you would also need the whole dataframe to be passed to the lambda function; otherwise, how could you get the value_counts?
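To see why a single partition is not enough, here is a small sketch that does not use dask at all. It imitates two partitions with plain pandas slices (the split point is an assumption chosen for illustration, and `relabel_rare` is a hypothetical helper) and compares partition-local frequencies with frequencies over the whole column.

```python
import pandas as pd

col = pd.Series(['ant', 'ant', 'cherry', 'bee', 'ant'], name='A')

def relabel_rare(s, threshold=0.5):
    """Hypothetical helper: replace values whose share within `s` is below `threshold`."""
    share = s.map(s.value_counts(normalize=True))
    return s.mask(share < threshold, 'other')

# Shares computed over the whole column: 'bee' is 1 of 5 rows, so it is relabelled.
global_result = relabel_rare(col)

# Shares computed separately inside each simulated partition:
# in the second slice 'bee' is 1 of 2 rows, so it wrongly survives.
parts = [col.iloc[:3], col.iloc[3:]]
per_partition = pd.concat([relabel_rare(p) for p in parts])

print(global_result.tolist())
print(per_partition.tolist())
```

The two results disagree on the 'bee' row, which is the reason the counts have to come from the full column rather than from whatever rows a partition happens to hold.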
The solution to your problem is to get the value counts separately. You could explicitly compute this (the result is small) or pass it to the lambda. Note furthermore that going down this path means you can avoid using apply in favour of map, which makes things more explicit. Here I am picking just the one column; you could loop over the others.
vc = ddf.A.value_counts().compute()
vc = vc / vc.sum()  # normalise by hand: value_counts is used here without normalisation
def simple_map(df):
    # keep labels whose overall share is at least 0.5 (the question masks shares < .5)
    return df.assign(A=df['A'].map(lambda x: x if vc[x] >= 0.5 else 'other'))
ddf.map_partitions(simple_map, meta=df[:0]).compute()