Apply a lambda function to a dask dataframe

I am looking to apply a lambda function to a dask dataframe to change the labels in a column when a label's share of that column is less than a certain percentage. The method that I am using works well for a pandas dataframe, but the same code does not work for a dask dataframe. The code is below.

import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog','roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

df

Output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   cherry cat   roo
3   bee    cat   emu
4   ant    peach emu

ddf.compute()

Output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   cherry cat   roo
3   bee    cat   emu
4   ant    peach emu

list_ = ['B','C']  # columns to leave untouched
df.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x)

Output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   other  cat   roo
3   other  cat   emu
4   ant    peach emu

Do the same for the dask dataframe:

ddf.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x,axis=1).compute()

Output (gives a warning and not the required output):

/home/michael/env/lib/python3.5/site-packages/dask/dataframe/core.py:3107: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
      A       B       C
0   other   other   other
1   other   other   other
2   other   other   other
3   other   other   other
4   other   other   other

Could someone help me get the required output for the dask dataframe?

Thanks

Michael

You are not performing the same thing in the pandas and dask cases: for the latter you have axis=1, so the lambda receives one row at a time and you end up replacing any value which occurs less than twice in a given row, which is all of them.
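To make the row-wise effect concrete, here is a small pandas-only sketch that applies the question's lambda body to a single row, which is what axis=1 hands to the function. The names row, row_freq and masked are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})

# With axis=1 the lambda sees one row as a Series indexed by column name.
row = df.iloc[0]

# Each value occurs once among the three entries of the row,
# so every normalised frequency is 1/3, which is below 0.5.
row_freq = row.map(row.value_counts(normalize=True))

# Consequently the mask replaces every entry of the row.
masked = row.mask(row_freq < .5, 'other')
print(masked.tolist())
```

Note also that row.name is the row label (0 here), never 'B' or 'C', so the `x.name not in list_` check does not exclude anything when applied row-wise.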

If you change to axis=0, you will see that you get an exception. This is because to compute, say, the first partition, the lambda function would also need to be passed the whole dataframe; otherwise, how could it get the value_counts?
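The point about partitions can be illustrated without dask. The sketch below splits the pandas frame into two chunks to stand in for partitions; the split after row 3 is an assumption for illustration and not necessarily where dask places its partition boundary.

```python
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})

# Hypothetical partitioning: rows 0-2 and rows 3-4.
chunk1, chunk2 = df.iloc[:3], df.iloc[3:]

# Frequencies over the whole column versus over one chunk only.
global_freq = df['A'].value_counts(normalize=True)
local_freq = chunk2['A'].value_counts(normalize=True)

# 'bee' is 1 of 5 values globally, but 1 of 2 values inside the second chunk.
print(global_freq['bee'], local_freq['bee'])
```

A function that only sees the second chunk would treat 'bee' as frequent enough to keep, which is why the counts have to come from the full column.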

The solution to your problem would be to get the value counts separately. You could explicitly compute this (the result is small) or pass it to the lambda. Note also that going this route lets you drop apply in favour of map, which makes things more explicit. Here I am picking only one column; you could loop over several.

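# Global value counts for column A; the result is small, so compute it eagerly.
vc = ddf.A.value_counts().compute()
vc = vc / vc.sum()  # normalise manually to get fractions

def simple_map(part):
    # part is one pandas partition; vc holds the frequencies of the whole column
    part = part.copy()  # avoid mutating the input partition
    # keep values whose share is at least 0.5, matching the pandas mask (< .5)
    part['A'] = part['A'].map(lambda x: x if vc[x] >= 0.5 else 'other')
    return part

ddf.map_partitions(simple_map, meta=df[:0]).compute()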
