
Dask dataframe apply giving unexpected results when passing local variables as argument

When calling the apply method of a dask DataFrame inside a for loop, using the iterator variable as an argument to apply, I get unexpected results when the computation is evaluated later. This example shows the behavior:

import pandas as pd
import dask.dataframe as dd
import random
import numpy as np

df = pd.DataFrame({'col_1':random.sample(range(10000), 10000), 
                   'col_2': random.sample(range(10000), 10000) })
ddf = dd.from_pandas(df, npartitions=8)

def myfunc(x, channel):
    return channel

for ch in ['ch1','ch2']:
    ddf[f'df_apply_{ch}'] = ddf.apply(lambda row: myfunc(row,ch), axis=1, meta=(f'df_apply_{ch}', np.str_))

print(ddf.head(5))

From the row-wise application of myfunc I expect to see two additional columns, one filled with "ch1" and one with "ch2" on every row. However, this is the output of the script:

   col_1  col_2 df_apply_ch1 df_apply_ch2
0   5485   2234          ch2          ch2
1   6338   6802          ch2          ch2
2   9408   5760          ch2          ch2
3   8447   1451          ch2          ch2
4   1230   3838          ch2          ch2

Apparently, the final iteration of the loop overwrote the first argument to apply. In fact, any later change to ch between the loop and the call to head affects the result in the same way, overwriting what I expect to see in both columns.
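The same late-binding behavior can be reproduced in plain Python, without dask or pandas at all. A minimal sketch (the names funcs and results are just illustrative):

```python
# Each lambda created in the loop captures the *name* ch,
# not the value ch had when the lambda was defined.
funcs = []
for ch in ['ch1', 'ch2']:
    funcs.append(lambda: ch)

# By the time the lambdas actually run, the loop has finished
# and ch is bound to its final value, 'ch2'.
results = [f() for f in funcs]
print(results)  # ['ch2', 'ch2']
```

Dask hits this because it only builds a task graph in the loop and evaluates the lambdas later, at head time, exactly like the deferred calls here.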

This is not what one sees when doing the same exercise with pure pandas. And I found a work-around for dask as well:

def myapply(ddf, ch):
    ddf[f'myapply_{ch}'] = ddf.apply(lambda row: myfunc(row,ch), axis=1, meta=(f'myapply_{ch}', np.str_))

for ch in ['ch1','ch2']:
    myapply(ddf, ch)

print(ddf.head(10))

gives:

   col_1  col_2 myapply_ch1 myapply_ch2
0   7394   3528         ch1         ch2
1   2181   6681         ch1         ch2
2   7945   1063         ch1         ch2
3   5164   8091         ch1         ch2
4   3569   2889         ch1         ch2

So I see that this has to do with the scope of the variable used as an argument to apply, but I don't understand why exactly this happens with dask (and only with dask). Is this the intended/expected behavior?

Any insights would be appreciated! :)

This turns out to be a duplicate after all; see this question on Stack Overflow, which includes another work-around. A more detailed explanation of the behavior can be found in the corresponding issue on the dask tracker:

This isn't a bug, this is just how Python works. Closures evaluate based on the defining scope; if you change the value of trig in that scope, then the closure will evaluate differently. The issue here is that this code would run fine in pandas, since there is an evaluation in each loop iteration, but in dask all the evaluations are delayed until later, and thus all use the same value for trig.

Here, trig is the loop variable used in that discussion.
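One common way to freeze the per-iteration value inside a delayed lambda (quite possibly what the linked work-around does, though that is an assumption here) is to bind the loop variable through a default argument, which is evaluated at definition time rather than at call time:

```python
# Default-argument trick: ch=ch copies the loop variable's *current*
# value into the lambda's own scope when the lambda is defined.
funcs = []
for ch in ['ch1', 'ch2']:
    funcs.append(lambda ch=ch: ch)

# Each deferred call now sees the value it was defined with.
results = [f() for f in funcs]
print(results)  # ['ch1', 'ch2']
```

The same pattern can be used directly in the dask call above, e.g. `ddf.apply(lambda row, ch=ch: myfunc(row, ch), axis=1, meta=...)`, without introducing a helper function.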

So this is not a bug, but a feature of Python that is triggered by dask and not by pandas.
