Dask dataframe apply giving unexpected results when passing local variables as arguments
When calling the apply method of a dask DataFrame inside a for loop, using the loop variable as an argument to apply, I get unexpected results when the computation is performed later. This example shows the behavior:
import dask.dataframe as dd
import pandas as pd
import random
import numpy as np

df = pd.DataFrame({'col_1': random.sample(range(10000), 10000),
                   'col_2': random.sample(range(10000), 10000)})
ddf = dd.from_pandas(df, npartitions=8)

def myfunc(x, channel):
    return channel

for ch in ['ch1', 'ch2']:
    ddf[f'df_apply_{ch}'] = ddf.apply(lambda row: myfunc(row, ch), axis=1,
                                      meta=(f'df_apply_{ch}', np.unicode_))

print(ddf.head(5))
From the row-wise application of myfunc I expect to see two additional columns, one filled with "ch1" and one with "ch2" on every row. However, this is the output of the script:
col_1 col_2 df_apply_ch1 df_apply_ch2
0 5485 2234 ch2 ch2
1 6338 6802 ch2 ch2
2 9408 5760 ch2 ch2
3 8447 1451 ch2 ch2
4 1230 3838 ch2 ch2
Apparently, the final iteration of the loop overwrote the argument passed to both apply calls. In fact, any later change to ch between the loop and the call to head affects the result the same way, overwriting what I expect to see in both columns.
This is not what one sees when doing the same exercise with pure pandas. And I found a work-around for dask as well:
def myapply(ddf, ch):
    ddf[f'myapply_{ch}'] = ddf.apply(lambda row: myfunc(row, ch), axis=1,
                                     meta=(f'myapply_{ch}', np.unicode_))

for ch in ['ch1', 'ch2']:
    myapply(ddf, ch)

print(ddf.head(10))
gives:
col_1 col_2 myapply_ch1 myapply_ch2
0 7394 3528 ch1 ch2
1 2181 6681 ch1 ch2
2 7945 1063 ch1 ch2
3 5164 8091 ch1 ch2
4 3569 2889 ch1 ch2
So I see that this has to do with the scope of the variable used as an argument to apply, but I don't understand why exactly this happens with dask (only). Is this the intended/expected behavior?
Any insights would be appreciated! :)
This turns out to be a duplicate after all; see this question on Stack Overflow, which includes another work-around. A more detailed explanation of the behavior can be found in the corresponding issue on the dask tracker:
This isn't a bug, this is just how Python works. Closures evaluate based on their defining scope; if you change the value of trig in that scope, then the closure will evaluate differently. The issue here is that this code would run fine in pandas, since there is an evaluation in each loop iteration, but in dask all the evaluations are delayed until later, and thus all use the same value for trig.
Here, trig is the loop variable used in that discussion.
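The same late-binding behavior can be reproduced with nothing but plain Python closures, which makes it clear that dask's delayed evaluation merely exposes it rather than causes it:

```python
# Each lambda captures the *variable* ch, not its value at definition
# time; by the time the lambdas are called, the loop has finished and
# ch is 'ch2', so both return the same thing.
delayed = []
for ch in ['ch1', 'ch2']:
    delayed.append(lambda: ch)

print([f() for f in delayed])  # ['ch2', 'ch2']

# Binding the current value as a default argument freezes it per lambda,
# the same trick as wrapping the apply call in a function.
bound = []
for ch in ['ch1', 'ch2']:
    bound.append(lambda ch=ch: ch)

print([f() for f in bound])  # ['ch1', 'ch2']
```

In pandas, apply runs eagerly inside the loop, so the closure is evaluated while ch still holds the intended value; in dask, evaluation is deferred to head/compute, after the loop has finished.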
So this is not a bug, but a feature of Python that is triggered by dask's lazy evaluation and not by pandas's eager evaluation.