pandas df.apply unexpectedly changes dataframe inplace
From my understanding, pandas.DataFrame.apply does not apply changes inplace, and we should use its return object to persist any changes. However, I've found the following inconsistent behavior:
Let's apply a dummy function to check that the original df remains untouched:
>>> import pandas as pd
>>> def foo(row: pd.Series):
...     row['b'] = '42'
...
>>> df = pd.DataFrame([('a0','b0'),('a1','b1')], columns=['a', 'b'])
>>> df.apply(foo, axis=1)
>>> df
    a   b
0  a0  b0
1  a1  b1
This behaves as expected. However, foo will apply the changes inplace if we modify the way we initialize this df:
>>> df2 = pd.DataFrame(columns=['a', 'b'])
>>> df2['a'] = ['a0','a1']
>>> df2['b'] = ['b0','b1']
>>> df2.apply(foo, axis=1)
>>> df2
    a   b
0  a0  42
1  a1  42
I've also noticed that the above is not true if the column dtypes are not 'object'. Why does apply() behave differently in these two contexts?
Python: 3.6.5
Pandas: 0.23.1
Interesting question! I believe the behavior you're seeing is an artifact of the way you use apply.
As you correctly indicate, apply is not intended to be used to modify a dataframe. However, since apply takes an arbitrary function, it cannot guarantee that the function is free of side effects and leaves the dataframe unchanged. Here, you've found a great example of that behavior, because your function foo attempts to modify the row that it is passed by apply.
Using apply to modify a row could lead to these side effects. This isn't the best practice.
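To see whether a given function has this side effect on your own setup, one option is to compare the frame against a snapshot taken before the call. The sketch below is only an illustration: mutates_frame is a hypothetical helper name, not part of pandas, and the result for foo can differ between pandas versions and between ways of building the frame, so no output is shown for it.

```python
import pandas as pd

def mutates_frame(df: pd.DataFrame, func) -> bool:
    """Hypothetical helper: True if df.apply(func, axis=1) altered df."""
    before = df.copy(deep=True)   # snapshot taken before apply runs
    df.apply(func, axis=1)
    return not df.equals(before)

def foo(row: pd.Series):
    row['b'] = '42'               # writes into the row it receives

df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0', 'a1']
df2['b'] = ['b0', 'b1']

# Depends on the pandas version and on how the frame is stored internally
print(mutates_frame(df2, foo))

# A function that only reads the row leaves the frame untouched
print(mutates_frame(df2, lambda row: row['b']))
```

A check like this is mainly useful in tests, where it can catch a function that quietly writes into its argument.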
Instead, consider this idiomatic approach for apply. The function apply is often used to create a new column. Here's an example of how apply is typically used, which I believe would steer you away from this potentially troublesome area:
import pandas as pd
# construct df2 just like you did
df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0','b0']
df2['b'] = ['a1','b1']
df2['b_copy'] = df2.apply(lambda row: row['b'], axis=1) # apply to each row
df2['b_replace'] = df2.apply(lambda row: '42', axis=1)
df2['b_reverse'] = df2['b'].apply(lambda val: val[::-1]) # apply to each value in b column
print(df2)
# output:
#     a   b  b_copy  b_replace  b_reverse
# 0  a0  a1      a1         42         1a
# 1  b0  b1      b1         42         1b
Notice that pandas passes a row or a cell to the function you give as the first argument to apply, then stores the function's output in a column of your choice.
If you'd like to modify a dataframe row-by-row, take a look at iterrows and loc for the most idiomatic route.
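As a rough sketch of that route (the frame and the '_42' suffix are made up for illustration), the loop below reads each row from iterrows and writes back through loc on the dataframe itself, so the change never depends on whether the row is a view or a copy:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a0', 'a1'], 'b': ['b0', 'b1']})

# iterrows yields (index label, row Series); treat the row as read-only
for idx, row in df.iterrows():
    # write explicitly into the dataframe, addressed by label
    df.loc[idx, 'b'] = row['a'] + '_42'

print(df['b'].tolist())
```

For large frames a vectorized assignment such as df['b'] = df['a'] + '_42' is usually faster; the loop is shown only because it mirrors the row-by-row intent.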
Maybe late, but I think this may help, especially for someone who reaches this question.
When we use foo like:
def foo(row: pd.Series):
    row['b'] = '42'
and then use it in:
df.apply(foo, axis=1)
we wouldn't expect any change to occur in df, but it does. Why?
Let's review what happens under the hood:
The apply function calls foo and passes one row to it. Like every object in Python, that row is passed as a reference to the object itself, not as a copy of its values. So the row inside foo is exactly the row that apply sent (equal in value, and both names point to the same block of RAM). Any change to row inside foo therefore immediately changes that row object, which is a pandas.Series, and when that Series points to the block of memory where the df row resides, df changes as well.
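The reference-passing part can be seen without apply at all. In this small sketch (the seen list is just an illustrative device), foo records the object it receives, and the caller then confirms it is the very same Series and that the change is visible outside the function:

```python
import pandas as pd

seen = []  # remembers the objects foo was called with

def foo(row: pd.Series):
    seen.append(row)   # keep a reference to the argument
    row['b'] = '42'    # mutate the object that was passed in

row = pd.Series({'a': 'a0', 'b': 'b0'})
foo(row)

print(seen[0] is row)  # True: foo received the same object, not a copy
print(row['b'])        # 42: the caller sees the mutation
```

Whether the Series that apply hands to foo also shares memory with the dataframe is a separate matter that is decided inside pandas.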
We can rewrite foo (I name the new function bar) so that it does not change anything inplace, by deep copying row, which means making another row with the same value(s) in another block of RAM. The effect is the same as using a lambda in apply that returns a new value without modifying row.
def bar(row: pd.Series):
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp
Complete Code
import pandas as pd

# Changes df in place -- not like lambda
def foo(row: pd.Series):
    row['b'] = '42'

# Does not change df inplace -- works like lambda
def bar(row: pd.Series):
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp

df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0', 'a1']
df2['b'] = ['b0', 'b1']
print(df2)
# No change inplace
df_b = df2.apply(bar, axis=1)
print(df2)
# bar function works
print(df_b)
print(df2)
# Changes inplace
df2.apply(foo, axis=1)
print(df2)
Output
# df2 before any change
    a   b
0  a0  b0
1  a1  b1
# calling df2.apply(bar, axis=1) did not change df2 inplace
    a   b
0  a0  b0
1  a1  b1
# df_b = df2.apply(bar, axis=1) -- bar is working as expected
    a   b
0  a0  42
1  a1  42
# print df2 again to make sure it is not changed
    a   b
0  a0  b0
1  a1  b1
# call df2.apply(foo, axis=1) -- as we see, foo changed df2 inplace (compare with bar)
    a   b
0  a0  42
1  a1  42