
pandas df.apply unexpectedly changes dataframe inplace

From my understanding, pandas.DataFrame.apply does not apply changes inplace, and we should use its return object to persist any changes. However, I've found the following inconsistent behavior:

Let's apply a dummy function to check that the original df remains untouched:

>>> def foo(row: pd.Series):
...     row['b'] = '42'

>>> df = pd.DataFrame([('a0','b0'),('a1','b1')], columns=['a', 'b'])
>>> df.apply(foo, axis=1)
>>> df
    a   b
0   a0  b0
1   a1  b1

This behaves as expected. However, foo will apply the changes inplace if we modify the way we initialize this df:

>>> df2 = pd.DataFrame(columns=['a', 'b'])
>>> df2['a'] = ['a0','a1']
>>> df2['b'] = ['b0','b1']
>>> df2.apply(foo, axis=1)
>>> df2
    a   b
0   a0  42
1   a1  42

I've also noticed that this does not happen if the column dtypes are not 'object'. Why does apply() behave differently in these two contexts?

Python: 3.6.5

Pandas: 0.23.1

Interesting question! I believe the behavior you're seeing is an artifact of the way you use apply.

As you correctly indicate, apply is not intended to be used to modify a dataframe. However, since apply takes an arbitrary function, it cannot guarantee that the function is free of side effects and will leave the dataframe unchanged. Here, you've found a great example of that behavior, because your function foo attempts to modify the row that apply passes to it.

Using apply to modify a row can lead to these side effects. This isn't best practice.

Instead, consider this idiomatic approach for apply. The function apply is often used to create a new column. Here's an example of how apply is typically used, which I believe would steer you away from this potentially troublesome area:

import pandas as pd
# construct df2 just like you did
df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0','b0']
df2['b'] = ['a1','b1']

df2['b_copy'] = df2.apply(lambda row: row['b'], axis=1) # apply to each row
df2['b_replace'] = df2.apply(lambda row: '42', axis=1) 
df2['b_reverse'] = df2['b'].apply(lambda val: val[::-1]) # apply to each value in b column

print(df2)

# output:
#     a   b b_copy b_replace b_reverse
# 0  a0  a1     a1        42        1a
# 1  b0  b1     b1        42        1b

Notice that pandas passes a row or a cell to the function you give as the first argument to apply and collects the function's outputs; the assignment then stores them in a column of your choice.

If you'd like to modify a dataframe row-by-row, take a look at iterrows and loc for the most idiomatic route.
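As a minimal sketch of that iterrows and loc route (the helper name set_b_where_a and the sample values are made up for illustration, not taken from the question), the row yielded by iterrows is treated as read-only and every write goes through df.loc, so the modification is explicit:

```python
import pandas as pd


def set_b_where_a(df: pd.DataFrame, a_value: str, new_b: str) -> pd.DataFrame:
    """Hypothetical helper: set column 'b' on rows whose 'a' equals a_value."""
    # iterrows yields (index label, row Series); only read from the row
    for idx, row in df.iterrows():
        if row["a"] == a_value:
            # write back through the frame itself, not through the row
            df.loc[idx, "b"] = new_b
    return df


df = pd.DataFrame({"a": ["a0", "a1"], "b": ["b0", "b1"]})
set_b_where_a(df, "a1", "42")
print(df)
```

Here the intent to change df is visible at the call site, instead of being a hidden side effect of a function passed to apply.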

Maybe late, but I think this may help anyone who reaches this question.

When we define foo like:

def foo(row: pd.Series):
    row['b'] = '42'

and then use it in:

df.apply(foo, axis=1)

we don't expect any change in df, but one occurs. Why?

Let's review what happens under the hood:

The apply function calls foo and passes one row to it. In Python, arguments are passed as references to objects, never as copies. Immutable values such as int, float, or str cannot be changed in place, but the row is a mutable pandas.Series, so foo receives the very same object that apply created (equal in value, and pointing to the same block of memory). If that Series shares its memory with the data held by df, any assignment to row inside foo immediately changes df as well.
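The aliasing described above can be sketched without relying on pandas internals. The example below uses a plain NumPy object array as a stand-in for the frame's underlying data (an assumption made for illustration; the names data and set_second are hypothetical). Writing through a view changes the parent array, while writing to a copy does not:

```python
import numpy as np

# stand-in for the data behind a two-column object-dtype frame
data = np.array([["a0", "b0"], ["a1", "b1"]], dtype=object)


def set_second(row):
    # mutates whatever object `row` refers to; nothing is copied on the call
    row[1] = "42"


view_row = data[0]          # basic indexing returns a view onto `data`
set_second(view_row)        # the write goes through to `data`

copy_row = data[1].copy()   # an independent copy of the second row
set_second(copy_row)        # `data` is not affected

print(data)
print(np.shares_memory(data, view_row))
print(np.shares_memory(data, copy_row))
```

Whether the row that apply passes to the function is a view or a copy depends on how pandas stores the frame, which would explain why the two ways of building the dataframe behave differently.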

We can rewrite foo (I name the new version bar) so that it does not change anything inplace, by deep copying row, which means making another row with the same value(s) in a different block of memory. A lambda that only returns a value behaves the same way in apply, because it never assigns into the row it receives.

def bar(row: pd.Series):
    row_temp=row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp

Complete Code

import pandas as pd


# Changes df in place -- unlike a lambda that only returns a value
def foo(row: pd.Series):
    row['b'] = '42'


# Does not change df inplace -- works like such a lambda
def bar(row: pd.Series):
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp


df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0', 'a1']
df2['b'] = ['b0', 'b1']

print(df2)

# No change inplace
df_b = df2.apply(bar, axis=1)
print(df2)
# bar function works
print(df_b)

print(df2)
# Changes inplace
df2.apply(foo, axis=1)
print(df2)


Output

#df2 before any change
    a   b
0  a0  b0
1  a1  b1

#calling df2.apply(bar, axis=1) did not change df2 inplace
    a   b
0  a0  b0
1  a1  b1

#df_b = df2.apply(bar, axis=1) #bar is working as expected
    a   b
0  a0  42
1  a1  42

#print df2 again to confirm it is not changed
    a   b
0  a0  b0
1  a1  b1

#call df2.apply(foo, axis=1) -- as we see, foo changed df2 inplace (compare with bar)
    a   b
0  a0  42
1  a1  42
