Pandas inplace operation in apply
I am experiencing strange pandas behaviour in the following code:
import numpy as np
import pandas as pd

def info(df):
    # Print the identity of the frame and of each column object.
    print(f"whole df: {hex(id(df))}")
    print(f"col a : {hex(id(df['a']))}")
    print(f"col b : {hex(id(df['b']))}")
    print(f"col c : {hex(id(df['c']))}")

def _drop(col):
    print(f"called on : {col.name}")
    print(f"before drop: {hex(id(col))}")
    col[0] = -1
    col.dropna(inplace=True)
    col[0] = 1
    print(f"after drop : {hex(id(col))}")

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])
info(df)
df.apply(_drop)
info(df)
If I comment out the dropna() line, or call dropna(inplace=False), I get the result I expected (because dropna creates a copy and I am modifying the original series):
a b c
0 1.0 1.0 1.0
1 5.8 NaN NaN
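As a small check of the copy semantics described above, the following sketch uses a hypothetical two-element Series named s (not the original DataFrame). It should show that dropna() without inplace=True hands back a new object and leaves the original untouched:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 5.8])   # hypothetical example data
dropped = s.dropna()           # default inplace=False returns a new Series

print(dropped is s)            # identity check: is the result the same object?
print(len(s), len(dropped))    # row counts of the original and of the result
```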
But with dropna(inplace=True) the operation should be done in place, thus modifying the original series, yet the result I get is:
a b c
0 -1.0 -1.0 -1.0
1 5.8 NaN NaN
However, I would expect the result to be the same as in the previous cases. Is the dropna operation returning a clone even though the operation is inplace? I am using pandas version 0.23.1.
Edit: Based on the provided answers, I added hex(id()) calls to verify the actual instances. The above code printed this (the values might be different for you, but the equality between them should be the same):
whole df : 0x1f482392f28
col a : 0x1f482392f60
col b : 0x1f48452af98
col c : 0x1f48452ada0
called on : a
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on : b
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on : a
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on : b
before drop: 0x1f4ffef1ef0
after drop : 0x1f4ffef1ef0
called on : c
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
whole df : 0x1f482392f28
col a : 0x1f482392f60
col b : 0x1f48452af98
col c : 0x1f48452ada0
It is weird that the function is called twice on columns a and b, whereas the docs say it is called twice only on the first column.
Additionally, the hex value for the second pass of column b is different. Neither of these happens when the col.dropna() call is omitted.
The hex values suggest that .apply() creates a new copy of the columns; however, how it propagates the values back to the original df is unknown to me.
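To check the double-call observation on a given pandas version, one option is to count the invocations per column with a function that does not mutate anything. This is only a sketch: count_calls and the calls counter are hypothetical helper names, and the counts will depend on the installed pandas version.

```python
from collections import Counter

import numpy as np
import pandas as pd

calls = Counter()  # hypothetical helper: column name -> number of invocations

def count_calls(col):
    # Record the call and return the column unchanged (no in-place edits).
    calls[col.name] += 1
    return col

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])
df.apply(count_calls)
print(dict(calls))
```

Comparing these counts with the ones printed by _drop should indicate whether the extra calls are tied to the in-place modification.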
I tried to reason through this using variable scope concepts. I wouldn't consider it a full answer, but maybe it will be insightful for someone else.
When .apply executes on each series, which corresponds here to the col argument, the line col[0] = -1 inside the scope of _drop() changes the "first row" of the df globally and therefore mutates it. When dropna() is called with inplace=True, the NaNs are actually dropped, but ONLY for the series inside the scope of that function; the result is not assigned to the global df, even though it overwrites the variable col.
Another insight might be that the docs say that .dropna(inplace=True) returns None, and _drop() would also return None since there is no return statement.
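The distinction between mutating a shared object and working on a function-local one can be illustrated with plain Python lists. This is only an analogy with hypothetical functions mutate and rebind; it makes no claim about how pandas implements dropna(inplace=True) internally.

```python
def mutate(values):
    # Item assignment changes the caller's list object.
    values[0] = -1

def rebind(values):
    # Rebinding the local name creates a new list; the caller's list
    # is no longer affected by what happens afterwards.
    values = [v for v in values if v is not None]
    values[0] = 1

data = [None, 5.8]
mutate(data)
rebind(data)
print(data)  # [-1, 5.8]
```

The write made before the local object was replaced is visible outside, while the later write is lost, which mirrors the -1 values seen in the question.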
It might be worth raising this issue on the pandas / numpy GitHub. To me, this looks like unexpected behavior. If you add a return col statement to the function, your code works as expected. This indicates that a local copy is indeed created. print(hex(id(col))) confirms this.
import numpy as np
import pandas as pd

def _drop(col):
    col[0] = -1
    col.dropna(inplace=True)
    col[0] = 1
    return col  # <---- return the modified column

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])
df.apply(_drop)
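A variation that avoids in-place edits inside apply altogether is to work on an explicit copy and return it. This is a minimal sketch rather than a fix endorsed by the pandas docs: set_first is a hypothetical name, and the dropna step is left out because it is not needed to obtain the expected table.

```python
import numpy as np
import pandas as pd

def set_first(col):
    # Work on an explicit copy so the column passed in by apply is never mutated.
    out = col.copy()
    out.iloc[0] = 1.0
    return out

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])
result = df.apply(set_first)  # new frame assembled from the returned columns
print(result)
```

Here the original df is left as it was, and result holds the table the question expected (1.0 in the first row, the second row unchanged).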