Pandas inplace operation in apply

I am seeing some strange pandas behaviour in the following code:

import numpy as np
import pandas as pd

def info(df):
    print(f"whole df: {hex(id(df))}")
    print(f"col a   : {hex(id(df['a']))}")
    print(f"col b   : {hex(id(df['b']))}")
    print(f"col c   : {hex(id(df['c']))}")

def _drop(col):
    print(f"called on  : {col.name}")
    print(f"before drop: {hex(id(col))}")
    col[0] = -1    
    col.dropna(inplace=True)
    col[0] = 1
    print(f"after drop : {hex(id(col))}")   


df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])

info(df)
df.apply(_drop)
info(df)

If I comment out the dropna() line, or call dropna(inplace=False), I get the result I expected (because dropna creates a copy and I am modifying the original series):

     a    b    c
 0  1.0  1.0  1.0
 1  5.8  NaN  NaN

But with dropna(inplace=True) the operation should be done in place, thus modifying the original series. Instead, the result I get is:

     a    b    c
 0 -1.0 -1.0 -1.0
 1  5.8  NaN  NaN

I would expect the result to be the same as in the previous cases. Is the dropna operation returning a clone even though the operation is in place? I am using pandas version 0.23.1.

Edit: Based on the provided answers, I added hex(id()) calls to verify the actual instances. The above code printed this (the values might be different for you, but the equalities between them should be the same):

whole df   : 0x1f482392f28
col a      : 0x1f482392f60
col b      : 0x1f48452af98
col c      : 0x1f48452ada0
called on  : a
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on  : b
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on  : a
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on  : b
before drop: 0x1f4ffef1ef0
after drop : 0x1f4ffef1ef0
called on  : c
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
whole df   : 0x1f482392f28
col a      : 0x1f482392f60
col b      : 0x1f48452af98
col c      : 0x1f48452ada0

It is weird that the function is called twice on columns a and b; the docs say it is called twice only on the first column.
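
The number of calls can also be checked without relying on printed ids. The following is a minimal sketch that records the column name on each call; the `calls` list and the `record` function are names introduced here for illustration, and the function deliberately does not mutate the column. How many times the first column is visited may depend on the pandas version, so no particular output is claimed.

```python
import numpy as np
import pandas as pd

calls = []  # hypothetical helper list: one entry per invocation

def record(col):
    # Only record which column was passed in; do not modify it.
    calls.append(col.name)
    return col

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])

df.apply(record)
print(calls)  # every column appears at least once
```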

Additionally, the hex value for the second pass over column b is different. Neither happens when col.dropna() is omitted.

The hex values suggest that .apply() creates a new copy of the columns; however, how it propagates the values back to the original df is unknown to me.
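
Note that id() only identifies the Python wrapper object, so two Series with different ids can still be backed by the same data. A rough way to test for shared data is np.shares_memory on the underlying arrays. The sketch below assumes a pandas version that provides Series.to_numpy(); the `shares_data` helper is a name made up for this example.

```python
import numpy as np
import pandas as pd

def shares_data(x, y):
    """Return True if the two Series are backed by overlapping memory."""
    return np.shares_memory(x.to_numpy(), y.to_numpy())

s = pd.Series([np.nan, 5.8])
deep = s.copy()  # Series.copy() makes a deep copy by default

print(shares_data(s, s))     # the same Series trivially shares its own data
print(shares_data(s, deep))  # a deep copy has its own buffer
```

The same helper could be called inside the function passed to .apply() to compare col with df[col.name]; the answer there may vary between pandas versions.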

I tried to reason through this with variable scope concepts. I wouldn't consider it a full answer, but maybe it will be insightful for someone else.

When .apply executes on each series (the col argument here), the line col[0] = -1 inside _drop() changes the first row of the global df, and therefore mutates it. When dropna() is called with inplace=True, the NaNs are actually dropped, but ONLY for the series inside the scope of that function; the result is not assigned back to the global df, even though it overwrites the variable col. Another insight might be that the docs say .dropna(inplace=True) returns None, and _drop() would also return None since it has no return statement.
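
The return-value part of this can be checked on a standalone Series, outside of .apply(). This is a small sketch with made-up variable names: it shows that dropna() without inplace returns a new object and leaves the original alone, while dropna(inplace=True) returns None and shrinks the same Python object.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.2])
before = id(s)

# inplace=False (the default): a new Series comes back, s is untouched.
dropped = s.dropna()
print(dropped is s, len(s), len(dropped))

# inplace=True: nothing is returned, and the same object loses the NaN row.
result = s.dropna(inplace=True)
print(result, id(s) == before, len(s))
```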

It might be worth raising this issue on the pandas / numpy GitHub; to me, this looks like unexpected behavior. If you add a return col statement to the function, your code works as expected. This indicates that a local copy is indeed created. print(hex(id(col))) confirms this.

def _drop(col):
    col[0] = -1
    col.dropna(inplace=True)
    col[0] = 1
    return col # <----

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])

df.apply(_drop)
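
If the goal is only to end up with 1 in the first row of every column, one way to sidestep the question of what .apply() shares with df is to not mutate the argument at all. The sketch below works on an explicit copy and returns it; `set_first` is a name introduced here, not part of the original code, and it leaves out the dropna() step.

```python
import numpy as np
import pandas as pd

def set_first(col):
    out = col.copy()   # work on a private copy of the column
    out.iloc[0] = 1.0  # set the first row by position
    return out         # apply() builds the result from the return values

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])

result = df.apply(set_first)
print(result)
```

Here df itself is left unchanged and the modified frame is whatever .apply() returns.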
