Replace values in a dataframe column based on condition
I have a seemingly easy task. I have a dataframe with 2 columns, A and B. Wherever a value in B is larger than the corresponding value in A, I want to replace it with the value from A. I used to do this with
df.B[df.B > df.A] = df.A
but a recent pandas upgrade started giving a SettingWithCopyWarning on this chained assignment. The official documentation recommends using .loc.
Okay, I said, and did it through
df.loc[df.B > df.A, 'B'] = df.A
and it all works fine, unless column B has all values of NaN. Then something weird happens:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
In [2]: df
Out[2]:
   A   B
0  1 NaN
1  2 NaN
2  3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
   A                    B
0  1 -9223372036854775808
1  2 -9223372036854775808
2  3 -9223372036854775808
Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})
In [2]: df
Out[2]:
   A   B
0  1 NaN
1  2   4
2  3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
   A   B
0  1 NaN
1  2   2
2  3 NaN
But if none of B's elements satisfy the condition, then all NaNs get replaced with -9223372036854775808:
In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})
In [2]: df
Out[2]:
   A   B
0  1 NaN
1  2   1
2  3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
   A                    B
0  1 -9223372036854775808
1  2                    1
2  3 -9223372036854775808
Is this a bug or a feature? How should I have done this replacement?
Thank you!
This is a buggie, fixed here.
Since pandas allows basically anything to be set on the right-hand side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:
df.loc[lhs, column] = rhs
where rhs could be: list, array, scalar, and lhs could be: slice, tuple, scalar, array
and a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs. (This is a bit complicated.) For example, say you don't set all of the elements on the lhs and it was integer, then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer, then it needs to be coerced BACK to integer.
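To make the lhs/rhs combinations a bit more concrete, here is a small sketch that exercises three of them on a made-up frame. The data is hypothetical, and the column is kept as float throughout so that no dtype coercion is involved; it only illustrates the shapes of assignment that loc accepts.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; B is float so none of these assignments change its dtype.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# scalar lhs, scalar rhs
df.loc[0, "B"] = 0.5

# slice lhs (label-based, end-inclusive), list rhs
df.loc[0:1, "B"] = [1.5, 2.5]

# boolean array lhs, array rhs (one value per selected row)
df.loc[np.array([False, True, True]), "B"] = np.array([7.0, 8.0])

print(df)
```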
In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the type of the rhs, but this case degenerates if we have an unsafe conversion (float -> int, since the float column holds NaN).
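As a side note on the number itself: the value shown in the question's output is the smallest value an int64 can hold, which is consistent with the NaNs having gone through a float-to-integer cast. The check below only confirms that identity. What a NaN-to-integer cast actually produces is platform-dependent, so treat this as an interpretation of the output rather than a guaranteed behaviour.

```python
import numpy as np

# Smallest representable int64, i.e. -2**63.
int64_min = np.iinfo(np.int64).min
print(int64_min)

# The value that appeared in column B in the question.
value_from_question = -9223372036854775808
print(int64_min == value_from_question)
```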
Suffice to say this was a missing edge case.
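As for how to do the replacement, one option that avoids the loc assignment path altogether is Series.mask, which returns a new Series and leaves rows alone where the condition is False. This is a sketch and not the fix referred to above; the helper name cap_b_at_a is made up for illustration.

```python
import numpy as np
import pandas as pd

def cap_b_at_a(df):
    """Return a copy of df where B is replaced by A wherever B > A."""
    out = df.copy()
    # NaN > x evaluates to False, so NaN rows are not selected by the mask.
    out["B"] = out["B"].mask(out["B"] > out["A"], out["A"])
    return out

# The all-NaN case from the question.
all_nan = cap_b_at_a(pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, np.nan, np.nan]}))
# The case where one element of B exceeds A.
mixed = cap_b_at_a(pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, 4, np.nan]}))

print(all_nan)
print(mixed)
```

Because the result is assigned back as a whole column, B stays float and the NaNs survive even when no row satisfies the condition.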