How do I update a value in a DataFrame with a mask when iterating through rows
With the code below I'm trying to set the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly, though: the code runs without errors but doesn't set the value to 1 for the respective predictions placed.
    df_test['placed'] = np.zeros(len(df_test))
    for i in set(df_test['id']):
        mask = df_test['id'] == i
        predictions = lm.predict(X_test[mask])
        j = np.argmax(predictions)
        if predictions[j] > 0:
            df_test['placed'][mask][j] = 1
            print(df_test['placed'][mask][j])
Edit: changed suggestion based on comments.
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the data that is immediately thrown away, and never changes the original DataFrame.
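To see the effect in isolation, here is a minimal sketch on hypothetical toy data (df_demo is made up for illustration, it is not the df_test from the question). It splits the chained expression into two steps so the temporary copy becomes visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_test.
df_demo = pd.DataFrame({"id": [1, 1, 2, 2], "placed": np.zeros(4)})
mask = df_demo["id"] == 2

# Boolean indexing builds a new Series, so `subset` is a copy of the data.
subset = df_demo["placed"][mask]
subset.iloc[0] = 1  # only the copy is modified

print(subset.tolist())             # the copy now contains a 1
print(df_demo["placed"].tolist())  # the original column is still all zeros
```

In the chained form the copy has no name, so the change is lost as soon as the statement finishes.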
To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:
    df_test.loc[mask.nonzero()[0][j], 'placed'] = 1
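(I know the mask.nonzero() part uses two sets of square brackets; nonzero() actually returns a tuple, and the first element of that tuple is an ndarray. But the DataFrame itself is indexed with only one set, and that's the important part.) Note that Series.nonzero() was deprecated in pandas 0.24 and removed in 1.0, so on recent versions you would write mask.to_numpy().nonzero() instead. Either way, the positions it returns only line up with .loc labels when df_test has the default integer index.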
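Here is a rough, runnable sketch of the corrected loop. Since I don't know what lm is, StubModel below is a hypothetical stand-in whose predict just returns a column of X_test, and the data is invented. Instead of nonzero(), it looks up the row label through df_test.index, which should also work when the index is not the default 0..n-1 range:

```python
import numpy as np
import pandas as pd

class StubModel:
    """Hypothetical stand-in for lm: 'predicts' the value of column x."""
    def predict(self, X):
        return X["x"].to_numpy()

lm = StubModel()
# Invented data with a non-default index to show the label lookup.
df_test = pd.DataFrame({"id": [1, 1, 2, 2, 3]}, index=list("abcde"))
X_test = pd.DataFrame({"x": [0.2, 0.9, -0.5, -0.1, 0.4]}, index=df_test.index)

df_test["placed"] = 0
for i in df_test["id"].unique():
    mask = df_test["id"] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)  # position within the masked rows
    if predictions[j] > 0:
        # Translate the position within the mask into a row label.
        label = df_test.index[mask.to_numpy()][j]
        # One indexing call on df_test, so the original is updated.
        df_test.loc[label, "placed"] = 1

print(df_test["placed"].tolist())  # [0, 1, 0, 0, 1]
```

Group 2 gets no 1 because its best prediction (-0.1) is not positive.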
I have a couple of notes on using pandas (and NumPy).
pandas and NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing (the column just holds integer zeros instead of floats).
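A quick way to convince yourself, again on a made-up df_demo rather than the real data:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"id": [1, 1, 2]})  # hypothetical toy data

df_demo["placed_a"] = np.zeros(len(df_demo))  # explicit array of zeros
df_demo["placed_b"] = 0                       # scalar broadcast to every row

# Same values in every row; only the dtype differs (float vs. integer).
print((df_demo["placed_a"] == df_demo["placed_b"]).all())
print(df_demo["placed_a"].dtype, df_demo["placed_b"].dtype)
```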
Generally speaking, when working with pandas and NumPy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say for sure, but you might be able to avoid the for-loop entirely for this code.
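For what it's worth, here is one way a loop-free version might look. It rests on an assumption I can't verify for your model: that lm.predict can be called once on all of X_test and returns one prediction per row, in row order. StubModel and the data are hypothetical placeholders, and groupby/idxmax is my substitution for the per-id argmax, not something from the original code:

```python
import pandas as pd

class StubModel:
    """Hypothetical stand-in for lm: 'predicts' the value of column x."""
    def predict(self, X):
        return X["x"].to_numpy()

lm = StubModel()
df_test = pd.DataFrame({"id": [1, 1, 2, 2, 3]}, index=list("abcde"))
X_test = pd.DataFrame({"x": [0.2, 0.9, -0.5, -0.1, 0.4]}, index=df_test.index)

# Assumption: a single predict call covers every row.
pred = pd.Series(lm.predict(X_test), index=df_test.index)

# Row label of the highest prediction within each id.
best = pred.groupby(df_test["id"]).idxmax()

# Keep only the groups whose best prediction is positive.
winners = best[pred.loc[best.to_numpy()].to_numpy() > 0]

df_test["placed"] = 0
df_test.loc[winners.to_numpy(), "placed"] = 1

print(df_test["placed"].tolist())  # [0, 1, 0, 0, 1]
```

Like np.argmax, idxmax picks the first row when there is a tie, so the two versions should agree on which row gets placed.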