
How do I update a value in a DataFrame with a mask when iterating through rows

With the code below I'm trying to set the column df_test['placed'] to 1 whenever the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly, though: the code runs without error, but 'placed' is never set to 1 for the predictions that were placed.

df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']) :
    mask = df_test['id']==i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0 :
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])

Answering your question

Edit: changed suggestion based on comments

The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the data that is immediately thrown away, and never changes the original DataFrame.

To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:

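df_test.loc[mask.to_numpy().nonzero()[0][j], 'placed'] = 1

(I know the nonzero() part uses two sets of square brackets; nonzero() actually returns a tuple, and the first element of that tuple is an ndarray holding the positions where mask is True. But the DataFrame itself is only indexed once, and that's the important part. Two caveats: Series.nonzero() was removed in pandas 1.0, hence the to_numpy() call, and because .loc selects by label, this line assumes df_test has the default 0..n-1 index, where positions and labels coincide.)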
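
If df_test does not have the default integer index, one way to keep the single-.loc rule is to translate the position j into an index label first. This is only a sketch: the data below is made up, and j is hard-coded where the question would use np.argmax(predictions).

```python
import pandas as pd

# Hypothetical stand-in data; the question does not show the real df_test.
df_test = pd.DataFrame(
    {"id": [10, 10, 20, 20, 20]},
    index=["a", "b", "c", "d", "e"],  # deliberately not 0..n-1
)
df_test["placed"] = 0

mask = df_test["id"] == 20
j = 1  # position within the masked rows, as np.argmax would return

# Turn "j-th row of the masked subset" into an index label,
# then assign with one .loc call on the original DataFrame.
label = df_test.index[mask.to_numpy()][j]
df_test.loc[label, "placed"] = 1

print(df_test)
```

This relies on the index labels being unique; with duplicate labels, .loc would update every row that shares the label.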

Some other notes

I have a couple of notes on using pandas (& numpy).

  • Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing (the column just holds integer zeros instead of floats).

  • Generally speaking, when working with pandas & numpy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say for sure, but you might be able to avoid the for-loop entirely for this code.
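
To make both notes concrete, here is a rough loop-free sketch. It assumes lm.predict can be called once on all of X_test and returns one number per row; since I can't confirm that, the pred Series below is hypothetical data standing in for that call. It also assumes df_test has unique index labels.

```python
import pandas as pd

# Hypothetical data: `pred` stands in for one lm.predict(X_test) call on all rows.
df_test = pd.DataFrame({"id": [1, 1, 2, 2, 3, 3]})
pred = pd.Series([0.2, 0.7, -0.5, -0.1, 0.4, 0.3], index=df_test.index)

# Broadcasting: a single scalar fills the whole column.
df_test["placed"] = 0

# Index label of the highest prediction within each id group
# (the first one wins on ties, like np.argmax).
best = pred.groupby(df_test["id"]).idxmax()

# Keep only the groups whose best prediction is positive.
best = best[pred.loc[best.to_numpy()].to_numpy() > 0]

# One .loc assignment on the original DataFrame, no chained indexing.
df_test.loc[best.to_numpy(), "placed"] = 1

print(df_test)
```

Group 2 has no positive prediction, so none of its rows are marked, which matches the if predictions[j] > 0 check in the original loop.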
