Setting pandas dataframe value based on row and column conditions
I have a fairly specific algorithm I want to follow.
Basically I have a dataframe as follows:
month taken score
1 1 2 23
2 1 1 34
3 1 2 12
4 1 2 59
5 2 1 12
6 2 2 23
7 2 1 43
8 2 2 45
9 3 1 43
10 3 2 43
11 4 1 23
12 4 2 94
I want to set the 'score' column to 100 on the days where taken == 2 holds continuously through the end of that month. So not every occurrence of taken == 2 has its score set to 100: a row keeps its score if any later day in the same month has taken == 1.
So the result I'd want is:
month taken score
1 1 2 23
2 1 1 34
3 1 2 100
4 1 2 100
5 2 1 12
6 2 2 23
7 2 1 43
8 2 2 100
9 3 1 43
10 3 2 43
11 3 1 23
12 3 2 100
13 4 1 32
14 4 2 100
I've written this code which I feel should do it:
# iterate through months
for month in range(12):
    # iterate through scores
    for score in range(len(df_report.loc[df_report['month'] == month+1])):
        # starting from the bottom of that month, if 'taken' == 2...
        if df_report.loc[df_report.month==month+1, 'taken'].iloc[-score-1] == 2:
            # then set the score to 100
            df_report.loc[df_report.month==month+1, 'score'].iloc[-score-2] = 100
        # if you run into a 'taken' == 1, move on to next month
        else: break
However, this doesn't appear to change any values, despite not throwing an error. It also doesn't give me a warning about setting values on a copied dataframe.
Could anyone explain what I'm doing wrong?
The reason your values are not being updated is that the assignment to iloc updates the copy returned by the preceding loc call, so the original is not touched.
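The difference can be seen on a tiny frame. The following is a minimal sketch with made-up data (the three-row frame and the names chained_result, single_result and last_label are illustrative assumptions, not taken from the question): the chained loc/iloc write lands on a temporary object, while a single loc call writes to the frame itself.

```python
import pandas as pd

# Hypothetical three-row frame, only for illustration
df = pd.DataFrame({"month": [1, 1, 2], "taken": [2, 2, 1], "score": [10, 20, 30]})

# Chained indexing: .loc with a boolean mask builds a new Series,
# and .iloc then writes into that temporary object, not into df
df.loc[df.month == 1, "score"].iloc[-1] = 100
chained_result = df["score"].tolist()   # df is unchanged

# Single-step indexing: look up the row label first, then assign via one .loc call
last_label = df.index[df.month == 1][-1]
df.loc[last_label, "score"] = 100
single_result = df["score"].tolist()    # last row of month 1 is now 100

print(chained_result)
print(single_result)
```

Depending on the pandas version, the chained line may also emit a warning about chained assignment; either way the original frame is left as it was.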
Here's how I'd tackle this. First, define a function foo.
def foo(df):
    # walk up from the last row of the month
    for i in reversed(df.index):
        # stop at the first row (from the bottom) where taken is not 2
        if df.loc[i, 'taken'] != 2:
            break
        df.loc[i, 'score'] = 100
    return df
Now, groupby month and call foo:
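# group_keys=False keeps the original index; since pandas 2.0 the default
# would add 'month' as an extra index level
df = df.groupby('month', group_keys=False).apply(foo)
print(df)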
month taken score
1 1 2 23
2 1 1 34
3 1 2 100
4 1 2 100
5 2 1 12
6 2 2 23
7 2 1 43
8 2 2 100
9 3 1 43
10 3 2 100
11 4 1 23
12 4 2 100
Obviously, apply has its shortcomings, but I cannot think of a vectorised approach to this problem.
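For what it's worth, one possible vectorised route (a sketch that is not part of the answer above, and assumes the rows are already in day order within each month) is to reverse the rows and take a per-month running minimum of the taken == 2 flag: the flag stays 1 from the bottom of each month until the first row that is not 2. The names is_two_rev and trailing are made up for illustration.

```python
import pandas as pd

# Sample frame copied from the question (index 1..12)
df = pd.DataFrame(
    {
        "month": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
        "taken": [2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        "score": [23, 34, 12, 59, 12, 23, 43, 45, 43, 43, 23, 94],
    },
    index=range(1, 13),
)

# 1 where taken == 2, else 0, with the rows in reverse order
is_two_rev = df["taken"].eq(2).astype(int).iloc[::-1]

# Running minimum per month, bottom-up: stays 1 until the first non-2 row
trailing = is_two_rev.groupby(df["month"].iloc[::-1]).cummin().astype(bool)

# Put the mask back in the original row order and assign in one .loc call
trailing = trailing.reindex(df.index)
df.loc[trailing, "score"] = 100

print(df)
```

On the sample data this marks rows 3, 4, 8, 10 and 12, the same rows the foo approach changes.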
You can do:
import numpy as np
import pandas as pd

def get_value(x):
    s = x['taken']
    # Mask rows that belong to a run of two or more equal consecutive values
    mask = s.ne(s.shift()).cumsum().duplicated(keep=False)
    # Set the score to 100 on masked rows, keep it elsewhere
    news = np.where(mask, 100, x['score'])
    # If the last value of the month is 2, set its score to 100 as well
    if s.iloc[-1] == 2:
        news[-1] = 100
    return pd.Series(news)

df['score'] = df.groupby('month').apply(get_value).values
Output:
month taken score
1 1 2 23
2 1 1 34
3 1 2 100
4 1 2 100
5 2 1 12
6 2 2 23
7 2 1 43
8 2 2 100
9 3 1 43
10 3 2 100
11 4 1 23
12 4 2 100
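To see what the mask in get_value is doing, here is a small sketch on hand-made Series (the values and the names run_id, in_long_run and other_flags are illustrative assumptions). Comparing each value with its predecessor and taking the cumulative sum gives every run of equal values its own id; duplicated(keep=False) then flags ids that occur more than once.

```python
import pandas as pd

# 'taken' values of month 1 from the question
s = pd.Series([2, 1, 2, 2])

# A new run starts wherever the value differs from the previous one
run_id = s.ne(s.shift()).cumsum()

# Rows whose run id appears more than once belong to a run of length >= 2
in_long_run = run_id.duplicated(keep=False)

print(run_id.tolist())
print(in_long_run.tolist())

# The flag depends only on run length, not on the value or its position
other = pd.Series([1, 1, 2])
other_flags = other.ne(other.shift()).cumsum().duplicated(keep=False)
print(other_flags.tolist())
```

Note that the second Series flags the two leading 1s, so on data with repeated taken == 1 rows, or with a run of 2s in the middle of a month, this mask may mark rows the question does not want changed.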
Almost identical speed, but @coldspeed is the winner:
ndf = pd.concat([df]*10000).reset_index(drop=True)
%%timeit
ndf['score'] = ndf.groupby('month').apply(foo)
10 loops, best of 3: 40.8 ms per loop
%%timeit
ndf['score'] = ndf.groupby('month').apply(get_value).values
10 loops, best of 3: 42.6 ms per loop