
Setting pandas dataframe value based on row and column conditions

I have a fairly specific algorithm I want to follow.

Basically I have a dataframe as follows:

        month   taken   score
1       1       2       23
2       1       1       34
3       1       2       12
4       1       2       59
5       2       1       12
6       2       2       23
7       2       1       43
8       2       2       45
9       3       1       43
10      3       2       43
11      4       1       23
12      4       2       94

I want the 'score' column set to 100 on days where taken == 2 continuously until the end of that month. So not every occurrence of taken == 2 has its score set to 100: if any later day in that month has taken == 1, the score is left unchanged.

So the result I'd want is:

        month   taken   score
1       1       2       23
2       1       1       34
3       1       2       100
4       1       2       100
5       2       1       12
6       2       2       23
7       2       1       43
8       2       2       100
9       3       1       43
10      3       2       43
11      3       1       23
12      3       2       100
13      4       1       32
14      4       2       100

I've written this code, which I feel should do it:

#iterate through months
for month in range(12):
    #iterate through scores
    for score in range(len(df_report.loc[df_report['month'] == month+1])):
        #starting from the bottom, of that month, if 'taken' == 2...
        if df_report.loc[df_report.month==month+1, 'taken'].iloc[-score-1] == 2:
            #then set the score to 100
            df_report.loc[df_report.month==month+1, 'score'].iloc[-score-2] = 100
        #if you run into a 'taken' == 1, move on to next month
        else: break

However, this doesn't appear to change any values, despite not throwing an error. It also doesn't give me an error about setting values on a copied dataframe.

Could anyone explain what I'm doing wrong?

The reason your values are not being updated is that the assignment through iloc modifies the copy returned by the preceding loc call, so the original dataframe is never touched.
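To make the copy issue concrete, here is a minimal sketch on a small made-up frame (the data and the names in_month and last_label are illustrative assumptions, not taken from the question). It contrasts the chained loc/iloc assignment with a single loc assignment; depending on the pandas version, the chained line may also emit a warning.

```python
import pandas as pd

# Small illustrative frame (not the question's data)
df = pd.DataFrame({"month": [1, 1, 2], "taken": [2, 2, 2], "score": [10, 20, 30]})
in_month = df["month"] == 1

# Chained: .loc with a boolean mask builds a new Series,
# and .iloc then writes into that temporary object only.
df.loc[in_month, "score"].iloc[-1] = 100
chained_result = df["score"].tolist()

# Single step: find the row label first, then assign through one .loc call.
last_label = df.index[in_month][-1]
df.loc[last_label, "score"] = 100
direct_result = df["score"].tolist()

print(chained_result)
print(direct_result)
```

After the chained assignment the frame still holds its original scores; only the single loc call changes the last row of month 1.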


Here's how I'd tackle this. First, define a function foo.

def foo(df):
    # Walk the group's rows from the last one upwards
    for i in reversed(df.index):
        # Stop at the first row where taken is not 2
        if df.loc[i, 'taken'] != 2:
            break
        df.loc[i, 'score'] = 100
    return df

Now, group by month and call foo:

# group_keys=False keeps the original row index on the result
df = df.groupby('month', group_keys=False).apply(foo)
print(df)
    month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

Obviously, apply has its shortcomings, but I cannot think of a vectorised approach to this problem.
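For what it's worth, one possible vectorised formulation is sketched here. It is not part of the answer above, and the names rev, seen_after and trailing are made up for illustration. The idea is to reverse the rows and, within each month, count how many taken != 2 values occur from each row to the end of the month; rows where that count is zero form the trailing run of 2s.

```python
import pandas as pd

# The question's input data, with its 1-based index
df = pd.DataFrame(
    {
        "month": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
        "taken": [2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        "score": [23, 34, 12, 59, 12, 23, 43, 45, 43, 43, 23, 94],
    },
    index=range(1, 13),
)

# Reverse the rows so a cumulative sum runs from the end of each month backwards
rev = df.iloc[::-1]

# Per month: number of non-2 values between this row and the month's last row
seen_after = rev["taken"].ne(2).astype(int).groupby(rev["month"]).cumsum()

# A count of zero means the row belongs to the month's trailing run of 2s
trailing = seen_after.reindex(df.index).eq(0)
df.loc[trailing, "score"] = 100

print(df)
```

On the sample data this marks rows 3, 4, 8, 10 and 12, the same rows that foo changes.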

You can do:

import numpy as np
import pandas as pd

def get_value(x):
    s = x['taken']
    # Label each run of consecutive equal values in this month
    runs = s.ne(s.shift()).cumsum()
    # Select the final run, but only if that run consists of 2s
    mask = runs.eq(runs.iloc[-1]) & s.eq(2)
    # Set the selected rows to 100 and keep the other scores
    return pd.Series(np.where(mask, 100, x['score']))

# Assumes df is already sorted by month, so the flattened values line up with the rows
df['score'] = df.groupby('month').apply(get_value).values

Output:

    month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

The speed is almost identical, but @coldspeed's solution is slightly faster:

ndf = pd.concat([df]*10000).reset_index(drop=True)

%%timeit
ndf['score'] = ndf.groupby('month').apply(foo)
10 loops, best of 3: 40.8 ms per loop


%%timeit  
ndf['score'] = ndf.groupby('month').apply(get_value).values
10 loops, best of 3: 42.6 ms per loop
