简体   繁体   English

根据 pandas 中的条件减去行中的值

[英]Subtract value in row based on condition in pandas

I need to subtract dates based on the progression of fault count.我需要根据故障计数的进展减去日期。 Below is the table that has the two input columns Date and Fault_Count .下面是具有两个输入列DateFault_Count的表。 The output columns I need are Option1 and Option2 .我需要的 output 列是Option1Option2 The last two columns show the date difference calculations.最后两列显示日期差异计算。 Basically when the Fault_Count changes I need to count the number of days from when the Fault_Count changed to the initial start of fault count.基本上,当 Fault_Count 更改时,我需要计算从 Fault_Count 更改到故障计数初始开始的天数。 For example the Fault_Count changed to 2 on 1/4/2020 , I need to get the number of days from when the Fault_Count started at 0 and changed to 2 (ie 1/4/2020 - 1/1/2020 = 3 ).例如 Fault_Count 在1/4/2020更改为 2 ,我需要获取从Fault_Count0开始并更改为 2 的天数(即1/4/2020 - 1/1/2020 = 3 )。

Date    Fault_Count Option1 Option2    Option1calc          Option2calc
1/1/2020       0       0        0       
1/2/2020       0       0        0       
1/3/2020       0       0        0       
1/4/2020       2       3        3   1/4/2020-1/1/2020    1/4/2020-1/1/2020
1/5/2020       2       0        0       
1/6/2020       2       0        0       
1/7/2020       4       3        3   1/7/2020-1/4/2020    1/7/2020-1/4/2020
1/8/2020       4       0        0       
1/9/2020       5       2        2   1/9/2020-1/7/2020    1/9/2020-1/7/2020
1/10/2020      5       0        0       
1/11/2020      0       2       -2   1/11/2020-1/9/2020   (1/11/2020-1/9/2020)*-1 as the fault resets
1/12/2020      1       1        1   1/12/2020-1/11/2020  1/12/2020-1/11/2020

Below is the code.下面是代码。

import pandas as pd

d = {'Date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020', '1/6/2020', '1/7/2020', '1/8/2020', '1/9/2020', '1/10/2020', '1/11/2020', '1/12/2020'], 'Fault_Count' : [0, 0, 0, 2, 2, 2, 4, 4, 5, 5, 0, 1]}
df = pd.DataFrame(d)
df['Date'] = pd.to_datetime(df['Date'])
df['Fault_count_diff'] = df.Fault_Count.diff().fillna(0)
df['Cumlative_Sum'] = df.Fault_count_diff.cumsum()

I thought I could use cumulative sum and group by to get the groups and get the differences of the first value of groups.我想我可以使用累积和和分组来获取组并获得组的第一个值的差异。 That's as far as I could get, also I noticed that using cumulative sum was not giving me ordered groups as some of the Fault_Count get reset.据我所知,我还注意到使用累积总和并没有给我有序的组,因为一些Fault_Count被重置。

    Date        Fault_Count   Fault_count_diff    Cumlative_Sum
0   2020-01-01       0               0.0                0.0
1   2020-01-02       0               0.0                0.0
2   2020-01-03       0               0.0                0.0
3   2020-01-04       2               2.0                2.0
4   2020-01-05       2               0.0                2.0
5   2020-01-06       2               0.0                2.0
6   2020-01-07       4               2.0                4.0
7   2020-01-08       4               0.0                4.0
8   2020-01-09       5               1.0                5.0
9   2020-01-10       5               0.0                5.0
10  2020-01-11       0              -5.0                0.0
11  2020-01-12       1               1.0                1.0

Desired output:所需的 output:

         Date  Fault_Count  Option1  Option2
0  2020-01-01            0      0.0      0.0
1  2020-01-02            0      0.0      0.0
2  2020-01-03            0      0.0      0.0
3  2020-01-04            2      3.0      3.0
4  2020-01-05            2      0.0      0.0
5  2020-01-06            2      0.0      0.0
6  2020-01-07            4      3.0      3.0
7  2020-01-08            4      0.0      0.0
8  2020-01-09            5      2.0      2.0
9  2020-01-10            5      0.0      0.0
10 2020-01-11            0      2.0     -2.0
11 2020-01-12            1      1.0      1.0

Thanks for the help.谢谢您的帮助。

Instead of df['Fault_count_diff'] =... and the next line, do:而不是df['Fault_count_diff'] =...和下一行,做:

df['cycle'] = (df.Fault_Count.diff() < 0).cumsum()

Then to get the dates in between each count change.然后获取每个计数更改之间的日期。

Option1.选项1。 If all calendar dates are present in df:如果 df 中存在所有日历日期:

ndays = df.groupby(['cycle', 'Fault_Count']).Date.size()

Option2.选项 2。 If there's the possibility of a date not showing up in df and you still want to get the calendar days between incidents:如果日期有可能未显示在df中,并且您仍想获取事件之间的日历天数:

ndays = df.groupby(['cycle', 'Fault_Count']).Date.min().diff().dropna()

Use:利用:

m1 = df['Fault_Count'].ne(df['Fault_Count'].shift(fill_value=0))
m2 = df['Fault_Count'].eq(0) & df['Fault_Count'].shift(fill_value=0).ne(0)

s = df['Date'].groupby(m1.cumsum()).transform('first')

df['Option1'] = df['Date'].sub(s.shift()).dt.days.where(m1, 0)
df['Option2'] = df['Option1'].where(~m2, df['Option1'].mul(-1))

Details:细节:

Use Series.ne + Series.shift to create boolean mask m1 which represent the boundary condition when Fault_count changes, similarly use Series.eq + Series.shift and Series.ne to create a boolean mask m2 which represent the condition where Fault_count resets:使用Series.ne + Series.shift创建 boolean 掩码m1 ,表示Fault_count变化时的边界条件,同样使用Series.eq + Series.shiftSeries.ne创建 boolean 掩码m2 ,表示Fault_count重置的条件:

          m1           m2
0         False        False
1         False        False
2         False        False
3          True        False
4         False        False
5         False        False
6          True        False
7         False        False
8          True        False
9         False        False
10         True         True # --> Fault count reset
11         True        False

Use Series.groupby on consecutive fault counts obtained using m1.cumsum and transform the Date column using groupby.first :对使用Series.groupby获得的连续故障计数使用m1.cumsum并使用groupby.first转换Date列:

print(s)
0    2020-01-01
1    2020-01-01
2    2020-01-01
3    2020-01-04
4    2020-01-04
5    2020-01-04
6    2020-01-07
7    2020-01-07
8    2020-01-09
9    2020-01-09
10   2020-01-11
11   2020-01-12
Name: Date, dtype: datetime64[ns]

Use Series.sub to subtract Date for s shifted using Series.shift and use Series.where to fill 0 based on mask m2 and assign this to Option1 .使用Series.sub减去Date使用Series.shift移位的s并使用Series.where根据掩码m2填充0并将其分配给Option1 Similary we obtain Option2 from Option1 based on mask m2 :同样,我们根据掩码m2Option1获得Option2

print(df)
         Date  Fault_Count  Option1  Option2
0  2020-01-01            0      0.0      0.0
1  2020-01-02            0      0.0      0.0
2  2020-01-03            0      0.0      0.0
3  2020-01-04            2      3.0      3.0
4  2020-01-05            2      0.0      0.0
5  2020-01-06            2      0.0      0.0
6  2020-01-07            4      3.0      3.0
7  2020-01-08            4      0.0      0.0
8  2020-01-09            5      2.0      2.0
9  2020-01-10            5      0.0      0.0
10 2020-01-11            0      2.0     -2.0
11 2020-01-12            1      1.0      1.0

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM