Pandas - count streak since last value change

I want to get the streak since a value in a specific column changed:

datetime                 val1          val2  val1-streak   val2-streak
2018-04-01 00:00:00        4           1        0             0
2018-05-01 00:00:00        5           2        0             0
2018-06-01 00:00:00        5           2        1             1
2018-07-01 00:00:00        6           2        0             2
2018-08-01 00:00:00        7           2        0             3
2018-09-01 00:00:00        7           3        1             0
2018-10-01 00:00:00        7           3        2             1
2018-11-01 00:00:00        5           2        0             0

For now it is OK to assume that the datetime column is equally spaced. That way I can just count the "periods" since the value in a target column changed. Ideally, the function could additionally compute the difference in datetime since the value last changed.

I found a few solutions on this site, but when I looked more closely they did not really describe my problem:

Pandas - count since last transaction

Counting changes in pandas row values?

In pandas, how do you find the time since a column by value changes, when grouped by another column?

Pythonic way to calculate streaks in pandas dataframe

finding streaks in pandas dataframe

https://joshdevlin.com/blog/calculate-streaks-in-pandas/

... at least I was not able to transfer the given answers to my problem.

Use a custom function that builds a counter over runs of consecutive equal values. For each column in the list, take the differences with Series.diff, flag the changes by comparing for inequality with Series.ne, turn the flags into group labels with a cumulative sum, and finally pass those labels to GroupBy.cumcount:

vals = ['val1','val2']

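def f(x):
    # label each run of equal consecutive values with an increasing id
    x = x.diff().ne(0).cumsum()
    # number the rows inside each run, starting at 0
    return x.groupby(x).cumcount()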

df = df.join(df[vals].apply(f).add_suffix('_streak'))
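As a self-contained check, the snippet above can be run against the sample data from the question. This is only a sketch: the DataFrame construction is an assumption about how the data is loaded, the intermediate name block_id is introduced for illustration, and the resulting columns are named val1_streak and val2_streak (underscore) rather than val1-streak and val2-streak as in the question.

```python
import pandas as pd

# Sample data copied from the question (construction is an assumption).
df = pd.DataFrame({
    "datetime": pd.date_range("2018-04-01", periods=8, freq="MS"),
    "val1": [4, 5, 5, 6, 7, 7, 7, 5],
    "val2": [1, 2, 2, 2, 2, 3, 3, 2],
})

vals = ["val1", "val2"]

def f(x):
    # Label each run of equal consecutive values with an increasing block id.
    # The first diff is NaN, and NaN != 0 is True, so the first row opens a block.
    block_id = x.diff().ne(0).cumsum()
    # Number the rows inside each block, starting at 0.
    return x.groupby(block_id).cumcount()

result = df.join(df[vals].apply(f).add_suffix("_streak"))
print(result)
```

The two new columns should match the val1-streak and val2-streak columns in the question's table.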

Let's start with how to do it with a single column. First off we need to find each point where the value changes:

diffs = df['val1'].diff(1)
change_points = diffs != 0

Since previously encountered values can show up again, we need a way to distinguish different streaks of the same value. We'll achieve that with a cumulative sum of the change point array. For convenience we'll create a temporary DataFrame to hold those results (you could create a new column in your DataFrame instead, but that's a bit messy):

change_points_cumsum = change_points.cumsum()
tmp_merged = pd.concat([df['val1'], change_points_cumsum], axis=1, keys=['val1', 'change_points_cumsum'])

Every streak is assigned a different change_points_cumsum value, so now a simple group by and cumulative count can be applied to get the final result:

tmp_merged["val1-streak"] = tmp_merged.groupby(change_points_cumsum).cumcount()
print(tmp_merged)

   val1  change_points_cumsum  val1-streak 
0     4                     1             0
1     5                     2             0
2     5                     2             1
3     6                     3             0
4     7                     4             0
5     7                     4             1
6     7                     4             2
7     5                     5             0
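To illustrate why the cumulative sum is needed, here is a small comparison with grouping by the value itself. It is a sketch on the val1 column from the question; the names naive, block_id and streak are introduced only for this example.

```python
import pandas as pd

val1 = pd.Series([4, 5, 5, 6, 7, 7, 7, 5], name="val1")

# Grouping by the value itself merges all occurrences of 5 into one group,
# so the last row continues the count from rows 1 and 2.
naive = val1.groupby(val1).cumcount()

# Grouping by the cumulative sum of change points keeps separate runs apart.
block_id = val1.diff().ne(0).cumsum()
streak = val1.groupby(block_id).cumcount()

print(pd.DataFrame({"val1": val1, "naive": naive, "streak": streak}))
```

In the last row the naive count is 2, because 5 already appeared twice, while the streak count restarts at 0.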

If you are looking for a shorter, more compact solution:

change_points_cumsum = df['val1'].diff(1).ne(0).cumsum()
change_points_cumsum.groupby(change_points_cumsum).cumcount()

0    0
1    0
2    1
3    0
4    0
5    1
6    2
7    0
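The question also asks for the datetime difference since the value last changed. One possible sketch, reusing the same change-point labels: take the first timestamp of each streak and subtract it from the current row's timestamp. The DataFrame construction and the names block_id, block_start and val1_elapsed are assumptions made for this example.

```python
import pandas as pd

# Sample data copied from the question (construction is an assumption).
df = pd.DataFrame({
    "datetime": pd.date_range("2018-04-01", periods=8, freq="MS"),
    "val1": [4, 5, 5, 6, 7, 7, 7, 5],
})

# One label per run of equal consecutive values.
block_id = df["val1"].diff().ne(0).cumsum().rename("block_id")

# Timestamp of the first row of each run, broadcast to every row of that run.
block_start = df.groupby(block_id)["datetime"].transform("first")

# Elapsed time since the value last changed (a Timedelta per row).
df["val1_elapsed"] = df["datetime"] - block_start
print(df)
```

Unlike the period count, this does not rely on the datetime column being equally spaced.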

Let's try the following:

(i) Create a stacked Series using set_index + unstack (this avoids repeating the same function on the two columns separately).

(ii) Use groupby + diff to find the differences between consecutive elements within each column, and store the result in tmp.

(iii) The rows where the difference is 0 are the ones that continue a streak; group by column name and by the running count of changes, then apply cumsum to count those rows within each streak.

(iv) Use pivot to get the output back into the shape of df.

tmp = df.set_index('datetime')[['val1','val2']].unstack().groupby(level=0).diff()
df[['val1-streak','val2-streak']] = pd.pivot(tmp.eq(0)
                                             .groupby([tmp.index.get_level_values(0), 
                                                       tmp.ne(0).cumsum()])
                                             .cumsum()
                                             .reset_index(), 
                                             index='datetime', columns='level_0', values=0).to_numpy()

Output:

              datetime  val1  val2  val1-streak  val2-streak
0  2018-04-01 00:00:00     4     1            0            0
1  2018-05-01 00:00:00     5     2            0            0
2  2018-06-01 00:00:00     5     2            1            1
3  2018-07-01 00:00:00     6     2            0            2
4  2018-08-01 00:00:00     7     2            0            3
5  2018-09-01 00:00:00     7     3            1            0
6  2018-10-01 00:00:00     7     3            2            1
7  2018-11-01 00:00:00     5     2            0            0

