
Find length of streak in pandas

I have a pandas dataframe where a column holds an integer time index, and I want to add a column that stores whether a row is part of a streak and how long the streak is. For example, given the time column, I would like to compute a streak column, like so:

time    streak
0       3
1       3
2       3
4       2
5       2
5       2
9       1
11      1
11      1

The first three lines are part of a streak of three since indices 0,1,2 are contiguous. The following three lines have a streak of 2 since indices 4,5 are also contiguous; index 5 is repeated, but this shouldn't count when determining the length of a streak. Finally, the last three lines are not contiguous with anything else, so they have a streak of 1. Notice that sometimes more than one row can have the same time. I need to count the length of the streak in time units, so that repeated entries don't affect the length of the streak, and lines with the same time index have the same streak length. Bear in mind that other columns (not shown) are stored in the dataframe.

How do I get the value? I tried playing around with groupby, shift and similar functions, but didn't get very far.

EDIT: sorry, I forgot to specify that sometimes the time index can be repeated. I expanded the question to take this into account.

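Use diff to find whether each row continues the previous one (a difference equal to 1), then cumsum over the rows where that condition fails to give each streak its own label, then use groupby + transform('size') to get the length of the streak each row belongs to.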

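# Streak label: increments whenever the gap to the previous row is not 1
# (fillna(1) treats the first row as contiguous)
s = df.time.diff().fillna(1).ne(1).cumsum()
# Number of rows sharing each row's streak label
s.groupby(s).transform('size')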
Out[396]: 
0    3
1    3
2    3
3    2
4    2
5    1
6    1
Name: time, dtype: int32
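To make the chain above easier to follow, here is a minimal sketch that splits it into named steps. The frame and the names step, breaks and streak_id are illustrative assumptions, not part of the answer; the data is the original question's example without repeated times, so this version does not handle duplicates.

```python
import pandas as pd

# Assumed sample data: the question's example before repeated times were added.
df = pd.DataFrame({"time": [0, 1, 2, 4, 5, 9, 11]})

# Gap to the previous row; the first row has no predecessor, so treat it as contiguous.
step = df["time"].diff().fillna(1)

# True wherever the gap is not exactly 1, i.e. where a new streak starts.
breaks = step.ne(1)

# Running count of breaks: every row in the same streak gets the same label.
streak_id = breaks.cumsum()

# Number of rows carrying each label, broadcast back to every row.
df["streak"] = streak_id.groupby(streak_id).transform("size")

print(df.assign(step=step, breaks=breaks, streak_id=streak_id))
```

Assigning the result back to df keeps any other columns in the frame untouched.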

Very similar to Wen's answer, just using value_counts which I feel is a tiny bit more pandorable.

import pandas as pd

time = pd.Series([0, 1, 2, 4, 5, 9, 11])

# Give each row a streak id by incrementing whenever the difference isn't 1
streak = (time.diff() != 1).cumsum()

# Maps each id to the number of times the id occurs
result = streak.map(streak.value_counts())

print(result)
Out:
0    3
1    3
2    3
3    2
4    2
5    1
6    1

Edit: here's a solution for the new case added to the question, where times may be duplicated. Note that we now use diff > 1 to find new streaks; this assumes the times are increasing integers (though no longer necessarily strictly increasing). The possible duplication just means we have to drop_duplicates before counting the streak ids used for the mapping.

time = pd.Series([0, 1, 2, 4, 5, 5, 9, 11, 11])

result = (time.diff() > 1).cumsum().map(
    (time.drop_duplicates().diff() > 1).cumsum().value_counts()
)

print(result)
Out:
0    3
1    3
2    3
3    2
4    2
5    2
6    1
7    1
8    1
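Since the question mentions that the dataframe holds other columns, here is a hedged sketch of the same idea applied to a frame rather than a bare Series. It swaps the drop_duplicates + value_counts mapping for groupby + transform('nunique'), which counts distinct times per streak; this is an alternative, not the method above. The value column and the streak_id name are made up for illustration, and time is assumed to be sorted in non-decreasing order.

```python
import pandas as pd

# Assumed frame: "time" comes from the question, "value" stands in for the other columns.
df = pd.DataFrame({
    "time": [0, 1, 2, 4, 5, 5, 9, 11, 11],
    "value": list("abcdefghi"),
})

# A new streak starts wherever the gap to the previous row exceeds 1.
# Repeated times give a gap of 0, so they stay in the current streak.
streak_id = (df["time"].diff() > 1).cumsum().rename("streak_id")

# Count distinct times per streak so repeated times are not counted twice.
df["streak"] = df.groupby(streak_id)["time"].transform("nunique")

print(df)
```

Rows with the same time end up with the same streak length, and the extra columns are left as they were.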
