简体   繁体   中英

Python 3.6: Pandas - Obtain earlier occurence of an entry along column of a DataFrame

I have a pandas data frame with two columns - "Day" and "Event". On each day, an arbitrary number of events can occur, each denoted by a char. I want to construct an additional column indicating how many days have passed since an event last occurred. The code below can illustrate this:

>>> import pandas as pd
>>> event_data = pd.DataFrame({'Day':[1,1,1,1,2,2,2,3,3,4,4,4,4], 
                             'Event':['A','B','C','D','B','F','A','B','C','G','A','B','D']})
>>> print(event_data)

     Day Event
0     1     A
1     1     B
2     1     C
3     1     D
4     2     B
5     2     F
6     2     A
7     3     B
8     3     C
9     4     G
10    4     A
11    4     B
12    4     D

I want an additional column "days_since_event" which would look like:

    Day Event  Days_since_event
0     1     A                -1
1     1     B                -1
2     1     C                -1
3     1     D                -1
4     2     B                 1
5     2     F                -1
6     2     A                 1
7     3     B                 1
8     3     C                 2
9     4     G                -1
10    4     A                 2
11    4     B                 1
12    4     D                 3

Row 12 has a value 3 as event 'D' occurred on day 4 and the previous occurrence was on Day 1 implying that 3 days have passed since event 'D' last happened. Rows with value -1 indicate that the event is happening for the first time on that day.

What is a computationally optimum way of doing this? The original dataset has over 10 million rows and using loops or repeated data subsetting is too slow.

You could group the dataframe by Event column and calculate a diff of Day like this:

event_data['Days_since_event'] = event_data.groupby('Event')['Day']\
                                           .diff().fillna(-1).astype(int)

First occurrences of every event will have NaN values, so I fill those with -1 as described in your question.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM