
Calculate diff() in Python on subsets of data within a DataFrame

I am new to Python and coming from SAS. I want to calculate a lag variable (time difference using diff()) between sequential rows, but I want to re-start the process every time I encounter a new individual. In SAS this is done using dif() or lag() using a by-command. Is there a similar way to do this using Python?

Here is what I want the data to look like (note the missing data each time I encounter a new value for PIT):

PIT Receiver    tottime     Lag
1   1   2015-01-21 12:00:00 
1   1   2015-01-21 12:00:05 5
1   1   2015-01-21 12:00:20 15
1   1   2015-01-21 12:00:30 10
1   1   2015-01-21 12:00:35 5
1   2   2015-01-22 12:00:35 86400
1   2   2015-01-22 12:00:50 15
1   2   2015-01-22 12:00:55 5
1   2   2015-01-22 12:01:05 10
1   2   2015-01-22 12:01:10 5
2   1   2015-01-12 12:01:10 
2   1   2015-01-12 12:01:15 5
2   2   2015-01-12 12:01:20 5
2   2   2015-01-12 12:01:25 5
2   2   2015-01-12 12:01:30 5

I tried this using this code:

import numpy as np
import pandas as pd

Clean['tottime'] = pd.to_datetime(Clean.tottime.values)  # Convert tottime to datetime values
tindex = Clean.tottime.values                            # Vector of time values for the second level of the multi-index
arrays = [Clean.PIT.values, tindex]                      # Both levels of the multi-index

index = pd.MultiIndex.from_arrays(arrays, names=['PIT', 'tottime'])  # Declare the multi-level index
Clean.index = index

Clean['lag'] = Clean.tottime.diff()                   # Difference in tottime between consecutive rows
Clean['lag'] = Clean['lag'] / np.timedelta64(1, 's')  # Convert 'lag' to a numeric (float64) number of seconds

But this produces something like the following (i.e., it works within the first PIT, but does not restart when the PIT value changes):

PIT Receiver    tottime    Lag
1   1   2015-01-21 12:00:00 
1   1   2015-01-21 12:00:05 5
1   1   2015-01-21 12:00:20 15
1   1   2015-01-21 12:00:30 10
1   1   2015-01-21 12:00:35 5
1   2   2015-01-22 12:00:35 86400
1   2   2015-01-22 12:00:50 15
1   2   2015-01-22 12:00:55 5
1   2   2015-01-22 12:01:05 10
1   2   2015-01-22 12:01:10 5
2   1   2015-01-12 12:01:10 -864000
2   1   2015-01-12 12:01:15 5
2   2   2015-01-12 12:01:20 5
2   2   2015-01-12 12:01:25 5
2   2   2015-01-12 12:01:30 5

So it fails to reset on the new PIT, and I get a large negative number (-864000 seconds, because the first row for PIT 2 is 10 days earlier than the last row for PIT 1). Eventually I want to be able to do this by PIT and Receiver, but for now the challenge is to compute the difference in tottime, grouped by PIT. Any suggestions for how to do this?

I also suspect this is a special case of a common problem (by-group processing), but I do not know how to phrase the question in Python terms, so I am not finding related questions on Stack Overflow. Any guidance would be appreciated.

Thanks!

One way to do this is to use the pandas groupby() functionality.

This is a slightly cumbersome approach as I don't have your code, but you could try the following, assuming that your DataFrame is in the same format as you have shown, without the lag column.

First, create a function, diff_func , which will be applied to the groupby object.

def diff_func(df):
    return df.diff()

Then use the groupby() :

Clean['Lag'] = Clean.groupby('PIT')['tottime'].apply(diff_func)

The line above groups Clean by the column PIT , applies the function to the column tottime within each group, and stores the result in the new column Lag . The first row of each group has no previous row to subtract, so its Lag is missing.
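As a minimal sketch of the same idea without the wrapper function: a grouped column exposes diff() directly, and the result can be converted to seconds with .dt.total_seconds(). The small DataFrame below is hypothetical sample data in the layout shown in the question, not the asker's actual data, and the column name LagByReceiver is made up for illustration. The second assignment shows how the grouping might be extended to PIT and Receiver.

```python
import pandas as pd

# Hypothetical sample data in the same layout as the question
Clean = pd.DataFrame({
    "PIT": [1, 1, 1, 2, 2],
    "Receiver": [1, 1, 2, 1, 1],
    "tottime": pd.to_datetime([
        "2015-01-21 12:00:00",
        "2015-01-21 12:00:05",
        "2015-01-22 12:00:05",
        "2015-01-12 12:01:10",
        "2015-01-12 12:01:15",
    ]),
})

# diff() restarts within each PIT group, so the first row of every
# group is NaT, which total_seconds() turns into NaN
Clean["Lag"] = Clean.groupby("PIT")["tottime"].diff().dt.total_seconds()

# Same idea, restarting on every (PIT, Receiver) combination
Clean["LagByReceiver"] = (
    Clean.groupby(["PIT", "Receiver"])["tottime"].diff().dt.total_seconds()
)

print(Clean)
```

This assumes the rows are already sorted by tottime within each group; if they are not, sorting with sort_values(["PIT", "tottime"]) first would be needed for the differences to be meaningful.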

So you want to reset the value whenever PIT differs from the previous row? That's easy:

df.loc[df.PIT != df.PIT.shift(1), 'Lag'] = 0
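Note that the line above writes 0 on each boundary row, whereas the question asks for a missing value there. A minimal sketch of that variant follows, assuming the rows for each PIT are contiguous and sorted by tottime; the DataFrame is hypothetical sample data and new_pit is a name introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows for each PIT are contiguous
df = pd.DataFrame({
    "PIT": [1, 1, 2, 2],
    "tottime": pd.to_datetime([
        "2015-01-22 12:01:05",
        "2015-01-22 12:01:10",
        "2015-01-12 12:01:10",
        "2015-01-12 12:01:15",
    ]),
})

# Difference over the whole column, in seconds; this still spans
# the boundary between PIT 1 and PIT 2
df["Lag"] = df["tottime"].diff().dt.total_seconds()

# True on the first row of every run of identical PIT values
new_pit = df["PIT"] != df["PIT"].shift(1)

# Blank out the differences that cross a PIT boundary
df.loc[new_pit, "Lag"] = np.nan

print(df)
```

Unlike the groupby approach, this compares only adjacent rows, so it gives the intended result only when the data is ordered so that each PIT forms a single block.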
