
Finding duration between events

I want to compute the duration (in weeks) between changes. For example, p is the same for weeks 1, 2, and 3 and changes to 1.11 in week 4, so the duration is 3. Right now the duration is computed in a loop ported from R. It works, but it is slow. Any suggestion on how to improve this would be greatly appreciated.

import numpy as np
import pandas as pd

raw['duration'] = np.nan
dur = raw.columns.get_loc('duration')
for uid in raw['unique_id'].unique():
    # positional indices of the rows where p changed for this id
    pos = np.where((abs(raw['change']) > 0) & (raw['unique_id'] == uid))[0]
    # the first change is measured from week 1
    raw.iloc[pos[0], dur] = raw['week'].iloc[pos[0]] - 1
    for j in range(1, len(pos)):
        raw.iloc[pos[j], dur] = raw['week'].iloc[pos[j]] - raw['week'].iloc[pos[j - 1]]

The dataframe is raw, and the values for a particular unique_id look like this.

date         week p    change    duration
2006-07-08    27  1.05 -0.07         1
2006-07-15    28  1.05  0.00       NaN
2006-07-22    29  1.05  0.00       NaN
2006-07-29    30  1.11  0.06         3
...          ...   ...   ...       ...
2010-06-05   231  1.61  0.09         1
2010-06-12   232  1.63  0.02         1
2010-06-19   233  1.57 -0.06         1
2010-06-26   234  1.41 -0.16         1
2010-07-03   235  1.35 -0.06         1
2010-07-10   236  1.43  0.08         1
2010-07-17   237  1.59  0.16         1
2010-07-24   238  1.59  0.00       NaN
2010-07-31   239  1.59  0.00       NaN
2010-08-07   240  1.59  0.00       NaN
2010-08-14   241  1.59  0.00       NaN
2010-08-21   242  1.61  0.02         5

##

Computing durations once you have your list in date order is trivial: iterate over the list, keeping track of how long it has been since the last change to p (see the sketch below). If the slowness comes from how you build that list, you haven't provided nearly enough information to help with that.
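
As a minimal sketch of that single pass (assuming the rows for one unique_id are already sorted by week, that the column names match the sample above, and that the first change is measured from week 1 as in the question's loop):

import numpy as np
import pandas as pd

def durations_for_one_id(df):
    # Walk one unique_id's rows in week order and record, at each change
    # to p, how many weeks have passed since the previous change.
    df = df.sort_values('week').copy()
    df['duration'] = np.nan
    last_change_week = 1  # the first change is measured from week 1
    for idx, row in df.iterrows():
        if row['change'] != 0:
            df.loc[idx, 'duration'] = row['week'] - last_change_week
            last_change_week = row['week']
    return df

Applied per id, this would be something like raw.groupby('unique_id', group_keys=False).apply(durations_for_one_id).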

You can simply get the list of weeks where there is a change, then compute their differences, and finally join those differences back onto your original DataFrame.

weeks = raw.query('change != 0.0')[['week']]
weeks['duration'] = weeks['week'].diff()
raw = pd.merge(raw, weeks, on='week', how='left')
Thank you all. I modified the suggestion so that it also groups by unique_id (merging on week alone would mix weeks from different ids), and it gives the same answer as the complicated loop. For 10,000 observations it is not a whole lot faster, but the code is more compact.

raw2 = raw.loc[raw['change'] != 0, ['week', 'unique_id']]
raw2['duration'] = raw2.groupby('unique_id')['week'].diff()
raw = pd.merge(raw, raw2, on=['unique_id', 'week'], how='left')

I set rows with no change to NaN because the duration seems undefined when no change is made, but zero would work too. With the code above, the NaN is filled in automatically by the merge. In any case, I want to compute statistics for the no-change group separately.
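
If the merge itself turns out to be the bottleneck, the join can be avoided entirely by assigning the grouped differences back through a boolean mask. This is only a sketch under the same column-name assumptions, and note that, unlike the original loop, it leaves the first change of each unique_id as NaN rather than week - 1:

import numpy as np

mask = raw['change'] != 0
raw['duration'] = np.nan
# diff() within each unique_id over only the change rows; pandas index
# alignment writes each gap back onto the row where the change happened
raw.loc[mask, 'duration'] = raw.loc[mask].groupby('unique_id')['week'].diff()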
