
Resampling timedelta for multivariate data in python pandas

Still new to Pandas:

I have a dataframe (df) of time series data covering several persons and their variables.

The dataframe looks like this:

date1                 date2                 person  var1   var2
2016-01-04 12:56:00   2016-01-05 00:00:00   1       100    200   
2016-01-04 15:11:00   2016-01-05 00:00:00   1       110    120
2016-01-06 18:05:00   2016-01-06 00:00:00   2       220    300
2016-01-07 18:59:00   2016-01-07 00:00:00   2       200    100
2016-01-02 19:30:00   2016-01-02 00:00:00   3       400    500
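For reference, a minimal sketch that rebuilds the sample above as a DataFrame. It assumes date1 and date2 are parsed as datetimes (the table does not show the dtypes):

```python
import pandas as pd

# Sample rows copied from the table above.
# Assumption: both date columns are datetime64, not strings.
df = pd.DataFrame({
    "date1": pd.to_datetime([
        "2016-01-04 12:56:00",
        "2016-01-04 15:11:00",
        "2016-01-06 18:05:00",
        "2016-01-07 18:59:00",
        "2016-01-02 19:30:00",
    ]),
    "date2": pd.to_datetime([
        "2016-01-05", "2016-01-05", "2016-01-06", "2016-01-07", "2016-01-02",
    ]),
    "person": [1, 1, 2, 2, 3],
    "var1": [100, 110, 220, 200, 400],
    "var2": [200, 120, 300, 100, 500],
})
```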

For each person I'm trying to create a dataframe where:

  • The index is the timedelta in days between the two dates
  • Observations are upsampled so that every person has the same number of rows, covering -5 to 5 days
  • For observations collected on the same day, the mean of each variable is taken
  • Missing values are filled with 0

so it looks like this:

days   person  var1   var2
-5     1        0      0   
-4     1        0      0
-3     1        0      0
-2     1        0      0
-1     1        X      X
0      1        Y      Y
...
-5     2        0      0
-4     2        0      0

etc

I've tried using a timedelta index, using a groupby('person') and resampling but I don't know how to get the specified date ranges (-5 to 5) for each person or how to handle the multiple observations.

Pointers very much appreciated, and thank you.

First, build the full list of persons and the day differences of interest.

import numpy as np

persons = df['person'].unique()
date_diffs = np.arange(-5, 6)  # -5 to 5 inclusive

Then, build a list of all possible combinations for your 'upsampling'.

person_date_diff_combos = [(person, date_diff) for person in persons for date_diff in date_diffs]
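As an alternative to the list comprehension, the same combinations could probably be built with `pd.MultiIndex.from_product`, which also lets you name the index levels. This is only a sketch: the `persons` array below is a hypothetical stand-in for `df['person'].unique()`, and `full_index` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the values computed from df.
persons = np.array([1, 2, 3])
date_diffs = np.arange(-5, 6)

# Cartesian product of persons and day offsets, with named levels.
full_index = pd.MultiIndex.from_product(
    [persons, date_diffs], names=["person", "date_diff"]
)
```

An index built this way can be passed to `reindex` in place of the list of tuples.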

Finally, to get the averages you need, group the data by person and by date_diff, take the mean of both variables, reindex to create the missing rows, and fill those rows with zeros.

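df = (
    df.groupby([
        'person',
        # whole days between the two dates (floored)
        (df['date2'] - df['date1']).dt.days.rename('date_diff'),
    ])[['var1', 'var2']]
    .mean()  # average observations that share a person and day
    .reindex(person_date_diff_combos, axis=0)  # add the missing rows
    .fillna(0)
)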
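Putting the steps together, here is a self-contained sketch that also returns the columns in the order shown in the question (days, person, var1, var2). The function name `daily_means`, its `lo`/`hi` parameters, and the three-row `sample` frame are made up for illustration; the column names are assumed to match the question.

```python
import numpy as np
import pandas as pd


def daily_means(df, lo=-5, hi=5):
    """Mean of var1/var2 per person and day offset, zero-filled on a fixed grid."""
    # Whole-day difference; .dt.days floors, so -18h05m becomes -1.
    days = (df["date2"] - df["date1"]).dt.days.rename("days")
    # Every (person, day offset) pair that should appear in the output.
    full_index = pd.MultiIndex.from_product(
        [df["person"].unique(), np.arange(lo, hi + 1)],
        names=["person", "days"],
    )
    return (
        df.groupby(["person", days])[["var1", "var2"]]
        .mean()
        .reindex(full_index, fill_value=0)
        .reset_index()[["days", "person", "var1", "var2"]]
    )


# First three rows of the sample data from the question.
sample = pd.DataFrame({
    "date1": pd.to_datetime(
        ["2016-01-04 12:56", "2016-01-04 15:11", "2016-01-06 18:05"]
    ),
    "date2": pd.to_datetime(["2016-01-05", "2016-01-05", "2016-01-06"]),
    "person": [1, 1, 2],
    "var1": [100, 110, 220],
    "var2": [200, 120, 300],
})
result = daily_means(sample)
```

With this sample, both observations for person 1 fall on day 0 and are averaged to var1 = 105 and var2 = 160, while the single observation for person 2 lands on day -1 because a negative partial day is floored. If you want the day offset rounded differently, the `days` line is the place to change it.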


 