Take the differences between groups of varying size in pandas groupby

Question

I need to calculate the differences between consecutive time groups in data like the following

from io import StringIO

import pandas as pd

strio = StringIO("""\
               date  feat1         feat2  value
2016-10-15T00:00:00      1             1    0.0
2016-10-15T00:00:00      1             2    1.0
2016-10-15T00:00:00      2             1    2.0
2016-10-15T00:00:00      2             2    3.0
2016-10-15T00:01:00      1             1    8.0
2016-10-15T00:01:00      1             2    5.0
2016-10-15T00:02:00      1             1    8.0
2016-10-15T00:02:00      1             2   12.0
2016-10-15T00:02:00      2             1   10.0
2016-10-15T00:02:00      2             2   11.0
2016-10-15T00:03:00      1             1   12.0
2016-10-15T00:03:00      1             2   13.0
2016-10-15T00:03:00      2             1   14.0
2016-10-15T00:03:00      2             2   15.0""")

I can do this using xarray library

df = pd.read_table(strio, sep='\s+')
dims = df.columns.values[:3].tolist()
df.set_index(dims, inplace=True) # needed to convert to xarray dataset
dataset = df.to_xarray()
diff_time = dataset.diff(dim=dims[0]) # take the diff in time
print(diff_time.to_dataframe().reset_index())

prints

                   date  feat1  feat2  value
0   2016-10-15T00:01:00      1      1    8.0
1   2016-10-15T00:01:00      1      2    4.0
2   2016-10-15T00:01:00      2      1    NaN
3   2016-10-15T00:01:00      2      2    NaN
4   2016-10-15T00:02:00      1      1    0.0
5   2016-10-15T00:02:00      1      2    7.0
6   2016-10-15T00:02:00      2      1    NaN
7   2016-10-15T00:02:00      2      2    NaN
8   2016-10-15T00:03:00      1      1    4.0
9   2016-10-15T00:03:00      1      2    1.0
10  2016-10-15T00:03:00      2      1    4.0
11  2016-10-15T00:03:00      2      2    4.0

So in time instant 2016-10-15T00:01:00 that I have feat1:2 missing the relevant diffs are nan

How can I do this in pure pandas in a vectorized way? Constructing the original dataframe with nan fill-ins (so groups are equally sized) is an option but rather avoided

A clumsy way to do it would be:

dfs = []
for k, v in zip(itertools.islice(df.groupby(level=0).groups.values(), 1, None),
                df.groupby(level=0).groups.values()):
    # print(df.loc(axis=0)[k.values] , df.loc(axis=0)[v.values])
    diff = df.loc(axis=0)[k.values].reset_index(level=0, drop=True) - \
           df.loc(axis=0)[v.values].reset_index(level=0, drop=True)
    diff = pd.concat([diff], keys=[k.values[0][0]], names=['date'])
    dfs.append(diff)
print(pd.concat(dfs).reset_index())

It does print the same output but it is not vectorized

Answer 1

Updated solution:

df.unstack(0)['value']\
  .diff(axis=1)\
  .dropna(how='all', axis=1)\
  .unstack([0,1])\
  .rename('value')\
  .reset_index()

Output:

                   date  feat1  feat2  value
0   2016-10-15T00:01:00      1      1    8.0
1   2016-10-15T00:01:00      1      2    4.0
2   2016-10-15T00:01:00      2      1    NaN
3   2016-10-15T00:01:00      2      2    NaN
4   2016-10-15T00:02:00      1      1    0.0
5   2016-10-15T00:02:00      1      2    7.0
6   2016-10-15T00:02:00      2      1    NaN
7   2016-10-15T00:02:00      2      2    NaN
8   2016-10-15T00:03:00      1      1    4.0
9   2016-10-15T00:03:00      1      2    1.0
10  2016-10-15T00:03:00      2      1    4.0
11  2016-10-15T00:03:00      2      2    4.0

Details:

After creating a three level MultiIndex, first let's unstack level 0, date, which moves dates from rows to columns, then use diff on columns, lastly drop the the first date using dropna where the whole column is nan and unstack feat1 and feat2 to recreate multiindex and convert back to dataframe.

Take the differences between groups of varying size in pandas groupby

Question

1 answers

solution1
2 ACCPTED 2019-01-28 15:45:01

Updated solution:

Take the differences between groups of varying size in pandas groupby

Question

1 answers

solution1 2 ACCPTED 2019-01-28 15:45:01

Updated solution:

solution1
2 ACCPTED 2019-01-28 15:45:01