Take the differences between groups of varying size in pandas groupby
I need to compute the differences between consecutive time groups in my data, which looks like this:
from io import StringIO
import pandas as pd
strio = StringIO("""\
date feat1 feat2 value
2016-10-15T00:00:00 1 1 0.0
2016-10-15T00:00:00 1 2 1.0
2016-10-15T00:00:00 2 1 2.0
2016-10-15T00:00:00 2 2 3.0
2016-10-15T00:01:00 1 1 8.0
2016-10-15T00:01:00 1 2 5.0
2016-10-15T00:02:00 1 1 8.0
2016-10-15T00:02:00 1 2 12.0
2016-10-15T00:02:00 2 1 10.0
2016-10-15T00:02:00 2 2 11.0
2016-10-15T00:03:00 1 1 12.0
2016-10-15T00:03:00 1 2 13.0
2016-10-15T00:03:00 2 1 14.0
2016-10-15T00:03:00 2 2 15.0""")
I can do this with the xarray library:
df = pd.read_table(strio, sep=r'\s+')
dims = df.columns.values[:3].tolist()
df.set_index(dims, inplace=True) # needed to convert to xarray dataset
dataset = df.to_xarray()
diff_time = dataset.diff(dim=dims[0]) # take the diff in time
print(diff_time.to_dataframe().reset_index())
which prints:
date feat1 feat2 value
0 2016-10-15T00:01:00 1 1 8.0
1 2016-10-15T00:01:00 1 2 4.0
2 2016-10-15T00:01:00 2 1 NaN
3 2016-10-15T00:01:00 2 2 NaN
4 2016-10-15T00:02:00 1 1 0.0
5 2016-10-15T00:02:00 1 2 7.0
6 2016-10-15T00:02:00 2 1 NaN
7 2016-10-15T00:02:00 2 2 NaN
8 2016-10-15T00:03:00 1 1 4.0
9 2016-10-15T00:03:00 1 2 1.0
10 2016-10-15T00:03:00 2 1 4.0
11 2016-10-15T00:03:00 2 2 4.0
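So at time 2016-10-15T00:01:00, where feat1 = 2 is missing, the corresponding differences are NaN.
How can I do this in a vectorized way in pure pandas? Padding the original DataFrame with NaN (so that all groups have the same size) is an option, but I would prefer to avoid it.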
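For reference, the NaN-padding option mentioned above might look roughly like the sketch below. It is only an illustration on a reduced copy of the data; the names `raw`, `full_index`, `padded`, and `diffs` are made up for this example and are not part of the original code.

```python
from io import StringIO

import pandas as pd

# Reduced copy of the data: feat1 == 2 is missing at 00:01:00.
raw = StringIO("""\
date feat1 feat2 value
2016-10-15T00:00:00 1 1 0.0
2016-10-15T00:00:00 2 1 2.0
2016-10-15T00:01:00 1 1 8.0
2016-10-15T00:02:00 1 1 8.0
2016-10-15T00:02:00 2 1 10.0
""")
df = pd.read_csv(raw, sep=r"\s+").set_index(["date", "feat1", "feat2"])

# Build every (date, feat1, feat2) combination and pad the gaps with NaN.
full_index = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
padded = df.reindex(full_index)

# Within each (feat1, feat2) pair, subtract the previous date's value.
diffs = padded.groupby(level=["feat1", "feat2"])["value"].diff()

# The first date has no predecessor, so drop its rows.
first_date = full_index.levels[0][0]
diffs = diffs[diffs.index.get_level_values("date") != first_date]
result = diffs.reset_index()
print(result)
```

Because the missing rows become NaN before the subtraction, any difference that touches a missing date also comes out as NaN, which matches the xarray behaviour shown earlier.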
A clumsy way to do it is:
import itertools

dfs = []
# Pair each date group with the one before it: (t1, t0), (t2, t1), ...
date_groups = df.groupby(level=0).groups.values()
for k, v in zip(itertools.islice(date_groups, 1, None), date_groups):
    # Drop the date level so the two groups align on (feat1, feat2) only
    diff = df.loc(axis=0)[k.values].reset_index(level=0, drop=True) - \
           df.loc(axis=0)[v.values].reset_index(level=0, drop=True)
    # Re-attach the later date as the outer index level
    diff = pd.concat([diff], keys=[k.values[0][0]], names=['date'])
    dfs.append(diff)
print(pd.concat(dfs).reset_index())
It does produce the same output, but it is not vectorized.
df.unstack(0)['value']\
.diff(axis=1)\
.dropna(how='all', axis=1)\
.unstack([0,1])\
.rename('value')\
.reset_index()
Output:
date feat1 feat2 value
0 2016-10-15T00:01:00 1 1 8.0
1 2016-10-15T00:01:00 1 2 4.0
2 2016-10-15T00:01:00 2 1 NaN
3 2016-10-15T00:01:00 2 2 NaN
4 2016-10-15T00:02:00 1 1 0.0
5 2016-10-15T00:02:00 1 2 7.0
6 2016-10-15T00:02:00 2 1 NaN
7 2016-10-15T00:02:00 2 2 NaN
8 2016-10-15T00:03:00 1 1 4.0
9 2016-10-15T00:03:00 1 2 1.0
10 2016-10-15T00:03:00 2 1 4.0
11 2016-10-15T00:03:00 2 2 4.0
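Details:
With the three-level MultiIndex in place, first unstack level 0 (date), which moves the dates from the rows to the columns. Then take diff across the columns. Finally, use dropna to remove the first date, whose whole column is NaN, and unstack feat1 and feat2 to rebuild the MultiIndex before converting back to a DataFrame.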
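The steps described above can be broken out one at a time, as in the sketch below. It runs on a reduced copy of the data, and the intermediate names (`raw`, `wide`, `wide_diff`, `result`) are illustrative only; level names are used in place of the positional `0` and `[0, 1]`.

```python
from io import StringIO

import pandas as pd

# Reduced copy of the data: feat1 == 2 is missing at 00:01:00.
raw = StringIO("""\
date feat1 feat2 value
2016-10-15T00:00:00 1 1 0.0
2016-10-15T00:00:00 2 1 2.0
2016-10-15T00:01:00 1 1 8.0
2016-10-15T00:02:00 1 1 8.0
2016-10-15T00:02:00 2 1 10.0
""")
df = pd.read_csv(raw, sep=r"\s+").set_index(["date", "feat1", "feat2"])

# Step 1: move the date level from the rows to the columns.
# Missing (date, feat1, feat2) combinations become NaN automatically.
wide = df["value"].unstack("date")

# Step 2: subtract the previous date column from each date column.
wide_diff = wide.diff(axis=1)

# Step 3: drop the first date, whose column is entirely NaN.
wide_diff = wide_diff.dropna(how="all", axis=1)

# Step 4: unstacking every remaining row level gives a Series
# indexed by (date, feat1, feat2); turn it back into a DataFrame.
result = wide_diff.unstack(["feat1", "feat2"]).rename("value").reset_index()
print(result)
```

Printing `wide` and `wide_diff` along the way shows where the NaN values come from: the unstack in step 1 fills the missing combination, and the subtraction in step 2 propagates it to the following date.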