Pandas groupby difference between 2 most recent dates
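pandas version: 1.1.0
Hello, I'm trying to find the difference in value between the last 2 dates of date_collected. The code works fine when the collection dates are consecutive, but I'm stuck on how to handle weekends: no data is collected then, which leaves a 2-day gap, and groupby.diff() ends up ignoring those days.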
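A minimal sketch of the problem, using the sample input listed further down in this post. The frame construction and the names raw, first_seen and changed are only for illustration. It shows that a plain groupby(...).diff() returns NaN for any date that has no earlier collection, so the weekend dates first seen on 2020-11-09 never show up as changes.

```python
import pandas as pd

# Sample rows copied from the input table in this post:
# {date_collected: {date: value}}
raw = {
    "2020-11-06": {"2020-11-01": 4, "2020-11-02": 5, "2020-11-03": 1,
                   "2020-11-04": 3, "2020-11-05": 1},
    "2020-11-09": {"2020-11-04": 3, "2020-11-05": 3, "2020-11-06": 5,
                   "2020-11-07": 1, "2020-11-08": 1},
    "2020-11-10": {"2020-11-05": 3, "2020-11-06": 5, "2020-11-07": 1,
                   "2020-11-08": 3, "2020-11-09": 2},
}
df = pd.DataFrame(
    [(c, d, v) for c, by_date in raw.items() for d, v in by_date.items()],
    columns=["date_collected", "date", "value"],
)
df["date_collected"] = pd.to_datetime(df["date_collected"])
df["date"] = pd.to_datetime(df["date"])

# Difference against the previous collection of the same date.
df["change"] = df.groupby("date")["value"].diff()

# Dates with no earlier collection come back as NaN rather than as a change.
first_seen = df[df["change"].isna()]

# Only dates that were already present in an earlier collection can register.
changed = df[df["change"].notna() & (df["change"] != 0)]
print(changed[["date_collected", "date", "change"]])
```

On this data only 2020-11-05 (under 2020-11-09) and 2020-11-08 (under 2020-11-10) register as changes; the other rows are either unchanged or NaN.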
With the code below I was able to slice out 2 specific dates and get the result using .reindex_like():
# (working example when specifically slicing on 2 dates)
prior_date = df.loc[df['date_collected'] == '2020-11-06']
current = df.loc[df['date_collected'] == '2020-11-09']
prior_date = prior_date.set_index('date')['value']
current = current.set_index('date')['value']
prior_date = prior_date.reindex_like(current).fillna(0)
df = (current - prior_date).reset_index()
change = df[df['value'] != 0].dropna(axis=0)
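However, when I try to reproduce these results for the whole dataframe, I can't find a way to use reindex_like on a MultiIndex. I tried using .last(), but then realized that the missing weekends become a problem.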
# (current result down below)
chng = df.set_index(['date_collected', 'date'])['value']
chng = chng.groupby(level=1).diff().reset_index()
last = df.groupby('date_collected')[['date', 'value']].last().reset_index()
chng = chng.set_index(['date_collected', 'value'])
last = last.set_index(['date_collected', 'value'])
chng = chng.fillna(last)
chng = chng[chng['value'] != 0].dropna()
# input data
+----------------+------------+-------+
| date_collected | date | value |
+----------------+------------+-------+
| 2020-11-06 | 2020-11-01 | 4 |
| 2020-11-06 | 2020-11-02 | 5 |
| 2020-11-06 | 2020-11-03 | 1 |
| 2020-11-06 | 2020-11-04 | 3 |
| 2020-11-06 | 2020-11-05 | 1 |
| 2020-11-09 | 2020-11-04 | 3 |
| 2020-11-09 | 2020-11-05 | 3 |
| 2020-11-09 | 2020-11-06 | 5 |
| 2020-11-09 | 2020-11-07 | 1 |
| 2020-11-09 | 2020-11-08 | 1 |
| 2020-11-10 | 2020-11-05 | 3 |
| 2020-11-10 | 2020-11-06 | 5 |
| 2020-11-10 | 2020-11-07 | 1 |
| 2020-11-10 | 2020-11-08 | 3 |
| 2020-11-10 | 2020-11-09 | 2 |
+----------------+------------+-------+
# wanted results
+----------------+------------+-------+
| date_collected | date | value |
+----------------+------------+-------+
| 2020-11-06 | 2020-11-05 | 1 |
| 2020-11-09 | 2020-11-05 | 2 |
| 2020-11-09 | 2020-11-06 | 5 |
| 2020-11-09 | 2020-11-07 | 1 |
| 2020-11-09 | 2020-11-08 | 1 |
| 2020-11-10 | 2020-11-08 | 2 |
| 2020-11-10 | 2020-11-09 | 2 |
+----------------+------------+-------+
# current results
+----------------+------------+-------+
| date_collected | date | value |
+----------------+------------+-------+
| 2020-11-06 | 2020-11-05 | 1 |
| 2020-11-09 | 2020-11-05 | 2 |
| 2020-11-09 | 2020-11-08 | 1 |
| 2020-11-10 | 2020-11-08 | 2 |
| 2020-11-10 | 2020-11-09 | 2 |
+----------------+------------+-------+
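One way the two-date slice from earlier could be extended to the whole dataframe is a plain loop that repeats the reindex_like step for every consecutive pair of collection dates. This is only a sketch: the function name changes_between_collections is hypothetical, it assumes at least two collection dates, and it assumes a date missing from the prior collection counts as 0. It emits nothing for the first collection date, which has no prior to compare against, so the 2020-11-06 row of the wanted table is not reproduced.

```python
import pandas as pd

# Sample rows copied from the input table: {date_collected: {date: value}}
raw = {
    "2020-11-06": {"2020-11-01": 4, "2020-11-02": 5, "2020-11-03": 1,
                   "2020-11-04": 3, "2020-11-05": 1},
    "2020-11-09": {"2020-11-04": 3, "2020-11-05": 3, "2020-11-06": 5,
                   "2020-11-07": 1, "2020-11-08": 1},
    "2020-11-10": {"2020-11-05": 3, "2020-11-06": 5, "2020-11-07": 1,
                   "2020-11-08": 3, "2020-11-09": 2},
}
df = pd.DataFrame(
    [(c, d, v) for c, by_date in raw.items() for d, v in by_date.items()],
    columns=["date_collected", "date", "value"],
)
df["date_collected"] = pd.to_datetime(df["date_collected"])
df["date"] = pd.to_datetime(df["date"])


def changes_between_collections(df):
    """Hypothetical helper: diff each collection against the one before it."""
    collected = df["date_collected"].drop_duplicates().sort_values().tolist()
    pieces = []
    for prior_day, current_day in zip(collected, collected[1:]):
        prior = df.loc[df["date_collected"] == prior_day].set_index("date")["value"]
        current = df.loc[df["date_collected"] == current_day].set_index("date")["value"]
        # Align the prior collection to the current dates; unseen dates become 0.
        prior = prior.reindex_like(current).fillna(0)
        delta = current - prior
        delta = delta[delta != 0]  # keep only dates whose value changed
        pieces.append(delta.reset_index().assign(date_collected=current_day))
    return pd.concat(pieces, ignore_index=True)[["date_collected", "date", "value"]]


result = changes_between_collections(df)
print(result)
```

Because each pair is aligned to the current collection's dates, dates that only exist in the prior collection are dropped instead of showing up as negative changes.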
I was able to figure out how to reindex the MultiIndex and got the wanted results with the following code:
dates = pd.date_range(df['date'].min(), df['date'].max())
new_idx = pd.MultiIndex.from_product([df['date_collected'].unique(), dates])
df = df.set_index(['date_collected', 'date'])
df = df.reindex(new_idx).fillna(0)
chng = df.groupby(level=1).diff()
chng = chng[chng['value'] != 0].dropna()
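For reference, a self-contained run of the reindexing approach on the sample input. Traced on this data, the full date grid also yields negative rows for dates that dropped out of a later collection (for example 2020-11-01 under 2020-11-09) and nothing for the first collection date. The final isin filter, which keeps only (date_collected, date) pairs that were actually collected, is an added assumption about the intent and not part of the snippet above; the names observed, full and kept are illustrative.

```python
import pandas as pd

# Sample rows copied from the input table: {date_collected: {date: value}}
raw = {
    "2020-11-06": {"2020-11-01": 4, "2020-11-02": 5, "2020-11-03": 1,
                   "2020-11-04": 3, "2020-11-05": 1},
    "2020-11-09": {"2020-11-04": 3, "2020-11-05": 3, "2020-11-06": 5,
                   "2020-11-07": 1, "2020-11-08": 1},
    "2020-11-10": {"2020-11-05": 3, "2020-11-06": 5, "2020-11-07": 1,
                   "2020-11-08": 3, "2020-11-09": 2},
}
df = pd.DataFrame(
    [(c, d, v) for c, by_date in raw.items() for d, v in by_date.items()],
    columns=["date_collected", "date", "value"],
)
df["date_collected"] = pd.to_datetime(df["date_collected"])
df["date"] = pd.to_datetime(df["date"])

observed = df.set_index(["date_collected", "date"])

# Build every (date_collected, date) combination and fill the gaps with 0.
dates = pd.date_range(df["date"].min(), df["date"].max())
new_idx = pd.MultiIndex.from_product(
    [df["date_collected"].unique(), dates], names=["date_collected", "date"]
)
full = observed.reindex(new_idx).fillna(0)

# Diff each date against the previous collection, then drop zeros and NaNs.
chng = full.groupby(level="date")["value"].diff()
chng = chng[chng != 0].dropna()

# Assumption: only pairs that exist in the original data are of interest.
kept = chng[chng.index.isin(observed.index)]
print(kept.reset_index())
```

Without the filter, chng has 10 rows on this input, 4 of them negative; with it, 6 rows remain, matching the 2020-11-09 and 2020-11-10 rows of the wanted table.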