Pandas: calculate the diff between all values in one group and the last value of the previous group
Say I have a pandas DataFrame like this:
import pandas as pd
df = pd.DataFrame({'val': [30, 40, 50, 60, 70, 80, 90], 'idx': [9, 8, 7, 6, 5, 4, 3],
                   'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c']}).set_index('idx')
Output:
     val category
idx
9     30        a
8     40        a
7     50        b
6     60        b
5     70        c
4     80        c
3     90        c
I want to add a new column holding the difference between each 'val' and the last 'val' of the previous category. The new column should look like this:
    category  diff  val
idx
9          a   nan   30
8          a   nan   40
7          b    10   50
6          b    20   60
5          c    10   70
4          c    20   80
3          c    30   90
Currently I do something like this:
temp_df = df.groupby('category')['val'].agg('last').rename('lastVal').shift()
df = df.merge(temp_df, left_on='category', right_index=True, how='outer')
df['diff'] = df['val'] - df['lastVal']
However, it is slow. Is there a better way?
You can offload the mapping to pandas by first setting category as the index, so that the subtraction aligns on the category labels:
df2 = df.set_index('category')
df['diff'] = (
    df2['val'] - df.groupby('category')['val'].last().shift()
).to_numpy()
df
     val category  diff
idx
9     30        a   NaN
8     40        a   NaN
7     50        b  10.0
6     60        b  20.0
5     70        c  10.0
4     80        c  20.0
3     90        c  30.0
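To see why category 'a' ends up with NaN, it may help to look at the intermediate lookup on its own. The snippet below is a minimal sketch that rebuilds the question's example frame; the names `last_per_cat` and `prev_last` are illustrative and do not appear in the original answer.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {"val": [30, 40, 50, 60, 70, 80, 90],
     "idx": [9, 8, 7, 6, 5, 4, 3],
     "category": ["a", "a", "b", "b", "c", "c", "c"]}
).set_index("idx")

# Last 'val' within each category: a -> 40, b -> 60, c -> 90.
last_per_cat = df.groupby("category")["val"].last()

# Shift down by one so each category sees the previous category's last value.
# The first category has no predecessor, so it gets NaN: a -> NaN, b -> 40.0, c -> 60.0.
prev_last = last_per_cat.shift()

print(prev_last)
```

Subtracting this per-category value from every row's 'val' gives the diff column shown above.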
This is roughly twice as fast as your version. Timing a variant that uses map:
%%timeit
maxdf = df.groupby('category')['val'].last().shift()
df['diff'] = df['val'] - df['category'].map(maxdf.to_dict())
1.33 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
compared with your version:
%%timeit
temp_df = df.groupby('category')['val'].agg('last').rename('lastVal').shift()
df2 = df.merge(temp_df, on='category', how='outer', right_index=True)
df2['diff'] = df2['val'] - df2['lastVal']
2.79 ms ± 83.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
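For reuse, the map-based variant could be wrapped in a small helper along these lines. This is only a sketch: the function name `add_diff_to_prev_last` and its parameters are hypothetical, and passing `sort=False` to `groupby` is an assumption added here so that "previous category" means the previous one in order of appearance rather than in sorted key order. For the example data the two orders coincide. It passes the shifted Series straight to `map` instead of converting it to a dict first.

```python
import pandas as pd

def add_diff_to_prev_last(frame, group_col="category", value_col="val"):
    """Return a copy of frame with a 'diff' column: each value minus the
    last value of the previous group (hypothetical helper, not from the answer)."""
    # sort=False (an assumption) keeps groups in order of first appearance.
    prev_last = frame.groupby(group_col, sort=False)[value_col].last().shift()
    out = frame.copy()
    # map looks up each row's group in prev_last; the first group maps to NaN.
    out["diff"] = out[value_col] - out[group_col].map(prev_last)
    return out

df = pd.DataFrame(
    {"val": [30, 40, 50, 60, 70, 80, 90],
     "idx": [9, 8, 7, 6, 5, 4, 3],
     "category": ["a", "a", "b", "b", "c", "c", "c"]}
).set_index("idx")

result = add_diff_to_prev_last(df)
print(result)
```

Because `map` works row by row on the 'category' column, the original 'idx' index and row order are left untouched, and no merge is needed.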