How to use Pandas rolling with groupby and removing duplicates
I have a table like this:
date ID Value
1 aa 5.5
1 aa 5.5
1 bb 66
1 bb 66
2 cc 2.03
2 aa 0.1
2 aa 0.1
3 bb 7
4 dd 7
5 aa 4
5 aa 4
Some information about the rows:
date - the same date can appear in more than one row
ID - the same ID can appear in more than one row
If ID and date are the same, then Value is also the same.
I want to calculate rolling().mean() of the Value column. But I want to group by date and ID, and calculate the rolling mean so that rows with the same date are not averaged more than once.
So this does not work, because it takes the rolling mean over the same date twice:
df.groupby(["ID","date"])["Value"].rolling(3).mean()
I have implemented a for-loop solution, but it is much slower, and I am working with millions of rows. This is my current solution:
# Iterate over each distinct ID once
uniqueID = df["ID"].unique()
for idname in uniqueID:
    # Rolling mean over the de-duplicated dates of this ID
    temp = df.loc[df["ID"] == idname].drop_duplicates(inplace=False,keep="first",subset="date").set_index('date', inplace=False)["Value"].rolling(3).mean()
    # Map the per-date result back onto every row of this ID
    df.loc[df["ID"] == idname,"rollingmeanCol"] = pd.merge(df.loc[df["ID"]==idname,["date"]],temp,on=["date"],how="left")["Value"].values
Is there a faster solution without loops?
Also, inside the loop I run this query three times. Is there a way to run it only once?
df.loc[df["ID"] == idname]
Expected output (can be verified using the loop code above):
date ID Value rollingmeanCol
1 aa 5.5 NaN
1 aa 5.5 NaN
1 bb 66.0 NaN
1 bb 66.0 NaN
2 cc 2.03 NaN
2 aa 0.1 NaN
2 aa 0.1 NaN
3 bb 7.0 NaN
4 dd 7.0 NaN
5 aa 4.0 3.1999999999999997
5 aa 4.0 3.1999999999999997
You can try this:
# Calculate rolling mean
rollingmeanCol = (
    df
    .drop_duplicates()
    .sort_values('date')
    .set_index('date')
    .groupby("ID")["Value"].rolling(3).mean()
    .rename('rollingmeanCol')
)
# Merge it with your df
df = df.merge(rollingmeanCol, on=['date','ID'])
The output matches your expected output.
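For anyone who wants to check this against the sample data, here is a minimal self-contained sketch of the same idea: de-duplicate, roll per ID, merge back. It is a variation and not the exact code above. The names `unique_rows` and `result`, the explicit `subset=["ID", "date"]`, and `how="left"` are my own choices for illustration. The left merge should keep every original row in its original order.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 5],
    "ID": ["aa", "aa", "bb", "bb", "cc", "aa", "aa", "bb", "dd", "aa", "aa"],
    "Value": [5.5, 5.5, 66, 66, 2.03, 0.1, 0.1, 7, 7, 4, 4],
})

# Keep one row per (ID, date) pair, ordered by date within each ID.
unique_rows = (
    df.drop_duplicates(subset=["ID", "date"])
      .sort_values(["ID", "date"])
)

# Rolling mean over the last 3 distinct dates of each ID.
# Dropping the "ID" index level leaves the original row index,
# so the result aligns with unique_rows on assignment.
rolled = (
    unique_rows.groupby("ID")["Value"]
    .rolling(3)
    .mean()
    .reset_index(level=0, drop=True)
)
unique_rows = unique_rows.assign(rollingmeanCol=rolled)

# Broadcast the per-(ID, date) result back onto every original row.
result = df.merge(
    unique_rows[["ID", "date", "rollingmeanCol"]],
    on=["ID", "date"],
    how="left",
)
print(result)
```

Only ID aa has three distinct dates (1, 2, 5), so only its rows at date 5 get a non-NaN value, the mean of 5.5, 0.1 and 4.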