How to use Pandas rolling with groupby while removing duplicates

I have a table like this:

date    ID    Value
1       aa     5.5
1       aa     5.5
1       bb     66
1       bb     66
2       cc     2.03
2       aa     0.1
2       aa     0.1
3       bb     7
4       dd     7
5       aa     4
5       aa     4

Some information about the rows:

date - The same date can appear in more than one row.
ID - The same ID can appear in more than one row.
If ID and date are the same, then Value is also the same.

I want to calculate the rolling().mean() of the Value column. But I want to group by date and ID so that the rolling mean does not count rows with the same date more than once.

So this does not work, because it counts the same date twice in the rolling mean:

df.groupby(["ID","date"])["Value"].rolling(3).mean()

I have implemented a for-loop solution, but it is very slow, and I am working with millions of rows. This is my current solution:

uniqueID = df["ID"].unique()
for idname in uniqueID:
    temp = df.loc[df["ID"] == idname].drop_duplicates(inplace=False,keep="first",subset="date").set_index('date', inplace=False)["Value"].rolling(3).mean()
    df.loc[df["ID"] == idname,"rollingmeanCol"] = pd.merge(df.loc[df["ID"]==idname,["date"]],temp,on=["date"],how="left")["Value"].values

Is there a faster solution without loops?

Also, inside the loop I run the query df.loc[df["ID"] == idname] three times. Is there a way to run it only once?

Expected output (can be verified using the loop code above):

date   ID   Value   rollingmeanCol
1   aa     5.5        NaN
1   aa     5.5        NaN
1   bb     66.0       NaN
1   bb     66.0       NaN
2   cc     2.03       NaN
2   aa     0.1        NaN
2   aa     0.1        NaN
3   bb     7.0        NaN
4   dd     7.0        NaN
5   aa     4.0        3.1999999999999997
5   aa     4.0        3.1999999999999997

You can try this:

# Calculate rolling mean
rollingmeanCol = (
    df
    .drop_duplicates()
    .sort_values('date')
    .set_index('date')
    .groupby("ID")["Value"].rolling(3).mean()
    .rename('rollingmeanCol')
)

# Merge it with your df
df = df.merge(rollingmeanCol, on=['date','ID'])

The output is the same as your expected output.
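
For anyone who wants to check this on the sample data, here is a minimal, self-contained sketch of the same idea: de-duplicate, roll per ID, then merge back. It is a variant, not the exact code above. The names unique_rows and result, the explicit subset=["ID", "date"], the use of transform, and the how="left" / validate="many_to_one" arguments are my own choices, and it assumes that rows with the same ID and date always carry the same Value, as stated in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date":  [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 5],
    "ID":    ["aa", "aa", "bb", "bb", "cc", "aa", "aa", "bb", "dd", "aa", "aa"],
    "Value": [5.5, 5.5, 66, 66, 2.03, 0.1, 0.1, 7, 7, 4, 4],
})

# Keep one row per (ID, date) so each date enters the window only once,
# and sort so the rolling window follows date order within each ID.
unique_rows = df.drop_duplicates(subset=["ID", "date"]).sort_values(["ID", "date"])

# transform returns a Series aligned with unique_rows' index,
# so the result can be assigned directly as a new column.
unique_rows["rollingmeanCol"] = (
    unique_rows.groupby("ID")["Value"].transform(lambda s: s.rolling(3).mean())
)

# Broadcast the per-(ID, date) result back onto every original row.
# how="left" keeps all rows of df in their original order;
# validate raises if the right side is not unique per key.
result = df.merge(
    unique_rows[["ID", "date", "rollingmeanCol"]],
    on=["ID", "date"],
    how="left",
    validate="many_to_one",
)
print(result)
```

On the sample data only ID aa has three distinct dates, so only its two rows on date 5 get a value (about 3.2); every other row stays NaN, which matches the expected output in the question. The whole thing runs without a Python-level loop over IDs, so the per-ID query from the question is not needed at all.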
