
Find moving average of the smallest n values in m sized window

I have data on the values of personal stocks like this:

UserId Stock Value    Time
1        APL  20  '2019-01-01'
1        MCR  40  '2019-01-01'
1        ADX  60  '2019-01-01'
3        AGL  10  '2019-01-01'
...

I have to group by user and, for each stock x, find the average value of the 10 most valuable stocks among that user's 20 most recent stocks before stock x. Thus, I first group by UserId, then iterate through each stock x doing the following: select the user's 20 most recent stocks before stock x, select the 10 most valuable stocks from that window, take their average, and add it in a new column for stock x. My dataset would look something like this afterwards:

UserId Stock Value    Time    MovingAverage
1        APL  20  '2019-01-01'     20
1        MCR  40  '2019-01-01'     30
1        ADX  60  '2019-01-01'     40
3        AGL  10  '2019-01-01'     10
...

So far, I have been trying to use rolling in Python as follows:

# ascending takes booleans, not the strings 'true'/'false'
df = df.sort_values(['UserId', 'Time'], ascending=[True, False])
df['roll'] = df.groupby('UserId')['Value'].transform(lambda x: x.rolling(20, min_periods=1).mean())

I can't figure out how to get the mean of the 10 highest values in a window! I am not against using a technique other than rolling; it just seemed like the most popular method.

Another issue is that some stocks will have fewer than 20 stocks before them, but I think using rolling(20, 1), i.e. a minimum of one observation per window, mitigates that issue. However, when there are fewer than 10 stocks, e.g. 8 stocks, I just need the average of those 8 stocks.

Figured it out. Posting in case anyone else is in a similar situation. I defined my own function and then simply used rolling.apply(). It ended up being fairly straightforward.

First, I defined the function that performs the behaviour described in the post above.

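import numpy as np

def gm(arr):
    # More than 10 values in the window: average only the 10 largest.
    # np.partition returns the whole array, so the top 10 must be sliced out.
    if arr.size > 10:
        x = np.partition(arr, -10)[-10:].mean()
    else:
        # 10 or fewer values: average all of them.
        x = arr.mean()
    return x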
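A quick check of how np.partition behaves may be useful at this point. As far as I understand it, np.partition returns an array of the same length as its input, merely rearranged around the kth position, so the largest values still have to be sliced off before averaging. The array and the value of k below are made up purely for illustration:

```python
import numpy as np

arr = np.array([5.0, 1.0, 9.0, 3.0, 7.0, 2.0])  # illustrative values only
k = 3  # hypothetical: number of largest values to keep

# np.partition returns an array of the same length as the input; with
# kth=-k, the last k positions hold the k largest values (in no set order).
part = np.partition(arr, -k)
top_k = part[-k:]

full_mean = part.mean()   # mean of all six values
top_mean = top_k.mean()   # mean of the three largest values only
```

For the k smallest values instead, the equivalent slice would be np.partition(arr, k - 1)[:k].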

Then, rolling.apply() worked its magic:

# Column names match the table above (UserId, Value); gm is passed directly.
newcol = df.groupby('UserId')['Value'].rolling(20, min_periods=1).apply(gm, raw=True)
df['roll'] = newcol.reset_index(level=0, drop=True)

I am still not sure about the indexing at the end, but the results seem to be what I want.
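On the indexing question, here is a small self-contained sketch that may help. It assumes the rows are already in chronological order within each user, and the helper name top_n_mean and its parameter n are my own for illustration, not part of the original answer. The grouped rolling result carries a two-level index (the group key plus the original row label), so dropping the first level leaves the original labels and the assignment lines up by label, even when users are interleaved:

```python
import numpy as np
import pandas as pd

def top_n_mean(arr, n=10):
    """Mean of the n largest values in arr, or of all values if there are n or fewer."""
    if arr.size > n:
        return np.partition(arr, -n)[-n:].mean()
    return arr.mean()

# Sample rows from the question, deliberately interleaved across users.
df = pd.DataFrame({
    "UserId": [1, 3, 1, 1],
    "Stock": ["APL", "AGL", "MCR", "ADX"],
    "Value": [20.0, 10.0, 40.0, 60.0],
})

rolled = (
    df.groupby("UserId")["Value"]
      .rolling(20, min_periods=1)
      .apply(top_n_mean, raw=True)
)

# rolled is indexed by (UserId, original row label). Dropping level 0
# leaves the original row labels, so the assignment aligns by label.
df["roll"] = rolled.reset_index(level=0, drop=True)
```

With this data, user 1 gets 20, 30 and 40 and user 3 gets 10, matching the MovingAverage column in the question.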
