Pandas groupby ewm

Question

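I have labeled event (time series) data in which events occur at random intervals for a given label. I would like to compute a within-group EWMA and add it to the dataframe as a new column "X1_EWMA". Here is the code so far: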

import pandas as pd
import numpy as np
import altair as alt

n = 1000
df = pd.DataFrame({
    'T': pd.date_range('20190101', periods=n, freq='H'),
    'C1': np.random.choice(list('PYTHON'), n),
    'C2': np.random.choice(list('FUN'), n),
    'X1': np.random.randn(n),
    'X2': 100 + 10 * np.random.randn(n)
})

ts = df.set_index('T')

display(df.head())
display(ts.head())

Thanks to SO (Pandas Groupby and apply method with custom function), I was able to compute the grouped EWMA:

ewm = ts.groupby(['C1']).apply(lambda x: x['X1'].ewm(halflife=10).mean())
ewm.head()

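It produces a series indexed by the categorical variable and the datetime. The series has the same length as the original dataframe and time series (df and ts).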

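Now I suppose I could do some gymnastics to get it back into the original dataframe (df) by joining on the row index (assuming the sort order hasn't changed), but that doesn't seem right and may even be risky, because the groupby is on only one of the categorical labels. I would need to be careful and do some checking/sorting/re-indexing.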

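It seems there should be an easier way to add a time series column directly to the dataframe (df) or the time series (ts), without creating a separate series or dataframe and joining them. The same applies if I want to add a rolling statistic, for example: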

ts.groupby('C1').rolling(10).mean()

Thanks in advance for any help or input.

Result based on the accepted answer:

import pandas as pd
import numpy as np
import math
import altair as alt

alt.renderers.enable('notebook')      # for rendering in the notebook
alt.data_transformers.enable('json')  # for plotting data larger than 5000 points

# make a dataframe to test
n = 1000
df = pd.DataFrame({
    'T': pd.date_range('20190101', periods=n, freq='H'),
    'C1': np.random.choice(list('PYTHON'), n),
    'C2': np.random.choice(list('FUN'), n),
    'X1': np.linspace(0, 2*math.pi, n),
    'X2': np.random.randn(n),
})

# add a new variable that is a function of X1, X2 + a random outlier probability
df['X3'] = 0.2 * df['X2'] + np.sin(df['X1']) + np.random.choice(a=[0, 2], size=n, p=[0.98, 0.02])

# make it a time series for later resampling use cases.
ts = df.set_index('T')

#  SOLUTION: Add the ewma line with groupby().transform().
ts['ewm'] = ts.groupby(['C1'])['X3'].transform(lambda x: x.ewm(halflife=1).mean())

# plot the points and ewma using altair faceting and layering
points = alt.Chart().mark_circle(size=20, opacity=0.9).encode(
    x = 'T', 
    y = 'X3',
    color = 'C2',
).properties(width=270, height=170)

lines = alt.Chart().mark_line(size=1, color='red', opacity=1).encode(
    x = 'T', 
    y = 'ewm'
)

alt.layer(points, lines).facet(facet='C1', data=ts.reset_index()).properties(columns=3)

Answer 1

Let's use transform to solve this:

ts['ewm'] = ts.groupby(['C1'])['X1'].transform(lambda x: x.ewm(halflife=10).mean())
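As a sanity check on why transform fits here, the sketch below uses a small seeded toy frame (the data and the `roll` column name are assumptions for illustration). It adds a grouped EWMA and, in the same way, a grouped rolling mean, then compares one group against the same calculation done on that group's rows alone.

```python
import numpy as np
import pandas as pd

# Toy data; the alternating labels give two groups of 10 rows each.
rng = np.random.default_rng(1)
n = 20
ts = pd.DataFrame({
    'T': pd.date_range('2019-01-01', periods=n, freq='D'),
    'C1': np.tile(list('PY'), n // 2),
    'X1': rng.standard_normal(n),
}).set_index('T')

grouped = ts.groupby('C1')['X1']

# transform returns a result indexed like ts, so plain assignment lines up.
ts['ewm'] = grouped.transform(lambda x: x.ewm(halflife=10).mean())
ts['roll'] = grouped.transform(lambda x: x.rolling(3).mean())

# Cross-check group 'P' against computing on its rows in isolation.
p_rows = ts[ts['C1'] == 'P']
expected = p_rows['X1'].ewm(halflife=10).mean()
print(np.allclose(p_rows['ewm'], expected))
```

Because the result keeps the original index, no join or re-sorting step is needed afterwards.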

Answer 2

Can you try this? Don't set ts = df.set_index('T'). Then you can do the following:

df['ewm'] = df.groupby(['C1'], sort=False).apply(lambda x: x['X1'].ewm(halflife=10).mean()).reset_index(level=0, drop=True)

Answer 3

For large datasets, the accepted answer is very slow.

What I did was:

ts['ewm'] = ts.groupby('C1')['X1'].ewm(halflife=10).mean().droplevel(0)

It seems to work fine.
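A quick way to check that the built-in grouped EWM gives the same numbers as the transform approach is sketched below. The seeded toy data is an assumption, and `groupby(...).ewm` requires a reasonably recent pandas version. The grouped result comes back indexed by (C1, T) in group order, so the sketch drops the group level and lets assignment align on T instead of using raw values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
ts = pd.DataFrame({
    'T': pd.date_range('2019-01-01', periods=n, freq='D'),
    'C1': rng.choice(list('PYTHON'), n),
    'X1': rng.standard_normal(n),
}).set_index('T')

# Reference: the lambda-based transform from the accepted answer.
slow = ts.groupby('C1')['X1'].transform(lambda x: x.ewm(halflife=10).mean())

# Built-in grouped EWM on a single column; index is (C1, T), sorted by group.
fast = ts.groupby('C1')['X1'].ewm(halflife=10).mean()

# Drop the group level so the assignment aligns on T rather than on position.
ts['ewm'] = fast.droplevel(0)
print(np.allclose(ts['ewm'], slow))
```

This alignment relies on the timestamps in T being unique; with duplicate timestamps, aligning on the index alone would be ambiguous.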

3 answers

Answer 1
Score 8, accepted, 2019-09-19 02:08:03

Answer 2
Score 0, 2019-09-19 02:07:57

Answer 3
Score 0, 2022-05-22 10:05:19
