简体   繁体   中英

Pandas groupby ewm

I have labeled event (time series) data where the events occur at random intervals for a given label. I would like to compute the within group ewma and add it to the dataframe as a new column "X1_EWMA". Here's the code so far:

import pandas as pd
import numpy as np
import altair as alt

n = 1000
df = pd.DataFrame({
    'T': pd.date_range('20190101', periods=n, freq='H'),
    'C1': np.random.choice(list('PYTHON'), n),
    'C2': np.random.choice(list('FUN'), n),
    'X1': np.random.randn(n),
    'X2': 100 + 10 * np.random.randn(n)
})

ts = df.set_index('T')

display(df.head())
display(ts.head())

Thanks to SO: Pandas Groupby and apply method with custom function ), I am able to compute the grouped EWMA with:

ewm = ts.groupby(['C1']).apply(lambda x: x['X1'].ewm(halflife=10).mean())
ewm.head()

It produces a series, indexed by one of the categorical variables and the datetime. The length of the series is the same as the original dataframe and times series (df and ts)

Now I think I could do some gymnastics to get this joined back together to the original dataframe (df) by joining on row index (assuming sort orders didn't change), but that does not seem right, and may even be a risky approach, since the groupby was within only one of the categorical labels - I'd need to be careful and do some checks/sorts/re-indexing.

Seems like there should be an easier method to add a times series column directly to the dataframe (df) or time series (ts) without creating separate series or dataframes and joining them. Same would be true if I wanted to add a rolling statistic, such as:

ts.groupby('C1').rolling(10).mean()

Thanks in advance for any help or inputs.

Results based on accepted answer:

import pandas as pd
import numpy as np
import math
import altair as alt

alt.renderers.enable('notebook')      # for rendering in the notebook
alt.data_transformers.enable('json')  # for plotting data larger than 5000 points

# make a dataframe to test
n = 1000
df = pd.DataFrame({
    'T': pd.date_range('20190101', periods=n, freq='H'),
    'C1': np.random.choice(list('PYTHON'), n),
    'C2': np.random.choice(list('FUN'), n),
    'X1': np.linspace(0, 2*math.pi, n),
    'X2': np.random.randn(n),
})

# add a new variable that is a function of X1, X2 + a random outlier probability
df['X3'] = 0.2 * df['X2'] + np.sin(df['X1']) + np.random.choice(a=[0, 2], size=n, p=[0.98, 0.02])

# make it a time series for later resampling use cases.
ts = df.set_index('T')

#  SOLUTION: Add the ewma line with groupby().transform().
ts['ewm'] = ts.groupby(['C1'])['X3'].transform(lambda x: x.ewm(halflife=1).mean())

# plot the points and ewma using altair faceting and layering
points = alt.Chart().mark_circle(size=20, opacity=0.9).encode(
    x = 'T', 
    y = 'X3',
    color = 'C2',
).properties(width=270, height=170)

lines = alt.Chart().mark_line(size=1, color='red', opacity=1).encode(
    x = 'T', 
    y = 'ewm'
)

alt.layer(points, lines).facet(facet='C1', data=ts.reset_index()).properties(columns=3)

在此处输入图像描述

Let's fix the problem, using transform :

t['ewm'] = ts.groupby(['C1'])['X1'].transform(lambda x: x.ewm(halflife=10).mean()).values()

Can you try this? Do NOT set ts = df.set_index('T') . Then you can do as below

ts['ewm']=ts.groupby(['C1'], sort=False).apply(lambda x: x['X1'].ewm(halflife=10).mean()).reset_index(drop=True)

The accepted answer is very slow for large datasets.

What I have done is:

ts['ewm'] = ts.groupby(['C1']).ewm(halflife=10).mean().values

and it seems to work just fine

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM