Pandas: Zigzag segmentation of data based on local minima-maxima

I have time series data. Generating the data:

date_rng = pd.date_range('2019-01-01', freq='s', periods=400)
df = pd.DataFrame(np.random.lognormal(.005, .5,size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index= date_rng)
s = df['data1']

I want to create a zig-zag line connecting the local maxima and local minima, subject to the condition that, on the y-axis, |highest - lowest value| of each zig-zag segment must exceed both a percentage (say 20%) of the distance of the previous zig-zag segment AND a pre-stated value k (say 1.2).
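
One literal reading of this condition can be written as a small check, shown here only as a minimal sketch. The function name `segment_is_significant` and its parameters are hypothetical and do not come from the code in this post; it assumes the heights of the candidate segment and of the previously accepted segment are already known.

```python
def segment_is_significant(segment_height, previous_height, p=0.2, k=1.2):
    """Hypothetical helper: test one candidate zig-zag segment.

    segment_height  -- |highest - lowest value| of the candidate segment
    previous_height -- the same quantity for the previously accepted segment
    p               -- required fraction of the previous segment (0.2 = 20%)
    k               -- required absolute minimum height
    """
    exceeds_fraction = segment_height > p * previous_height
    exceeds_absolute = segment_height > k
    # Both conditions must hold (the "AND" in the description above).
    return exceeds_fraction and exceeds_absolute


print(segment_is_significant(1.5, 2.0))   # True: 1.5 > 0.4 and 1.5 > 1.2
print(segment_is_significant(1.0, 2.0))   # False: 1.0 is not above k = 1.2
print(segment_is_significant(1.3, 10.0))  # False: 1.3 is not above 20% of 10.0
```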

I can find the local extrema using this code:

# Find peaks(max).
peak_indexes = signal.argrelextrema(s.values, np.greater)
peak_indexes = peak_indexes[0]

# Find valleys(min).
valley_indexes = signal.argrelextrema(s.values, np.less)
valley_indexes = valley_indexes[0]
# Merge peaks and valleys data points using pandas.
# Use .iloc because the peak/valley indexes are integer positions, not index labels.
df_peaks = pd.DataFrame({'date': s.index[peak_indexes], 'zigzag_y': s.iloc[peak_indexes]})
df_valleys = pd.DataFrame({'date': s.index[valley_indexes], 'zigzag_y': s.iloc[valley_indexes]})
df_peaks_valleys = pd.concat([df_peaks, df_valleys], axis=0, ignore_index=True, sort=True)

# Sort peak and valley datapoints by date.
df_peaks_valleys = df_peaks_valleys.sort_values(by=['date'])

but I don't know how to apply the threshold condition to it. Please advise me on how to apply such a condition.

Since the data could contain millions of timestamps, an efficient calculation is highly recommended.

For a clearer description: [image]

Example output, from my data:

# Instantiate axes.
(fig, ax) = plt.subplots()
# Plot zigzag trendline.
ax.plot(df_peaks_valleys['date'].values, df_peaks_valleys['zigzag_y'].values,
        color='red', label="Zigzag")

# Plot original line.
ax.plot(s.index, s, linestyle='dashed', color='black', label="Org. line", linewidth=1)

# Format time.
ax.xaxis_date()
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))

plt.gcf().autofmt_xdate()   # Beautify the x-labels
plt.autoscale(tight=True)

plt.legend(loc='best')
plt.grid(True, linestyle='dashed')

[image]

My desired output (something similar to this; the zigzag only connects the significant segments): [image]

I have answered to the best of my understanding of the question. However, it is not clear to me how the variable k influences the filter.

You want to filter the extrema based on a running condition. I assume that you want to mark all extrema whose relative distance to the last marked extremum is larger than p%. I further assume that you always consider the first element of the time series a valid/relevant point.

I implemented this with the following filter function:

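def filter(values, percentage):
    # Note: this name shadows Python's built-in filter.
    # Work on a plain array so that indexing is positional, not label-based.
    values = np.asarray(values)
    # the first value is always valid
    previous = values[0]
    mask = [True]
    for value in values[1:]:
        # change relative to the last marked extremum
        relative_difference = np.abs(value - previous) / np.abs(previous)
        if relative_difference > percentage:
            previous = value
            mask.append(True)
        else:
            mask.append(False)
    return mask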
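
To make the running condition concrete, here is a small worked example on five hand-picked values. It is a self-contained sketch: `mark_significant` is a hypothetical name for the same loop as above, renamed so it does not shadow the built-in `filter`, and the input numbers are made up for illustration.

```python
import numpy as np


def mark_significant(values, percentage):
    """Hypothetical copy of the running filter, operating on positions."""
    values = np.asarray(values, dtype=float)
    previous = values[0]          # the first point is always kept
    mask = [True]
    for value in values[1:]:
        # Relative change with respect to the last kept point.
        if abs(value - previous) / abs(previous) > percentage:
            previous = value
            mask.append(True)
        else:
            mask.append(False)
    return mask


# 1.1 is within 20% of 1.0 (dropped); 1.5 is 50% above 1.0 (kept);
# 1.4 is within 20% of 1.5 (dropped); 1.0 is about 33% below 1.5 (kept).
print(mark_significant([1.0, 1.1, 1.5, 1.4, 1.0], 0.2))
# [True, False, True, False, True]
```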

To run your code, I first import the dependencies:

from scipy import signal
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

To make the code reproducible, I fix the random seed:

np.random.seed(0)

The rest from here is copied from your question. Note that I decreased the number of samples to make the result clearer.

date_rng = pd.date_range('2019-01-01', freq='s', periods=30)
df = pd.DataFrame(np.random.lognormal(.005, .5,size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index= date_rng)
s = df['data1']
# Find peaks(max).
peak_indexes = signal.argrelextrema(s.values, np.greater)
peak_indexes = peak_indexes[0]
# Find valleys(min).
valley_indexes = signal.argrelextrema(s.values, np.less)
valley_indexes = valley_indexes[0]
# Merge peaks and valleys data points using pandas.
# Use .iloc because the peak/valley indexes are integer positions, not index labels.
df_peaks = pd.DataFrame({'date': s.index[peak_indexes], 'zigzag_y': s.iloc[peak_indexes]})
df_valleys = pd.DataFrame({'date': s.index[valley_indexes], 'zigzag_y': s.iloc[valley_indexes]})
df_peaks_valleys = pd.concat([df_peaks, df_valleys], axis=0, ignore_index=True, sort=True)
# Sort peak and valley datapoints by date.
df_peaks_valleys = df_peaks_valleys.sort_values(by=['date'])

Then we use the filter function:

p = 0.2 # 20% 
filter_mask = filter(df_peaks_valleys.zigzag_y, p)
filtered = df_peaks_valleys[filter_mask]

Then plot, as you did, both your previous plot and the newly filtered extrema:

# Instantiate axes.
(fig, ax) = plt.subplots(figsize=(10,10))
# Plot all extrema.
ax.plot(df_peaks_valleys['date'].values, df_peaks_valleys['zigzag_y'].values,
        color='red', label="Extrema")
# Plot the filtered zigzag trendline.
ax.plot(filtered['date'].values, filtered['zigzag_y'].values,
        color='blue', label="ZigZag")

# Plot original line.
ax.plot(s.index, s, linestyle='dashed', color='black', label="Org. line", linewidth=1)

# Format time.
ax.xaxis_date()
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))

plt.gcf().autofmt_xdate()   # Beautify the x-labels
plt.autoscale(tight=True)

plt.legend(loc='best')
plt.grid(True, linestyle='dashed')

[image]

EDIT:

If you want to consider both the first and the last point as valid, you can adapt the filter function as follows:

def filter(values, percentage):
    # Work on a plain array so that indexing is positional, not label-based.
    values = np.asarray(values)
    # the first value is always valid
    previous = values[0]
    mask = [True]
    # evaluate all points from the second to the (n-1)th
    for value in values[1:-1]:
        relative_difference = np.abs(value - previous) / np.abs(previous)
        if relative_difference > percentage:
            previous = value
            mask.append(True)
        else:
            mask.append(False)
    # the last value is always valid (assumes at least two values)
    mask.append(True)
    return mask
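
Since the role of k is unclear, the following is only one possible interpretation, sketched under stated assumptions: a point is kept when its change from the last kept point exceeds both the relative threshold and the absolute value k. The name `filter_with_k` is hypothetical, and, like the filter above, it measures the percentage against the last kept value rather than against the previous segment's length.

```python
import numpy as np


def filter_with_k(values, percentage, k):
    """Hypothetical variant: keep a point only if its change from the last
    kept point exceeds BOTH percentage * |last kept value| AND k."""
    values = np.asarray(values, dtype=float)
    mask = np.zeros(len(values), dtype=bool)
    if len(values) == 0:
        return mask
    mask[0] = True                     # the first point is always kept
    previous = values[0]
    for i in range(1, len(values)):
        change = abs(values[i] - previous)
        if change > percentage * abs(previous) and change > k:
            previous = values[i]
            mask[i] = True
    return mask


# 1.5 moves by 0.5 (more than 20% of 1.0, but not more than k), so it is dropped;
# 3.0 and the final 1.0 each move by 2.0 from the last kept point and are kept.
print(filter_with_k([1.0, 1.5, 3.0, 2.9, 1.0], 0.2, 1.2).tolist())
# [True, False, True, False, True]
```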

You can use the Pandas rolling functionality to find the local extrema. That simplifies the code a little compared to your SciPy approach.

Functions to find the extrema:

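def islocalmax(x):
    """Both neighbors are lower,
    assumes a centered window of size 3"""
    x = np.asarray(x)  # positional indexing for both raw=True and raw=False
    return (x[0] < x[1]) & (x[2] < x[1])

def islocalmin(x):
    """Both neighbors are higher,
    assumes a centered window of size 3"""
    x = np.asarray(x)  # positional indexing for both raw=True and raw=False
    return (x[0] > x[1]) & (x[2] > x[1])

def isextrema(x):
    """The window center is either a local maximum or a local minimum"""
    return islocalmax(x) or islocalmin(x)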
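
Because `rolling(...).apply` calls a Python function once per window, it may be slow on the millions of rows mentioned in the question. As a possible alternative, the same three-point test can be written with shifted comparisons. This is a sketch, not part of the original answer: `extrema_mask` is a hypothetical name, and keeping both endpoints is an assumption chosen to mirror the NaN-to-True behavior described below.

```python
import pandas as pd


def extrema_mask(col):
    """Hypothetical helper: True at strict local minima/maxima and at both ends."""
    prev_val = col.shift(1)    # left neighbor (NaN at the first position)
    next_val = col.shift(-1)   # right neighbor (NaN at the last position)
    is_max = (col > prev_val) & (col > next_val)
    is_min = (col < prev_val) & (col < next_val)
    mask = is_max | is_min     # comparisons against NaN evaluate to False
    if len(mask) > 0:
        mask.iloc[0] = True    # keep the endpoints
        mask.iloc[-1] = True
    return mask


s = pd.Series([1.0, 2.0, 3.0, 2.0, 1.0, 2.0])
print(extrema_mask(s).tolist())
# [True, False, True, False, True, True]
```

The resulting boolean Series can be used for indexing in the same way as a mask built with `rolling`, for example `s[extrema_mask(s)]`.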

The function to create the zigzag can be applied to the whole DataFrame at once (over each column), but this will introduce NaNs, since the returned timestamps will differ for each column. You can easily drop these later, as shown in the example below, or simply apply the function to a single column of your DataFrame.

Note that I commented out the test against the threshold k; I'm not sure I fully understand that part correctly. You can include it if the absolute difference between the previous and current extreme needs to be bigger than k: & (ext_val.diff().abs() > k)

I'm also not sure whether the final zigzag should always move from an original high to a low or vice versa. I assumed it should; otherwise you can remove the second search for extrema at the end of the function.

def create_zigzag(col, p=0.2, k=1.2):

    # Find the local min/max
    # converting to bool converts NaN to True, which makes it include the endpoints    
    ext_loc = col.rolling(3, center=True).apply(isextrema, raw=False).astype(np.bool_)

    # extract values at local min/max
    ext_val = col[ext_loc]

    # filter locations based on threshold
    thres_ext_loc = (ext_val.diff().abs() > (ext_val.shift(-1).abs() * p)) #& (ext_val.diff().abs() > k)

    # Keep the endpoints
    thres_ext_loc.iloc[0] = True
    thres_ext_loc.iloc[-1] = True

    thres_ext_loc = thres_ext_loc[thres_ext_loc]

    # extract values at filtered locations 
    thres_ext_val = col.loc[thres_ext_loc.index]

    # again search the extrema to force the zigzag to always go from high > low or vice versa,
    # never low > low, or high > high
    ext_loc = thres_ext_val.rolling(3, center=True).apply(isextrema, raw=False).astype(np.bool_)
    thres_ext_val = thres_ext_val[ext_loc]

    return thres_ext_val
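
The alternation property that the second extrema search is meant to enforce (never low to low, never high to high) can be checked on the result. The helper below is a minimal sketch with a hypothetical name, `is_alternating`; it only inspects the sequence of pivot values.

```python
import numpy as np


def is_alternating(values):
    """Hypothetical check: successive moves must flip direction every time."""
    steps = np.sign(np.diff(np.asarray(values, dtype=float)))
    # No flat moves, and each step must point the opposite way to the one before.
    return bool(np.all(steps != 0) and np.all(steps[1:] != steps[:-1]))


print(is_alternating([1, 3, 2, 4]))  # True: up, down, up
print(is_alternating([1, 2, 3]))     # False: up, up
```

For example, `is_alternating(create_zigzag(df['data1']).values)` could be used as a quick sanity check on a single column.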

Generate some sample data:

date_rng = pd.date_range('2019-01-01', freq='s', periods=35)

df = pd.DataFrame(np.random.randn(len(date_rng), 3),
                  columns=['data1', 'data2', 'data3'],
                  index= date_rng)

df = df.cumsum()

Apply the function and extract the result for the 'data1' column:

dfzigzag = df.apply(create_zigzag)
data1_zigzag = dfzigzag['data1'].dropna()

Visualize the result:

fig, axs = plt.subplots(figsize=(10, 3))

axs.plot(df.data1, 'ko-', ms=4, label='original')
axs.plot(data1_zigzag, 'ro-', ms=4, label='zigzag')
axs.legend()

[image]
