A more efficient way to split timeseries data (pd.Series) at gaps?

I am trying to split a pd.Series of sorted dates in which some gaps between consecutive entries are larger than the normal one. To do this, I calculated the size of the gaps with pd.Series.diff() and then iterated over all the elements in the series with a while-loop. Unfortunately, this is quite computationally intensive. Is there a better way (apart from parallelization)?

Minimal example with my function:

import pandas as pd
import time


def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # list that will contain all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: add data[i] to last sample in samples_list
            # Series.append was removed in pandas 2.0; pd.concat replaces it
            samples_list[-1] = pd.concat([samples_list[-1], pd.Series(data[i])])
        else:
            # not normal gap: creating new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list


# make sample data as example
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])

# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = pd.concat([first_sample, second_sample], ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")

The real data has more than 150k entries, and the computation takes several minutes... :/

Your code is a bit unclear about how these two lists should be stored. Specifically, I'm not sure what structure you have in mind for samples_list.

Regardless, using Series.pct_change() and np.unique() you should get approximately what you're looking for.

import numpy as np

uniques, indices = np.unique(
    data_with_samples.diff()[1:].pct_change(),
    return_index=True)

Now indices points you to the start and end of the abnormal gap (as positions within the sliced series).

If your data has more than one gap, you'd want to use only diff()[1:].pct_change() and look for all values that differ from 0 using where().
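For the multi-gap case, here is a minimal sketch of the idea. It is an assumption on my part rather than the exact method described above: it compares diff() directly against the normal step instead of going through pct_change(), and it uses np.flatnonzero to collect the positions. The toy data and the names timestamps, steps and gap_positions are made up for illustration.

```python
import numpy as np
import pandas as pd

normal_distance = pd.Timedelta(minutes=10)
# Hypothetical toy data: three runs of 10-minute steps separated by two larger gaps
timestamps = pd.Series(pd.to_datetime([
    "2020-01-01 00:00", "2020-01-01 00:10", "2020-01-01 00:20",
    "2020-01-01 10:20", "2020-01-01 10:30",
    "2020-01-02 00:00",
]))

steps = timestamps.diff()
# Skip position 0 (its diff is NaT), find every step that is not the normal one,
# then shift by one so the positions refer to the original series again
gap_positions = np.flatnonzero((steps != normal_distance).to_numpy()[1:]) + 1
print(gap_positions.tolist())  # [3, 5]
```

Each reported position is the first element after a gap, so it can serve as a cut point for splitting.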

  • Same setup as in the question:
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])

# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = pd.concat([first_sample, second_sample], ignore_index=True)
  • Compare the time difference with normal_distance, both expressed in seconds.
  • Create an auxiliary column tag to separate the gap groups.
import numpy as np

# start sampling
start_time = time.time()
df = data_with_samples.to_frame()
# total_seconds() covers the whole gap; .dt.seconds would drop the days component
df['time_diff'] = df[0].diff().dt.total_seconds()
cond = (df['time_diff'] > normal_distance.total_seconds()) | (df['time_diff'].isnull())
# every gap (and the first row) starts a new tag
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum()
my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])
print(f"Duration: {time.time() - start_time}")
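The same tagging idea can probably be written without the auxiliary DataFrame. The sketch below is only one possible variant: the function name split_at_gaps and the small test data are hypothetical, and it compares Timedelta values directly instead of converting to seconds.

```python
import pandas as pd


def split_at_gaps(data: pd.Series, normal_gap: pd.Timedelta) -> list:
    # True wherever the step from the previous element exceeds the normal gap;
    # the first element has diff NaT, which compares as False
    starts_new_group = data.diff() > normal_gap
    # running count of gaps seen so far = group id of each element
    group_ids = starts_new_group.cumsum().to_numpy()
    return [group for _, group in data.groupby(group_ids)]


# Hypothetical small data set: two runs of five timestamps, 10 hours apart
normal_distance = pd.Timedelta(minutes=10)
first = pd.Series(pd.date_range("2020-01-01", periods=5, freq="10min"))
second = pd.Series(pd.date_range(first.iloc[-1] + pd.Timedelta(hours=10),
                                 periods=5, freq="10min"))
data = pd.concat([first, second], ignore_index=True)

parts = split_at_gaps(data, normal_distance)
print([len(p) for p in parts])  # [5, 5]
```

Passing the group ids as a NumPy array makes the grouping positional, so it should not depend on what the index of the series looks like.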

I'm not sure I completely understand what you want, but I think this could work:

...
data_with_samples = pd.concat([first_sample, second_sample], ignore_index=True)

idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])

idx collects the indices directly after a gap; the rest just splits the series at these indices and packs the pieces into the list samples_list.
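To make concrete what idx contains, here is a small sketch on made-up data with two gaps. The names ts, bounds and pieces are hypothetical, and the zip over bounds is just an equivalent way of writing the three-part list concatenation, assuming a default integer index.

```python
import pandas as pd

normal_distance = pd.Timedelta(minutes=10)
# Hypothetical toy data with two gaps (before positions 3 and 5)
ts = pd.Series(pd.to_datetime([
    "2020-01-01 00:00", "2020-01-01 00:10", "2020-01-01 00:20",
    "2020-01-01 10:20", "2020-01-01 10:30",
    "2020-01-02 00:00",
]))

idx = ts[ts.diff(1) > normal_distance].index
print(idx.tolist())  # [3, 5]

# Cut points plus both ends of the series; consecutive pairs delimit one piece each
bounds = [0, *idx.tolist(), len(ts)]
pieces = [ts.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
print([len(p) for p in pieces])  # [3, 2, 1]
```

If there is no gap at all, bounds is just [0, len(ts)] and the result is a single piece, so the separate len(idx) > 0 check is not needed in this formulation.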

If the index is non-standard, you need some overhead (resetting the index and later setting it back to the original) to make sure that iloc can be used.

...
data_with_samples = pd.concat([first_sample, second_sample], ignore_index=True)

data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})
idx = data_with_samples.data[data_with_samples.data.diff(1) > normal_distance].index
data_with_samples.set_index('index', drop=True, inplace=True)
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])

(You don't need that for your example.)
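As a possible alternative for the non-standard index case, the cut points can be taken as positions rather than labels, which would avoid the reset/set round trip. This is only a sketch under that assumption; the string labels and the names ts, positions and pieces are invented for the example.

```python
import numpy as np
import pandas as pd

normal_distance = pd.Timedelta(minutes=10)
ts = pd.Series(
    pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:10",
                    "2020-01-01 10:10", "2020-01-01 10:20"]),
    index=["a", "b", "c", "d"],  # hypothetical non-standard labels
)

# Positions (not labels) of the first element after each gap
positions = np.flatnonzero((ts.diff() > normal_distance).to_numpy())
bounds = [0, *positions.tolist(), len(ts)]
pieces = [ts.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
print([list(p.index) for p in pieces])  # [['a', 'b'], ['c', 'd']]
```

The original labels are preserved in each piece because iloc only selects rows and never touches the index.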
