A more efficient way to split time series data (pd.Series) at gaps?
I am trying to split a pd.Series of sorted dates wherever the gap between consecutive dates is bigger than the normal one. To do this, I calculated the size of the gaps with pd.Series.diff() and then iterated over all the elements in the series with a while loop. Unfortunately, this is quite computationally intensive. Is there a better way (apart from parallelization)?
Minimal example with my function:
import pandas as pd
import time

def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # list that should contain all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: append data[i] to the last sample in samples_list
            samples_list[-1] = samples_list[-1].append(pd.Series(data[i]))
        else:
            # larger gap: start a new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list
# create example data
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")
The real data has over 150k entries and takes several minutes to process... :/
Your code is a bit unclear regarding the method used to store these two different lists. Specifically, I'm not sure what the correct structure of sample_list is that you have in mind. Regardless, using Series.pct_change and np.unique() you should achieve approximately what you're looking for:
import numpy as np

uniques, indices = np.unique(data_with_samples.diff()[1:].pct_change(),
                             return_index=True)
Now indices points you to the start and the end of that anomalous gap.
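For instance, on a toy series with one larger gap (plain numbers standing in for timestamps, purely for illustration), the distinct pct_change values and their first positions look like this:

```python
import numpy as np
import pandas as pd

# Toy data: a regular 10-unit grid with one big jump (30 -> 630)
times = pd.Series([0, 10, 20, 30, 630, 640, 650])

# Relative change between consecutive gap sizes
pct = times.diff()[1:].pct_change()

uniques, indices = np.unique(pct, return_index=True)
print(uniques)  # distinct pct-change values; NaN sorts to the end
print(indices)  # first occurrence of each distinct value
```

The large positive value (here 59.0, at position 3) marks where the gap opens, and the negative value marks where the spacing drops back to normal.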
If your data can have more than one gap, then you'd want to use only diff()[1:].pct_change() and look for all values different from 0 using where().
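A sketch of that multi-gap case (again with plain numbers in place of timestamps):

```python
import pandas as pd

# Toy data with two larger gaps (20 -> 140 and 160 -> 400)
times = pd.Series([0, 10, 20, 140, 150, 160, 400, 410])

pct = times.diff()[1:].pct_change()

# Keep only positions where the gap size changed; the leading NaN is dropped too
changes = pct.where(pct != 0).dropna()
print(changes.index.tolist())  # positions where a gap opens or closes
```

Positive entries in changes mark where a gap starts; negative entries mark the return to the normal spacing.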
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
Use a tag column to separate the gap groups:
# start sampling
start_time = time.time()
df = data_with_samples.to_frame()
df['time_diff'] = df[0].diff().dt.seconds
cond = (df['time_diff'] > normal_distance.seconds) | (df['time_diff'].isnull())
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum()
my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])
print(f"Duration: {time.time() - start_time}")
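On a small toy series, the tag/cumsum/groupby approach splits as expected (the two date ranges below are illustrative assumptions, not the question's data):

```python
import numpy as np
import pandas as pd

normal = pd.Timedelta(minutes=10)

# 4 points on a 10-minute grid, then a ~9.5 hour gap, then 3 more points
s = pd.Series(pd.date_range('2020-01-01 00:00', periods=4, freq='10min')
              .append(pd.date_range('2020-01-01 10:00', periods=3, freq='10min')))

df = s.to_frame()
# note: .dt.total_seconds() would be safer than .dt.seconds for gaps over a day
df['time_diff'] = df[0].diff().dt.seconds
cond = (df['time_diff'] > normal.seconds) | (df['time_diff'].isnull())
# every gap boundary bumps the group id by 1, so groupby splits the series
df['tag'] = np.where(cond, 1, 0).cumsum()

groups = [g[0] for _, g in df.groupby('tag')]
print([len(g) for g in groups])  # → [4, 3]
```

The cumulative sum turns each oversized gap into a new group id, so the whole split happens in vectorized pandas code instead of a Python loop.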
I'm not sure I understand completely what you want, but I think this could work:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
idx collects the indices directly after a gap, and the rest just splits the series at these indices and packs the pieces into the list samples_list.
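A self-contained toy run of this splitting (the timestamps are chosen arbitrarily for illustration):

```python
import pandas as pd

normal_distance = pd.Timedelta(minutes=10)
s = pd.Series(pd.to_datetime(['2020-01-01 00:00', '2020-01-01 00:10',
                              '2020-01-01 00:20', '2020-01-01 08:00',
                              '2020-01-01 08:10']))

# positions directly after a gap larger than the normal distance
idx = s[s.diff(1) > normal_distance].index

samples_list = [s]
if len(idx) > 0:
    samples_list = ([s.iloc[:idx[0]]]
                    + [s.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [s.iloc[idx[-1]:]])
print([len(x) for x in samples_list])  # → [3, 2]
```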
If the index is non-standard, then you need some overhead (resetting the index and later setting it back to the original) to make sure that iloc can be used:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})
idx = data_with_samples.data[data_with_samples.data.diff(1) > normal_distance].index
data_with_samples.set_index('index', drop=True, inplace=True)
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
(You don't need that for your example.)