A more efficient way to split time series data (pd.Series) at gaps?
I am trying to split a pd.Series of sorted dates wherever the gap between consecutive dates is bigger than the normal one. To do this, I calculated the size of the gaps with pd.Series.diff() and then iterated over all the elements in the series with a while loop. Unfortunately, this is quite computationally intensive. Is there a better way (apart from parallelization)?
Minimal example with my function:
import pandas as pd
import time

def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # list that should contain all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: append data[i] to the last sample in samples_list
            samples_list[-1] = samples_list[-1].append(pd.Series(data[i]))
        else:
            # larger gap: start a new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list
# create example data
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")
The real data has over 150k entries and takes several minutes to process... :/
Your code is a bit unclear regarding the method used to store these two different lists. Specifically, I'm not sure what the correct structure of sample_list is that you have in mind. Regardless, using Series.pct_change and np.unique() you should achieve approximately what you're looking for:
import numpy as np

uniques, indices = np.unique(data_with_samples.diff()[1:].pct_change(),
                             return_index=True)
Now indices points you to the start and the end of that anomalous gap.
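For instance, on a toy series with one larger gap (plain numbers standing in for timestamps, purely for illustration), the distinct pct_change values and their first positions look like this:

```python
import numpy as np
import pandas as pd

# Toy data: a regular 10-unit grid with one big jump (30 -> 630)
times = pd.Series([0, 10, 20, 30, 630, 640, 650])

# Relative change between consecutive gap sizes
pct = times.diff()[1:].pct_change()

uniques, indices = np.unique(pct, return_index=True)
print(uniques)  # distinct pct-change values; NaN sorts to the end
print(indices)  # first occurrence of each distinct value
```

The large positive value (here 59.0, at position 3) marks where the gap opens, and the negative value marks where the spacing drops back to normal.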
If your data can have more than one gap, then you'd want to use only diff()[1:].pct_change() and look for all values different from 0 using where().
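A sketch of that multi-gap case (again with plain numbers in place of timestamps):

```python
import pandas as pd

# Toy data with two larger gaps (20 -> 140 and 160 -> 400)
times = pd.Series([0, 10, 20, 140, 150, 160, 400, 410])

pct = times.diff()[1:].pct_change()

# Keep only positions where the gap size changed; the leading NaN is dropped too
changes = pct.where(pct != 0).dropna()
print(changes.index.tolist())  # positions where a gap opens or closes
```

Positive entries in changes mark where a gap starts; negative entries mark the return to the normal spacing.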
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
Use a tag column to separate the gap groups:
# start sampling
start_time = time.time()
df = data_with_samples.to_frame()
df['time_diff'] = df[0].diff().dt.seconds
cond = (df['time_diff'] > normal_distance.seconds) | (df['time_diff'].isnull())
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum()
my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])
print(f"Duration: {time.time() - start_time}")
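On a small toy series, the tag/cumsum/groupby approach splits as expected (the two date ranges below are illustrative assumptions, not the question's data):

```python
import numpy as np
import pandas as pd

normal = pd.Timedelta(minutes=10)

# 4 points on a 10-minute grid, then a ~9.5 hour gap, then 3 more points
s = pd.Series(pd.date_range('2020-01-01 00:00', periods=4, freq='10min')
              .append(pd.date_range('2020-01-01 10:00', periods=3, freq='10min')))

df = s.to_frame()
# note: .dt.total_seconds() would be safer than .dt.seconds for gaps over a day
df['time_diff'] = df[0].diff().dt.seconds
cond = (df['time_diff'] > normal.seconds) | (df['time_diff'].isnull())
# every gap boundary bumps the group id by 1, so groupby splits the series
df['tag'] = np.where(cond, 1, 0).cumsum()

groups = [g[0] for _, g in df.groupby('tag')]
print([len(g) for g in groups])  # → [4, 3]
```

The cumulative sum turns each oversized gap into a new group id, so the whole split happens in vectorized pandas code instead of a Python loop.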
I'm not sure I understand completely what you want, but I think this could work:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
idx collects the indices directly after a gap, and the rest just splits the series at these indices and packs the pieces into the list samples_list.
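A self-contained toy run of this splitting (the timestamps are chosen arbitrarily for illustration):

```python
import pandas as pd

normal_distance = pd.Timedelta(minutes=10)
s = pd.Series(pd.to_datetime(['2020-01-01 00:00', '2020-01-01 00:10',
                              '2020-01-01 00:20', '2020-01-01 08:00',
                              '2020-01-01 08:10']))

# positions directly after a gap larger than the normal distance
idx = s[s.diff(1) > normal_distance].index

samples_list = [s]
if len(idx) > 0:
    samples_list = ([s.iloc[:idx[0]]]
                    + [s.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [s.iloc[idx[-1]:]])
print([len(x) for x in samples_list])  # → [3, 2]
```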
If the index is non-standard, then you need some overhead (resetting the index and later setting it back to the original) to make sure that iloc can be used:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})
idx = data_with_samples.data[data_with_samples.data.diff(1) > normal_distance].index
data_with_samples.set_index('index', drop=True, inplace=True)
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
(You don't need that for your example.)