
How to check if one Pandas time-series is present in another long time-series?

I have two very long time series. I have to check whether Series B is present (in the given order) in Series A.

Series A: 1, 2, 3, 4, 5, 6, 5, 4, 3
Series B: 3, 4, 5

Result: True, together with the index at which the first element of the smaller series is found. Here the index is 2 (since 3 is at index 2 in Series A).

Note: The two series are quite big. Let's say A contains 50000 elements and B contains 350.

A very slow solution is to convert both series to lists and check whether the smaller list appears, in order, as a contiguous slice of the main list:

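def is_series_a_subseries_in_order(main, sub):
    n = len(sub)
    # Plain lists make the slice comparison below straightforward
    main = main.tolist()
    sub = sub.tolist()
    # True if any window of length n in main equals sub
    return any(main[i:i + n] == sub for i in range(len(main) - n + 1))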

This returns True or False.

A naive approach is to look for B(1) in A. In your example B(1) = A(3), so you then check whether B(2) = A(4), and so on until the end of the sub-sequence. If a comparison fails, restart the search from A(4) and continue until the end of A.
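The naive scan described above can be sketched roughly as follows. This is only an illustration: the function name find_subsequence_naive is made up for this example, positions are 0-based (unlike the 1-based B(1)/A(3) notation above), and the inputs are assumed to be plain lists, for instance obtained with Series.tolist().

```python
def find_subsequence_naive(main, sub):
    """Return the first 0-based position where `sub` occurs contiguously in `main`, or -1."""
    n, m = len(main), len(sub)
    if m == 0:
        return 0  # assumption: an empty pattern matches at position 0
    for start in range(n - m + 1):
        # Compare element by element and give up on this start at the first mismatch.
        for offset in range(m):
            if main[start + offset] != sub[offset]:
                break
        else:
            # The inner loop finished without a mismatch: full match found.
            return start
    return -1


series_a = [1, 2, 3, 4, 5, 6, 5, 4, 3]
series_b = [3, 4, 5]
first_index = find_subsequence_naive(series_a, series_b)
print(first_index)  # 2
```

In the worst case this compares every element of B at every start position in A, which is why it gets slow for long inputs.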

A better way to search for a substring is to apply the Knuth-Morris-Pratt algorithm. I'll let you search for more information about it!
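For reference, a minimal sketch of a Knuth-Morris-Pratt search over plain Python sequences might look like the following. The names build_failure_table and kmp_find are illustrative, not part of pandas or of the answer above, and the inputs are assumed to be lists (for example from Series.tolist()). Values that never compare equal to themselves, such as NaN, will never match.

```python
def build_failure_table(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(main, sub):
    """Return the first 0-based position of `sub` inside `main`, or -1."""
    if len(sub) == 0:
        return 0  # assumption: an empty pattern matches at position 0
    failure = build_failure_table(sub)
    k = 0  # number of pattern elements matched so far
    for i, value in enumerate(main):
        while k > 0 and value != sub[k]:
            k = failure[k - 1]  # reuse the partial match instead of restarting
        if value == sub[k]:
            k += 1
        if k == len(sub):
            return i - k + 1
    return -1


print(kmp_find([1, 2, 3, 4, 5, 6, 5, 4, 3], [3, 4, 5]))  # 2
```

Unlike the naive scan, each element of the main sequence is visited once, so the running time grows with len(main) + len(sub) rather than with their product.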

Unfortunately, the rolling method of pandas cannot be used as an iterator, even though an implementation is planned in #11704.
We therefore have to implement a rolling window for the sub-series check ourselves.

import pandas as pd

ser_a = pd.Series(data=[1, 2, 3, 4, 5, 6, 5, 4, 3])
ser_b = pd.Series(data=[3, 4, 5])

# One column per start offset: ser_a shifted by i, truncated to len(ser_b)
slider_df = pd.concat(
    [ser_a.shift(-i)[:ser_b.size] for i in range(ser_a.size - ser_b.size + 1)],
    axis=1).astype(ser_a.dtype).T
# After the transpose, row i is the window starting at offset i
sub_series = (ser_b == slider_df).all(axis=1)
# if you want, you can extract only the indices where a subseries was found:
sub_series_startindex = sub_series.index[sub_series]

What I am doing here:

  • [ser_a.shift(-i)[:ser_b.size] for i in range(ser_a.size - ser_b.size + 1)] : Create a "rolling window" by shifting ser_a by an increasing offset and truncating each shifted copy to the size of the sub-series ser_b to check for. Since shifts near the end would yield NaN, they are excluded by the range.
  • pd.concat(..., axis=1) : Concatenate the shifted Series, so that slider_df contains all shifts in its columns.
  • .astype(ser_a.dtype) : strictly optional. For large Series this may improve performance; for small Series it may degrade performance.
  • .T : transpose the DataFrame, so that the sub-series are aligned along axis 0 (one candidate sub-series per row).
  • sub_series = (ser_b == slider_df).all(axis=1) : Find the rows where ser_b matches the sub-series.
  • sub_series.index[sub_series] : extract the indices where a matching sub-series was found.
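As a possible alternative to building the shifted DataFrame by hand, the same windowed comparison can be sketched with NumPy's sliding_window_view, which is available in NumPy 1.20 and later. This is a swapped-in technique, not the method of the answer above. The helper name find_subseries_starts is hypothetical, and it returns positional offsets rather than index labels (the two coincide for a default RangeIndex).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20


def find_subseries_starts(main, sub):
    """Return the 0-based start positions where `sub` occurs contiguously in `main`."""
    # np.asarray also accepts a pandas Series, so ser_a and ser_b can be passed directly.
    main_arr = np.asarray(main)
    sub_arr = np.asarray(sub)
    if sub_arr.size == 0 or sub_arr.size > main_arr.size:
        return np.array([], dtype=int)
    # Each row of `windows` is a view of len(sub) consecutive elements of main.
    windows = sliding_window_view(main_arr, sub_arr.size)
    # A row matches when every element equals the corresponding element of sub.
    matches = (windows == sub_arr).all(axis=1)
    return np.flatnonzero(matches)


starts = find_subseries_starts([1, 2, 3, 4, 5, 6, 5, 4, 3], [3, 4, 5])
print(starts.tolist())  # [2]
```

The windows are views on the original data, so only the boolean comparison result is materialised. For 50000 and 350 elements that is roughly 50000 x 350 booleans.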
