Running average/frequency of time series data?

Question

Given a data set like:

[2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 65, 75, 85, 86, 87, 88]

The values are always increasing (they are in fact timestamps), and I want to find a running average of the distance between consecutive values. In effect, I am trying to determine when the data goes from "1 every second" to "1 every 5 seconds" (or any other rate).

I am implementing this in Python, but a solution in any language is most welcome.

The output I am looking for, given the sample input above, would be something like:

[(2, 1), (10, 5), (55, 10), (85, 1) ]

where "2" indicates where the distance between values started out being "1", and "10" indicates where the distance shifted to "5". (It would not have to be exactly there; if the shift were detected a step later, it wouldn't matter.)

I am looking for the points where the average distance between values changes. I realize there will be some kind of trade-off between the stability of the algorithm and its sensitivity to changes in the input.

(Is Pandas or NumPy useful for this, by the way?)
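The stability/sensitivity trade-off mentioned in the question can be sketched with a rolling median over the most recent gaps. This is only an illustration under stated assumptions, not something taken from the answers: the function name `gap_changes` and the `window` parameter are made up for the example. A window of 1 reacts immediately; a larger window ignores isolated outliers but reports each shift a step later.

```python
from collections import deque
from statistics import median

def gap_changes(times, window=3):
    """Return (time, smoothed_gap) each time the rolling median gap changes.

    `window` is the number of most recent gaps that are smoothed together
    (a hypothetical tuning knob for this sketch).
    """
    recent = deque(maxlen=window)  # holds the last `window` gaps
    changes = []
    current = None                 # smoothed gap of the current regime
    for prev, nxt in zip(times, times[1:]):
        recent.append(nxt - prev)
        smoothed = median(recent)
        if smoothed != current:
            changes.append((prev, smoothed))
            current = smoothed
    return changes

times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35,
         40, 45, 50, 55, 65, 75, 85, 86, 87, 88]

print(gap_changes(times, window=1))  # most sensitive: exact change points
print(gap_changes(times, window=3))  # more stable: detected one step later
```

With `window=1` the sample input yields the exact change points from the question; with `window=3` each shift is reported one value later, which the question says is acceptable.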

Answer 1

You could use numpy or pandas like so (the "pandas version", assuming `import pandas as pd`):

In [256]: s = pd.Series([2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35,
                             40, 45, 50, 55, 65, 75, 85, 86, 87, 88])

In [257]: df = pd.DataFrame({'time': s,
                             'time_diff': s.diff().shift(-1)}).set_index('time')

In [258]: df[df.time_diff - df.time_diff.shift(1) != 0].dropna()
Out[258]: 
      time_diff
time           
2             1
10            5
55           10
85            1

If you only want to look at the first occurrence of every time step, you could also use:

In [259]: df.drop_duplicates().dropna() # pass keep='last' if you want the last (take_last=True in old pandas)
Out[259]: 
      time_diff
time           
2             1
10            5
55           10

However, with pandas you would normally use a DatetimeIndex to take advantage of the built-in time series functionality (the session below also assumes `import datetime` and `import numpy as np`):

In [44]: a = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35,
              40, 45, 50, 55, 65, 75, 85, 86, 87, 88]

In [45]: start_time = datetime.datetime.now()

In [46]: times = [start_time + datetime.timedelta(seconds=int(x)) for x in a]

In [47]: idx = pd.DatetimeIndex(times)

In [48]: df = pd.DataFrame({'data1': np.random.rand(idx.size), 
                            'data2': np.random.rand(idx.size)},
                           index=idx)

In [49]: df.resample('5s').mean() # resample to 5-second bins, averaging each bin
Out[49]: 
                        data1     data2
2012-11-28 07:36:35  0.417282  0.477837
2012-11-28 07:36:40  0.536367  0.451494
2012-11-28 07:36:45  0.902018  0.457873
2012-11-28 07:36:50  0.452151  0.625526
2012-11-28 07:36:55  0.816028  0.170319
2012-11-28 07:37:00  0.169264  0.723092
2012-11-28 07:37:05  0.809279  0.794459
2012-11-28 07:37:10  0.652836  0.615056
2012-11-28 07:37:15  0.508318  0.147178
2012-11-28 07:37:20  0.261157  0.509014
2012-11-28 07:37:25  0.609685  0.324375
2012-11-28 07:37:30       NaN       NaN
2012-11-28 07:37:35  0.736370  0.551477
2012-11-28 07:37:40       NaN       NaN
2012-11-28 07:37:45  0.839960  0.118619
2012-11-28 07:37:50       NaN       NaN
2012-11-28 07:37:55  0.697292  0.394946
2012-11-28 07:38:00  0.351824  0.420454

From my point of view, pandas is by far the best library available in the Python ecosystem for working with time series. I'm not sure what you really want to do, but I would give pandas a try.

Answer 2

a = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 34, 40, 45, 46, 50, 55]

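# start with the first value and the first gap
ans = [(a[0], a[1]-a[0])]
for i in range(1, len(a)-1):
    # record a[i] whenever the gap after it differs from the gap before it
    if (a[i+1] - a[i]) - (a[i] - a[i-1]) != 0:
        ans.append((a[i], a[i+1]-a[i]))

print(ans)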

Output:

[(2, 1), (10, 5), (30, 4), (34, 6), (40, 5), (45, 1), (46, 4), (50, 5)]

Is this what you want?
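The output above reacts to every one-off jitter in the gaps (30, 34, 40, ...). If that is too sensitive, one possible variation is to report a change only when the gap moves more than some tolerance away from the gap of the current regime. This is a sketch under assumptions: the function name `changes_with_tolerance` and the `tol` parameter are hypothetical, and what counts as a reasonable tolerance depends on the data.

```python
def changes_with_tolerance(a, tol=1):
    """Report (value, gap) only when the gap differs from the last
    reported gap by more than `tol` (a hypothetical threshold)."""
    ans = [(a[0], a[1] - a[0])]
    for x, y in zip(a[1:], a[2:]):
        gap = y - x
        # compare against the gap of the regime we are currently in
        if abs(gap - ans[-1][1]) > tol:
            ans.append((x, gap))
    return ans

a = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 34, 40, 45, 46, 50, 55]
print(changes_with_tolerance(a, tol=1))
# [(2, 1), (10, 5), (45, 1), (46, 4)]
```

With `tol=0` this reduces to the exact comparison used above and gives the same eight change points.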

Answer 3

How about this generator:

L = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 34, 40, 45, 46, 50, 55]

def differences_gen(L, differences):
    previous = L[0]
    differences = iter(differences + [None])
    next_diff = next(differences)
    for n in L[1:]:
        current_diff = n - previous
        while next_diff is not None and current_diff >= next_diff:
            yield (previous, next_diff)
            next_diff = next(differences)
        previous = n

list(differences_gen(L, [1,5]))
# [(2, 1), (10, 5)]

There's probably a cleaner way to iterate over the partition, but using a generator should keep it efficient for longer L and differences.

Answer 4

In Python:

a = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 34, 40, 45, 46, 50, 55]
# zip() creates tuples of two consecutive values 
# (it zips lists of different length by truncating longer list(s))
# then tuples with first value and difference are placed in 'diff' list
diff = [(x, y-x) for x, y in zip(a, a[1:])]
# now pick only elements with changed difference 
result = []
for pair in diff:
    if not len(result) or result[-1][1]!=pair[1]: # -1 to take last element
        result.append(pair)
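The "pick only elements with changed difference" step can also be sketched with `itertools.groupby`, which groups consecutive pairs sharing the same difference. This is an alternative phrasing of the same idea, not part of the original answer; it keeps the first pair of each run.

```python
from itertools import groupby

a = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 34, 40, 45, 46, 50, 55]

# (value, difference to the next value) for every consecutive pair
diff = [(x, y - x) for x, y in zip(a, a[1:])]

# groupby() yields one group per run of equal differences;
# next(group) takes the first (value, difference) pair of that run
result = [next(group) for _, group in groupby(diff, key=lambda pair: pair[1])]

print(result)
# [(2, 1), (10, 5), (30, 4), (34, 6), (40, 5), (45, 1), (46, 4), (50, 5)]
```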

Answer 5

I'm a fan of using a window function via islice; it's very useful and I find myself reusing it a lot:

from itertools import islice

def window(seq, n=2):
    """
    Returns a sliding window (of width n) over data from the iterable
    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

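# Main code (a is the same sample list used in the previous answers):
a = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 34, 40, 45, 46, 50, 55]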
last_diff = None
results = []
for v1, v2 in window(a):
    diff = abs(v1 - v2)
    if diff != last_diff:
        results.append((v1, diff))
    last_diff = diff

Result:

[(2, 1), (10, 5), (30, 4), (34, 6), (40, 5), (45, 1), (46, 4), (50, 5)]


5 answers

Answer 1
3 2012-11-27 12:13:10

Answer 2
2 2012-11-27 10:03:44

Answer 3
2 2012-11-27 10:06:26

Answer 4
2 Accepted 2012-11-27 10:24:32

Answer 5
2 2012-11-27 11:14:09
