
Timeseries Resampling

I have a dataset of the following form: dropbox download (23kb csv).

The sample rate of the data varies from second to second, from 0 Hz to over 200 Hz in some cases; the highest rate in the data set provided is about 50 samples per second.

When samples are taken, they are always spread evenly across the second, for example:

time                   x
2012-12-06 21:12:40    128.75909883327378
2012-12-06 21:12:40     32.799224301545976
2012-12-06 21:12:40     98.932953779777989
2012-12-06 21:12:43    132.07033814856786
2012-12-06 21:12:43    132.07033814856786
2012-12-06 21:12:43     65.71691352191452
2012-12-06 21:12:44    117.1350194748169
2012-12-06 21:12:45     13.095622561808861
2012-12-06 21:12:47     61.295242676059246
2012-12-06 21:12:48     94.774064119961352
2012-12-06 21:12:49     80.169378222553533
2012-12-06 21:12:49     80.291142695702533
2012-12-06 21:12:49    136.55650749231367
2012-12-06 21:12:49    127.29790925838365

should be

time                        x
2012-12-06 21:12:40 000ms   128.75909883327378
2012-12-06 21:12:40 333ms    32.799224301545976
2012-12-06 21:12:40 666ms    98.932953779777989
2012-12-06 21:12:43 000ms   132.07033814856786
2012-12-06 21:12:43 333ms   132.07033814856786
2012-12-06 21:12:43 666ms    65.71691352191452
2012-12-06 21:12:44 000ms   117.1350194748169
2012-12-06 21:12:45 000ms    13.095622561808861
2012-12-06 21:12:47 000ms    61.295242676059246
2012-12-06 21:12:48 000ms    94.774064119961352
2012-12-06 21:12:49 000ms    80.169378222553533
2012-12-06 21:12:49 250ms    80.291142695702533
2012-12-06 21:12:49 500ms   136.55650749231367
2012-12-06 21:12:49 750ms   127.29790925838365

Is there an easy way to do this with the pandas timeseries resampling function, or is there something built into numpy or scipy that will work?

I don't think there is a built-in pandas or numpy method/function to do this.

However, I would favour using a Python generator:

def repeats(lst):
    i_0 = None
    n = -1 # will still work if lst starts with None
    for i in lst:
        if i == i_0:
            n += 1
        else:
            n = 0
        yield n
        i_0 = i
# list(repeats([1,1,1,2,2,3])) == [0,1,2,0,1,0]

Then you can put this generator into a numpy array:

import numpy as np
df['rep'] = np.array(list(repeats(df['time'])))

Count up the repeats:

from collections import Counter
count = Counter(df['time'])
df['count'] = df['time'].apply(lambda x: count[x])
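As an aside, the same count column can be built with pandas alone, via value_counts and map — a minimal sketch on toy data (not from the original answer):

```python
import pandas as pd

# Toy data: two samples sharing a timestamp, one alone.
df = pd.DataFrame({'time': ['21:12:40', '21:12:40', '21:12:43']})

# value_counts gives each group's size; map broadcasts it back per row.
df['count'] = df['time'].map(df['time'].value_counts())
```

This avoids the extra collections.Counter pass, at the cost of one more pandas call.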

and do the calculation (this is the most expensive part):

import datetime

df['time2'] = df.apply(lambda row: (row['time']
                                    + datetime.timedelta(0, 1)  # 1s
                                    * row['rep']
                                    / row['count']),
                       axis=1)

Note: to remove the calculation columns, use del df['rep'] and del df['count'].
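Putting the three steps together, here is a self-contained sketch of the whole approach on a few rows from the question (values abbreviated, the missing datetime import added, and the division done on plain ints to keep the arithmetic in datetime.timedelta):

```python
import datetime
from collections import Counter

import numpy as np
import pandas as pd

def repeats(lst):
    # Yield 0, 1, 2, ... for each run of equal consecutive values.
    i_0 = None
    n = -1
    for i in lst:
        n = n + 1 if i == i_0 else 0
        yield n
        i_0 = i

df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-06 21:12:40'] * 3
                           + ['2012-12-06 21:12:44']),
    'x': [128.759, 32.799, 98.933, 117.135],
})
df['rep'] = np.array(list(repeats(df['time'])))
count = Counter(df['time'])
df['count'] = df['time'].apply(lambda t: count[t])
df['time2'] = df.apply(
    lambda row: row['time']
    + datetime.timedelta(seconds=1) * int(row['rep']) / int(row['count']),
    axis=1)
del df['rep'], df['count']
# The three 21:12:40 rows land at 000ms, 333ms and 666ms.
```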


One "built-in" way to accomplish this might be to use shift twice, but I think that is going to be somewhat messier...

I found this an excellent use case for pandas' groupby mechanism, so I wanted to provide a solution for this as well. I find it slightly more readable than Andy's solution, but it's actually not that much shorter.

# First, get your data into a dataframe after having copied 
# it with the mouse into a multi-line string:

import pandas as pd
from io import StringIO

s = """2012-12-06 21:12:40    128.75909883327378
2012-12-06 21:12:40     32.799224301545976
2012-12-06 21:12:40     98.932953779777989
2012-12-06 21:12:43    132.07033814856786
2012-12-06 21:12:43    132.07033814856786
2012-12-06 21:12:43     65.71691352191452
2012-12-06 21:12:44    117.1350194748169
2012-12-06 21:12:45     13.095622561808861
2012-12-06 21:12:47     61.295242676059246
2012-12-06 21:12:48     94.774064119961352
2012-12-06 21:12:49     80.169378222553533
2012-12-06 21:12:49     80.291142695702533
2012-12-06 21:12:49    136.55650749231367
2012-12-06 21:12:49    127.29790925838365"""

sio = StringIO(s)
df = pd.read_csv(sio, parse_dates=[[0, 1]], sep=r'\s+', header=None)
df = df.set_index('0_1')
df.index.name = 'time'
df.columns = ['x']

So far, this was only data preparation, so if you want to compare the length of the solutions, do it from now on! ;)

# Now, groupby the same time indices:

grouped = df.groupby(df.index)

# Create yourself a second object
from datetime import timedelta
second = timedelta(seconds=1)

# loop over group elements, catch new index parts in list
l = []
for _,group in grouped:
    size = len(group)
    if size == 1:
        # go to pydatetime for later addition, so that list is all in 1 format
        l.append(group.index.to_pydatetime())
    else:
        offsets = [i * second / size for i in range(size)]
        l.append(group.index.to_pydatetime() + offsets)

# exchange index for new index
import numpy as np
df.index = pd.DatetimeIndex(np.concatenate(l))
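For completeness: the per-group offsets can also be computed without the explicit Python loop, using groupby(...).cumcount() and transform('size'). This is a vectorized variant of the same idea, not taken from either answer:

```python
import pandas as pd

# Toy frame with a duplicated index, as produced by the parsing step above.
df = pd.DataFrame(
    {'x': [128.759, 32.799, 98.933, 117.135]},
    index=pd.to_datetime(['2012-12-06 21:12:40'] * 3
                         + ['2012-12-06 21:12:44']))

g = df.groupby(level=0)
rep = g.cumcount()                  # 0, 1, 2, ... within each timestamp
size = g['x'].transform('size')    # how many rows share that timestamp
df.index = df.index + pd.to_timedelta((rep / size).to_numpy(), unit='s')
```

After this, df.index.is_unique is True and duplicated timestamps are spread evenly across their second.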
