
Python Pandas: Detecting frequency of time series

Assume I have loaded time series data from SQL or CSV (not created in Python). The index looks like this:

DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
               '2015-03-02 02:00:00', '2015-03-02 03:00:00',
               '2015-03-02 04:00:00', '2015-03-02 05:00:00',
               '2015-03-02 06:00:00', '2015-03-02 07:00:00',
               '2015-03-02 08:00:00', '2015-03-02 09:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)

As you can see, 'freq' is None. How can I detect the frequency of this series and set 'freq' accordingly?

If possible, I would like this to work even when the data isn't continuous (there are plenty of gaps in the series).

I tried to find the mode of all the differences between consecutive timestamps, but I am not sure how to convert it into a format that a Series can read.

Maybe try taking the differences of the time index and using the mode (or the smallest difference) as the frequency.

import pandas as pd
import numpy as np

# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df

                        col
2015-03-02 01:00:00  2.0261
2015-03-02 04:00:00  1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00  0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00  1.8453
...                     ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00  1.1962
2015-07-19 15:00:00  1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00  0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00  0.5049
2015-07-19 23:00:00 -0.5349

[2000 rows x 1 columns]

# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()

01:00:00    1181
02:00:00     499
03:00:00     180
04:00:00      93
05:00:00      24
06:00:00      10
07:00:00       9
08:00:00       3
dtype: int64

# the mode can be considered as frequency
res.index[0]  # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min()  # output: Timedelta('0 days 01:00:00')




# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng

DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
               '2015-03-02 03:00:00', '2015-03-02 04:00:00',
               '2015-03-02 05:00:00', '2015-03-02 06:00:00',
               '2015-03-02 07:00:00', '2015-03-02 08:00:00',
               '2015-03-02 09:00:00', '2015-03-02 10:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', length=3359, freq='H', tz=None)
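To end up with an index that actually carries a 'freq', one option is to reindex the data onto the full range. The following is a minimal sketch rather than the answer author's code: the sample data is made up (an hourly grid with every fifth hour dropped), and the names full, keep, gaps, step, grid and regular are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one week of hourly stamps with every fifth hour missing.
full = pd.date_range("2015-03-02 00:00", "2015-03-08 23:00",
                     freq=pd.Timedelta(hours=1))
keep = np.arange(len(full)) % 5 != 3
df = pd.DataFrame({"col": np.arange(int(keep.sum()), dtype=float)},
                  index=full[keep])

# Most common gap between consecutive timestamps (the mode).
gaps = df.index.to_series().diff().dropna()
step = gaps.value_counts().idxmax()

# Build a regular grid with that step and align the data to it.
# Timestamps that were missing in df show up as NaN rows.
grid = pd.date_range(df.index[0], df.index[-1], freq=step)
regular = df.reindex(grid)

print(step)                # 0 days 01:00:00
print(regular.index.freq)  # an hourly offset instead of None
```

This assumes the index is sorted and that every timestamp falls on the detected grid. Rows that do not line up with the grid would be dropped by reindex.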

It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:

dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix = pd.DatetimeIndex(dt_ix.values)  # rebuild the index so that freq is None
dt_ix.inferred_freq
Out[2]: 'H'

or the pandas.infer_freq function:

pd.infer_freq(dt_ix)
Out[3]: 'H'

If the data is not continuous, pandas.infer_freq returns None. Similar to the approach suggested above, another alternative is the pandas.Series.diff method:

split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
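The two approaches can be combined: try pandas.infer_freq first and fall back to the smallest gap when it returns None. This is a hedged sketch; detect_step is a hypothetical helper, not a pandas function, and it assumes a sorted DatetimeIndex with at least three entries. The fallback uses pandas.tseries.frequencies.to_offset, which turns a Timedelta into an offset object that pandas accepts as a 'freq'.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def detect_step(index):
    """Hypothetical helper: return an offset for a sorted DatetimeIndex."""
    inferred = pd.infer_freq(index)  # None unless the index is perfectly regular
    if inferred is not None:
        return to_offset(inferred)
    # Fall back to the smallest gap between consecutive timestamps.
    smallest = index.to_series().diff().min()
    return to_offset(smallest)


# Illustrative data: ten days of hourly stamps, then the same with a 48-hour hole.
regular_ix = pd.date_range("2015-03-02", periods=240, freq=pd.Timedelta(hours=1))
gapped_ix = regular_ix.drop(regular_ix[100:148])

print(pd.infer_freq(gapped_ix))                               # None
print(detect_step(gapped_ix) == detect_step(regular_ix))      # True
```

The smallest gap is only a guess at the true step. If the data contains occasional duplicate or near-duplicate timestamps, the mode of the gaps may be a more robust choice.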

The minimum time difference can be found with

np.diff(df.index.values).min()

which is normally a timedelta64 in units of ns. To get a frequency in Hz (samples per second), assuming ns:

freq = 1e9 / np.diff(df.index.values).min().astype('int64')
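If you would rather not depend on the underlying unit being nanoseconds, the same rate can be computed from a pandas Timedelta. This is a small sketch on a made-up three-point index; idx, smallest and rate_hz are illustrative names.

```python
import pandas as pd

# Made-up index: hourly stamps with one missing hour.
idx = pd.DatetimeIndex(["2015-03-02 00:00", "2015-03-02 01:00", "2015-03-02 03:00"])

smallest = idx.to_series().diff().min()    # Timedelta('0 days 01:00:00')
rate_hz = 1.0 / smallest.total_seconds()   # samples per second

print(rate_hz)  # 1/3600, roughly 0.000278
```

Timedelta.total_seconds() handles the unit conversion, so no integer cast is needed.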
