Python Pandas: detecting the frequency of a time series

Question

Assume I have loaded time series data from SQL or CSV (not created in Python). The index would be:

DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
               '2015-03-02 02:00:00', '2015-03-02 03:00:00',
               '2015-03-02 04:00:00', '2015-03-02 05:00:00',
               '2015-03-02 06:00:00', '2015-03-02 07:00:00',
               '2015-03-02 08:00:00', '2015-03-02 09:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)

As you can see, 'freq' is None. I am wondering how I can detect the frequency of this series and set 'freq' to it.

If possible, I would like this to work even when the data isn't continuous (there are plenty of breaks in the series).

I tried finding the mode of all the differences between consecutive timestamps, but I am not sure how to convert it into a format that is readable by Series.

Answer 1

Maybe try taking the differences of the time index and using the mode (or the smallest difference) as the freq.

import pandas as pd
import numpy as np

# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df

                        col
2015-03-02 01:00:00  2.0261
2015-03-02 04:00:00  1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00  0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00  1.8453
...                     ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00  1.1962
2015-07-19 15:00:00  1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00  0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00  0.5049
2015-07-19 23:00:00 -0.5349

[2000 rows x 1 columns]

# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()

01:00:00    1181
02:00:00     499
03:00:00     180
04:00:00      93
05:00:00      24
06:00:00      10
07:00:00       9
08:00:00       3
dtype: int64

# the mode can be considered as frequency
res.index[0]  # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min()  # output: Timedelta('0 days 01:00:00')




# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng

DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
               '2015-03-02 03:00:00', '2015-03-02 04:00:00',
               '2015-03-02 05:00:00', '2015-03-02 06:00:00',
               '2015-03-02 07:00:00', '2015-03-02 08:00:00',
               '2015-03-02 09:00:00', '2015-03-02 10:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', length=3359, freq='H', tz=None)
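
To actually end up with a 'freq' on the index when the data has gaps, one option is to reindex the frame onto the full range built above, so that the missing timestamps become NaN rows. The following is a minimal sketch, assuming the most common gap is the true sampling interval; the names `step` and `regular` are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

# Rebuild a gappy hourly series similar to the simulated data above.
rng = np.random.default_rng(0)
full = pd.date_range("2015-03-02 00:00:00", "2015-07-19 23:00:00",
                     freq=pd.Timedelta(hours=1))
keep = np.sort(rng.choice(len(full), size=2000, replace=False))
df = pd.DataFrame({"col": rng.standard_normal(2000)}, index=full[keep])

# Most common gap between consecutive timestamps (assumed to be the step).
step = df.index.to_series().diff().mode()[0]

# Regular grid from first to last observation, then reindex onto it.
# Timestamps that were missing in df show up as NaN rows.
full_rng = pd.date_range(df.index[0], df.index[-1], freq=step)
regular = df.reindex(full_rng)

print(step)
print(regular.index.freq)
print(len(df), len(regular))
```

Whether NaN rows are acceptable depends on the use case; if they are not, the detected step can simply be stored separately rather than forced onto the index.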

Answer 2

It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:

dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix = pd.DatetimeIndex(dt_ix.values)  # rebuild from raw values so that freq is None
dt_ix.inferred_freq
Out[2]: 'H'

or the pandas.infer_freq function:

pd.infer_freq(dt_ix)
Out[3]: 'H'

If the data is not continuous, pandas.infer_freq returns None. Similar to what has already been proposed, another alternative is the pandas.Series.diff method:

split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
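
The two approaches in this answer can be combined: try pandas.infer_freq first and fall back to the gap statistics only when it returns None. The sketch below assumes the most common gap is a reasonable stand-in for the frequency; `detect_freq` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def detect_freq(index):
    """Return a DateOffset for `index`, or None if nothing can be determined.

    Hypothetical helper: tries pd.infer_freq first, then falls back to the
    most common gap between consecutive timestamps.
    """
    # pd.infer_freq needs at least 3 timestamps, otherwise it raises.
    if len(index) >= 3:
        inferred = pd.infer_freq(index)
        if inferred is not None:
            return to_offset(inferred)
    gaps = index.to_series().diff().dropna()
    if gaps.empty:
        return None
    return to_offset(gaps.mode()[0])


# Hourly index with roughly one month removed, as in the answer above.
dt_ix = pd.date_range("2015-03-02 00:00:00", "2015-07-19 23:00:00",
                      freq=pd.Timedelta(hours=1))
split_ix = dt_ix.drop(pd.date_range("2015-05-01 00:00:00", "2015-05-30 00:00:00",
                                    freq=pd.Timedelta(hours=1)))

print(pd.infer_freq(split_ix))  # None, because of the gap
print(detect_freq(split_ix))
```

Using the mode rather than the minimum makes the fallback less sensitive to a single unusually short gap, at the cost of being wrong when most of the data is missing.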

Answer 3

The minimum time difference can be found with

np.diff(df.index.values).min()

which is normally in units of ns. To get a frequency in Hz (samples per second), assuming ns:

freq = 1e9 / np.diff(df.index.values).min().astype('int64')
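
If the nanosecond assumption is a concern, the numpy result can be wrapped in a pandas Timedelta, which carries its own unit. This is a small sketch on a made-up three-point index; `min_gap`, `step` and `hz` are illustrative names.

```python
import numpy as np
import pandas as pd

# Tiny example index with an irregular gap.
index = pd.DatetimeIndex(["2015-03-02 00:00", "2015-03-02 01:00", "2015-03-02 04:00"])

min_gap = np.diff(index.values).min()   # numpy.timedelta64
step = pd.Timedelta(min_gap)            # unit-aware pandas Timedelta
hz = 1.0 / step.total_seconds()         # samples per second

print(step, hz)
```

The `step` value can be passed directly as `freq` to pd.date_range, while `hz` is the same quantity expressed as a rate.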

3 answers

Answer 1
5, accepted, 2015-07-20 13:40:24

Answer 2
5, 2017-05-14 12:31:06

Answer 3
3, 2015-07-20 14:39:00
