Selecting only date ranges in Pandas that have data every consecutive minute
I'm trying to process some data in pandas that looks like this in the CSV:
2014.01.02,08:56,1.37549,1.37552,1.37549,1.37552,3
2014.01.02,09:00,1.37562,1.37562,1.37545,1.37545,21
2014.01.02,09:01,1.37545,1.37550,1.37542,1.37546,18
2014.01.02,09:02,1.37546,1.37550,1.37546,1.37546,15
2014.01.02,09:03,1.37546,1.37563,1.37546,1.37559,39
2014.01.02,09:04,1.37559,1.37562,1.37555,1.37561,37
2014.01.02,09:05,1.37561,1.37564,1.37558,1.37561,35
2014.01.02,09:06,1.37561,1.37566,1.37558,1.37563,38
2014.01.02,09:07,1.37563,1.37567,1.37561,1.37566,42
2014.01.02,09:08,1.37570,1.37571,1.37564,1.37566,25
I imported it using:
raw_data = pd.read_csv('raw_data.csv', engine='c', header=None, index_col=0, names=['date', 'time', 'open', 'high', 'low', 'close', 'volume'], parse_dates=[[0,1]])
But now I want to extract some random (or even continuous) samples from the data, but only those where I have data for 5 consecutive minutes. For instance, the data starting at 2014.01.02,08:56 can't be used because it is followed by a gap, but the data starting at 2014.01.02,09:00 is fine because each of the next 5 minutes has data.
Any suggestions on how to accomplish this in an efficient way?
Here is one way: first use .asfreq('T') to fill the missing minutes with NaNs, then use rolling_apply to check whether the previous 5 or the next 5 observations contain no NaNs.
# populate NaNs at minutely freq
# ======================
df = raw_data.asfreq('T')
print(df)
open high low close volume
date_time
2014-01-02 08:56:00 1.3755 1.3755 1.3755 1.3755 3
2014-01-02 08:57:00 NaN NaN NaN NaN NaN
2014-01-02 08:58:00 NaN NaN NaN NaN NaN
2014-01-02 08:59:00 NaN NaN NaN NaN NaN
2014-01-02 09:00:00 1.3756 1.3756 1.3755 1.3755 21
2014-01-02 09:01:00 1.3755 1.3755 1.3754 1.3755 18
2014-01-02 09:02:00 1.3755 1.3755 1.3755 1.3755 15
2014-01-02 09:03:00 1.3755 1.3756 1.3755 1.3756 39
2014-01-02 09:04:00 1.3756 1.3756 1.3756 1.3756 37
2014-01-02 09:05:00 1.3756 1.3756 1.3756 1.3756 35
2014-01-02 09:06:00 1.3756 1.3757 1.3756 1.3756 38
2014-01-02 09:07:00 1.3756 1.3757 1.3756 1.3757 42
2014-01-02 09:08:00 1.3757 1.3757 1.3756 1.3757 25
consecutive_previous_5min = pd.rolling_apply(df['open'], 5, lambda g: np.isnan(g).any()) == 0
consecutive_previous_5min
date_time
2014-01-02 08:56:00 False
2014-01-02 08:57:00 False
2014-01-02 08:58:00 False
2014-01-02 08:59:00 False
2014-01-02 09:00:00 False
2014-01-02 09:01:00 False
2014-01-02 09:02:00 False
2014-01-02 09:03:00 False
2014-01-02 09:04:00 True
2014-01-02 09:05:00 True
2014-01-02 09:06:00 True
2014-01-02 09:07:00 True
2014-01-02 09:08:00 True
Freq: T, dtype: bool
# use the reverse trick to get the next 5 values
consecutive_next_5min = (pd.rolling_apply(df['open'][::-1], 5, lambda g: np.isnan(g).any()) == 0)[::-1]
consecutive_next_5min
date_time
2014-01-02 08:56:00 False
2014-01-02 08:57:00 False
2014-01-02 08:58:00 False
2014-01-02 08:59:00 False
2014-01-02 09:00:00 True
2014-01-02 09:01:00 True
2014-01-02 09:02:00 True
2014-01-02 09:03:00 True
2014-01-02 09:04:00 True
2014-01-02 09:05:00 False
2014-01-02 09:06:00 False
2014-01-02 09:07:00 False
2014-01-02 09:08:00 False
Freq: T, dtype: bool
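The two masks above rely on pd.rolling_apply, which only exists in older pandas releases. As a rough sketch of the same idea for newer versions, assuming the Series.rolling API is available, the block below rebuilds the 'open' column from the question and computes both masks. The names open_, present, prev5 and next5 are illustrative only, and the forward-looking mask is obtained by shifting the backward one rather than by the reverse trick.

```python
import pandas as pd

# Toy copy of the 'open' column on the question's timestamps.
idx = pd.to_datetime(
    ["2014-01-02 08:56"] + [f"2014-01-02 09:0{m}" for m in range(9)]
)
open_ = pd.Series(
    [1.37549, 1.37562, 1.37545, 1.37546, 1.37546,
     1.37559, 1.37561, 1.37561, 1.37563, 1.37570],
    index=idx,
)

# Insert NaN rows for the missing minutes ('min' is the newer alias for 'T').
s = open_.asfreq("min")

# 1 where a minute has data, 0 where it was filled with NaN.
present = s.notna().astype(int)

# True where the current minute and the 4 minutes before it all have data.
# The first 4 rolling sums are NaN, and NaN == 5 evaluates to False.
prev5 = present.rolling(5).sum() == 5

# The window [t, t+4] is complete exactly when prev5 is True at t+4,
# so shifting prev5 back by 4 rows gives the forward-looking mask.
next5 = prev5.shift(-4, fill_value=False)

# Rows that belong to a complete window on either side.
keep = s[prev5 | next5]
print(keep)
```

With the sample data, prev5 and next5 should agree with the two boolean Series printed above.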
# keep rows where either the previous 5 or the next 5 observations are all non-null
df.loc[consecutive_next_5min | consecutive_previous_5min]
open high low close volume
date_time
2014-01-02 09:00:00 1.3756 1.3756 1.3755 1.3755 21
2014-01-02 09:01:00 1.3755 1.3755 1.3754 1.3755 18
2014-01-02 09:02:00 1.3755 1.3755 1.3755 1.3755 15
2014-01-02 09:03:00 1.3755 1.3756 1.3755 1.3756 39
2014-01-02 09:04:00 1.3756 1.3756 1.3756 1.3756 37
2014-01-02 09:05:00 1.3756 1.3756 1.3756 1.3756 35
2014-01-02 09:06:00 1.3756 1.3757 1.3756 1.3756 38
2014-01-02 09:07:00 1.3756 1.3757 1.3756 1.3757 42
2014-01-02 09:08:00 1.3757 1.3757 1.3756 1.3757 25
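Since the question asks for random samples, one possible follow-up is to treat every timestamp whose next 5 minutes are complete as a valid window start and draw starts at random. This is only a sketch under assumptions: it uses the newer Series.rolling API instead of pd.rolling_apply, a two-column toy frame built from the question's data, and hypothetical names (WINDOW, valid_start, starts, samples) with an arbitrary seed.

```python
import numpy as np
import pandas as pd

WINDOW = 5  # number of consecutive minutes required per sample

# Toy frame with the 'close' and 'volume' columns from the question.
idx = pd.to_datetime(
    ["2014-01-02 08:56"] + [f"2014-01-02 09:0{m}" for m in range(9)]
)
df = pd.DataFrame(
    {
        "close": [1.37552, 1.37545, 1.37546, 1.37546, 1.37559,
                  1.37561, 1.37561, 1.37563, 1.37566, 1.37566],
        "volume": [3, 21, 18, 15, 39, 37, 35, 38, 42, 25],
    },
    index=idx,
)

# Fill the missing minutes with NaN rows.
full = df.asfreq("min")
present = full["close"].notna().astype(int)

# A timestamp is a valid start if it and the next WINDOW-1 minutes have data.
complete_back = present.rolling(WINDOW).sum() == WINDOW
valid_start = complete_back.shift(-(WINDOW - 1), fill_value=False)
starts = full.index[valid_start.to_numpy()]

# Draw two distinct window starts at random (seed chosen arbitrarily).
rng = np.random.default_rng(0)
chosen = rng.choice(len(starts), size=2, replace=False)

# .loc slicing on a DatetimeIndex includes both endpoints, so each
# slice spans exactly WINDOW rows.
samples = [
    full.loc[starts[int(i)] : starts[int(i)] + pd.Timedelta(minutes=WINDOW - 1)]
    for i in chosen
]

for sample in samples:
    print(sample, end="\n\n")
```

The sampled windows may overlap each other; if non-overlapping samples are needed, the chosen starts would have to be filtered further.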