
Pandas.DataFrame slicing with multiple date ranges

I have a datetime-indexed DataFrame object with 100,000+ rows. I was wondering whether pandas offers a convenient way to get the subset of this DataFrame that falls within multiple date ranges.

For example, let us say that we have two date ranges:

(datetime.datetime(2016,6,27,0,0,0), datetime.datetime(2016,6,27,5,0,0))

and

(datetime.datetime(2016,6,27,15,0,0), datetime.datetime(2016,6,27,23,59,59))

Let us say we want to get all rows of a DataFrame object that are in either the first date range or the second date range, where the DataFrame has a row for every second from 2016-06-27 00:00:00 to 2016-06-27 23:59:59. Is there an easy way to do this in pandas?

There are two main ways to slice a DataFrame with a DatetimeIndex by date.

  • by slices: df.loc[start:end]. If there are multiple date ranges, the single slices may be concatenated with pd.concat.

  • by boolean selection mask: df.loc[mask]


Using pd.concat and slices:

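import numpy as np
import pandas as pd
np.random.seed(2016)

# Sample frame: 100 rows of random digits, one row every 45 minutes.
# '45min' replaces the alias '45T', which recent pandas releases deprecate.
N = 10**2
df = pd.DataFrame(np.random.randint(10, size=(N, 2)),
                  index=pd.date_range('2016-6-27', periods=N, freq='45min'))

# Take each date range as a label slice, then stack the pieces.
result = pd.concat([df.loc['2016-6-27':'2016-6-27 5:00'],
                    df.loc['2016-6-27 15:00':'2016-6-27 23:59:59']])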

yields

                     0  1
2016-06-27 00:00:00  0  2
2016-06-27 00:45:00  5  5
2016-06-27 01:30:00  9  6
2016-06-27 02:15:00  8  4
2016-06-27 03:00:00  5  0
2016-06-27 03:45:00  4  8
2016-06-27 04:30:00  7  0
2016-06-27 15:00:00  2  5
2016-06-27 15:45:00  6  7
2016-06-27 16:30:00  6  8
2016-06-27 17:15:00  5  1
2016-06-27 18:00:00  2  9
2016-06-27 18:45:00  9  1
2016-06-27 19:30:00  9  7
2016-06-27 20:15:00  3  6
2016-06-27 21:00:00  3  5
2016-06-27 21:45:00  0  8
2016-06-27 22:30:00  5  6
2016-06-27 23:15:00  0  8

Note that, unlike most slicing syntaxes used in Python,

df.loc['2016-6-27':'2016-6-27 5:00']

is inclusive on both ends: the slice defines a closed interval, not a half-open interval.
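To make the closed-interval behaviour concrete, here is a minimal sketch that compares a label slice with an ordinary positional list slice. The frame `demo` and its column name are made up for illustration; they are not part of the example above.

```python
import pandas as pd

# Hypothetical toy frame: one row per hour, starting at midnight.
demo = pd.DataFrame({'value': range(6)},
                    index=pd.date_range('2016-06-27', periods=6, freq='60min'))

# Label slice on the DatetimeIndex: both 01:00 and 03:00 are returned.
closed = demo.loc['2016-06-27 01:00':'2016-06-27 03:00']

# Plain Python slice for comparison: the end position is excluded.
half_open = list(range(6))[1:3]

print(len(closed), len(half_open))
```

The label slice returns three rows (01:00, 02:00, 03:00), while the list slice returns only two elements.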


Using a boolean selection mask:

mask = (((df.index >= '2016-6-27') & (df.index <= '2016-6-27 5:00')) 
        | ((df.index >= '2016-6-27 15:00') & (df.index < '2016-6-28')))
result2 = df.loc[mask]
assert result.equals(result2)
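With more than two ranges, writing the mask by hand gets unwieldy. One possible generalisation, sketched here, builds the mask in a loop from a list of (start, end) pairs. The helper name `select_ranges` is hypothetical, not a pandas function, and it assumes every range is meant to be closed on both ends.

```python
import numpy as np
import pandas as pd

def select_ranges(frame, ranges):
    """Return the rows of `frame` whose index lies in any closed (start, end) range."""
    mask = np.zeros(len(frame), dtype=bool)
    for start, end in ranges:
        start, end = pd.Timestamp(start), pd.Timestamp(end)
        # OR each range's membership test into the running mask.
        mask |= (frame.index >= start) & (frame.index <= end)
    return frame.loc[mask]

# Illustrative data with the same index layout as the example above.
idx = pd.date_range('2016-06-27', periods=100, freq='45min')
frame = pd.DataFrame({'a': np.arange(100)}, index=idx)

ranges = [('2016-06-27 00:00:00', '2016-06-27 05:00:00'),
          ('2016-06-27 15:00:00', '2016-06-27 23:59:59')]
subset = select_ranges(frame, ranges)
```

Because the mask is combined with OR, overlapping ranges do not produce duplicate rows, which pd.concat of overlapping slices would.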

I feel the best option is to use direct comparisons on the index rather than the loc function:

df = df[((df.index >= '2016-6-27') & (df.index <= '2016-6-27 5:00')) 
    | ((df.index >= '2016-6-27 15:00') & (df.index < '2016-6-28'))]

It works for me.

A major issue with using loc with a slice is that the limits may need to be present in the actual index values: if the index is not sorted and a limit is missing, the slice results in a KeyError.
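As far as I can tell, the KeyError concern applies to an index that is not sorted; on a sorted DatetimeIndex the limits do not have to be existing labels. A minimal sketch, using made-up data and variable names, of two ways to stay safe: a boolean mask, which ignores index order, or sorting the index before slicing.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2016-06-27', periods=100, freq='45min')
ordered = pd.DataFrame({'a': np.arange(100)}, index=idx)

# Shuffle the rows so the index is no longer monotonic (fixed seed).
shuffled = ordered.sample(frac=1, random_state=0)

# Note that 05:00 is not an index value (the rows fall on 04:30 and 05:15).
start, end = pd.Timestamp('2016-06-27 00:00'), pd.Timestamp('2016-06-27 05:00')

# Option 1: a boolean mask works regardless of index order.
by_mask = shuffled[(shuffled.index >= start) & (shuffled.index <= end)]

# Option 2: sort the index first, then slice by label.
by_slice = shuffled.sort_index().loc[start:end]

same_rows = by_mask.sort_index().equals(by_slice)
```

Both options select the same rows; the mask result simply keeps the shuffled order until it is sorted.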
