
Find group of consecutive dates in Pandas DataFrame

I am trying to get the chunks of data with consecutive dates from a Pandas DataFrame. My df looks like this:

      DateAnalyzed           Val
1       2018-03-18      0.470253
2       2018-03-19      0.470253
3       2018-03-20      0.470253
4       2018-09-25      0.467729
5       2018-09-26      0.467729
6       2018-09-27      0.467729

In this df, I want to get the first 3 rows, do some processing, and then get the last 3 rows and do some processing on those.

I calculated the difference with a lag of 1 by applying the following code.

df['Delta'] = df['DateAnalyzed'] - df['DateAnalyzed'].shift(1)
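(As an aside, the same one-lag difference can be written with the built-in diff method:)

df['Delta'] = df['DateAnalyzed'].diff()  # equivalent to subtracting .shift(1)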

But after that I can't figure out how to get the groups of consecutive rows without iterating.

It seems like you need two boolean masks: one to determine the breaks between groups, and one to determine which dates are in a group in the first place.

There's also one tricky part that can be fleshed out by example. Notice that df below contains an added row that doesn't have any consecutive dates before or after it.

>>> df
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253
4   2017-01-20  0.485949  # < watch out for this
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

>>> df.dtypes
DateAnalyzed    datetime64[ns]
Val                    float64
dtype: object

The answer below assumes that you want to ignore 2017-01-20 completely, without processing it. (See the end of the answer for a solution if you do want to process this date.)

First:

>>> dt = df['DateAnalyzed']
>>> day = pd.Timedelta('1d')
>>> in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
>>> in_block
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: DateAnalyzed, dtype: bool

Now, in_block will tell you which dates are in a "consecutive" block, but it won't tell you to which groups each date belongs.
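If the combined expression above is hard to parse, the same mask can be built in two named steps (a sketch; the intermediate names are just for illustration):

>>> next_is_adjacent = (dt - dt.shift(-1)).abs() == day  # next row is one day away
>>> prev_is_adjacent = dt.diff() == day                  # previous row is one day away
>>> in_block = next_is_adjacent | prev_is_adjacent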

The next step is to derive the groupings themselves:

>>> filt = df.loc[in_block]
>>> breaks = filt['DateAnalyzed'].diff() != day
>>> groups = breaks.cumsum()
>>> groups
1    1
2    1
3    1
5    2
6    2
7    2
Name: DateAnalyzed, dtype: int64

Then you can call df.groupby(groups) with your operation of choice.

>>> for _, frame in filt.groupby(groups):
...     print(frame, end='\n\n')
... 
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253

  DateAnalyzed       Val
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

To incorporate this back into df, assign to it, and the isolated dates will be NaN:

>>> df['groups'] = groups
>>> df
  DateAnalyzed       Val  groups
1   2018-03-18  0.470253     1.0
2   2018-03-19  0.470253     1.0
3   2018-03-20  0.470253     1.0
4   2017-01-20  0.485949     NaN
5   2018-09-25  0.467729     2.0
6   2018-09-26  0.467729     2.0
7   2018-09-27  0.467729     2.0

If you do want to include the "lone" date, things become a bit more straightforward:

dt = df['DateAnalyzed']
day = pd.Timedelta('1d')
breaks = dt.diff() != day
groups = breaks.cumsum()
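On the 7-row example df above, this variant would give the lone 2017-01-20 row its own one-row group (a sketch of the expected output):

>>> groups
1    1
2    1
3    1
4    2
5    3
6    3
7    3
Name: DateAnalyzed, dtype: int64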

There were similar questions after this one, here and here, with more specific output requirements. Since this one is more general, I would like to contribute here as well.

We can easily assign a unique identifier to consecutive groups with one line of code:

df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()

Here, every time the difference from the previous date is not exactly one day, the cumulative sum increments and starts a new identifier; otherwise it keeps the previous value, so we end up with a unique identifier per group.
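Broken into intermediate steps, the one-liner works like this (a sketch; the intermediate names are just for illustration):

deltas = df.DateAnalyzed.diff()         # timedelta from the previous row (NaT for the first row)
day_gaps = deltas.dt.days               # gap in whole days (NaN for the first row)
starts_group = day_gaps.ne(1)           # True wherever the gap is not exactly one day
df['grp_date'] = starts_group.cumsum()  # running count of group starts -> group id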

See the output:

  DateAnalyzed       Val  grp_date
1   2018-03-18  0.470253         1
2   2018-03-19  0.470253         1
3   2018-03-20  0.470253         1
4   2018-09-25  0.467729         2
5   2018-09-26  0.467729         2
6   2018-09-27  0.467729         2

Now, it's easy to groupby "grp_date" and do whatever you want with apply or agg.


Examples:

# Sum across consecutive days (or any other method from pandas groupby)
df.groupby('grp_date').sum()

# Get the first value and last value per consecutive days
df.groupby('grp_date').apply(lambda x: x.iloc[[0, -1]])
# or df.groupby('grp_date').head(n) for first n days

# Perform custom operation across target-columns
df.groupby('grp_date').apply(lambda x: (x['col1'] + x['col2']) / x['Val'].mean())

# Multiple operations for a target-column
df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])

# and so on...
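For completeness, here is a minimal self-contained sketch that rebuilds the example frame, applies the one-liner, and runs one of the aggregations above:

import pandas as pd

df = pd.DataFrame({
    'DateAnalyzed': pd.to_datetime([
        '2018-03-18', '2018-03-19', '2018-03-20',
        '2018-09-25', '2018-09-26', '2018-09-27',
    ]),
    'Val': [0.470253, 0.470253, 0.470253, 0.467729, 0.467729, 0.467729],
})

# unique id per run of consecutive days
df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()

print(df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std']))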
