
Find group of consecutive dates in Pandas DataFrame

I am trying to get the chunks of data with consecutive dates from a Pandas DataFrame. My df looks like this:

      DateAnalyzed           Val
1       2018-03-18      0.470253
2       2018-03-19      0.470253
3       2018-03-20      0.470253
4       2018-09-25      0.467729
5       2018-09-26      0.467729
6       2018-09-27      0.467729

In this df, I want to get the first 3 rows, do some processing, and then get the last 3 rows and do some processing on those.

I calculated the difference with a lag of 1 by applying the following code.

df['Delta'] = df['DateAnalyzed'] - df['DateAnalyzed'].shift(1)
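(As an aside, the same one-lag difference can be written with the built-in diff method:)

df['Delta'] = df['DateAnalyzed'].diff()  # equivalent to subtracting .shift(1)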

But after that I can't figure out how to get the groups of consecutive rows without iterating.

It seems like you need two boolean masks: one to determine the breaks between groups, and one to determine which dates are in a group in the first place.

There's also one tricky part that can be fleshed out by example. Notice that df below contains an added row that doesn't have any consecutive dates before or after it.

>>> df
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253
4   2017-01-20  0.485949  # < watch out for this
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

>>> df.dtypes
DateAnalyzed    datetime64[ns]
Val                    float64
dtype: object

The answer below assumes that you want to ignore 2017-01-20 completely, without processing it. (See the end of the answer for a solution if you do want to process this date.)

First:

>>> dt = df['DateAnalyzed']
>>> day = pd.Timedelta('1d')
>>> in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
>>> in_block
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: DateAnalyzed, dtype: bool

Now, in_block will tell you which dates are in a "consecutive" block, but it won't tell you to which groups each date belongs.
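If the combined expression above is hard to parse, the same mask can be built in two named steps (a sketch; the intermediate names are just for illustration):

>>> next_is_adjacent = (dt - dt.shift(-1)).abs() == day  # next row is one day away
>>> prev_is_adjacent = dt.diff() == day                  # previous row is one day away
>>> in_block = next_is_adjacent | prev_is_adjacent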

The next step is to derive the groupings themselves:

>>> filt = df.loc[in_block]
>>> breaks = filt['DateAnalyzed'].diff() != day
>>> groups = breaks.cumsum()
>>> groups
1    1
2    1
3    1
5    2
6    2
7    2
Name: DateAnalyzed, dtype: int64

Then you can call df.groupby(groups) with your operation of choice.

>>> for _, frame in filt.groupby(groups):
...     print(frame, end='\n\n')
... 
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253

  DateAnalyzed       Val
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

To incorporate this back into df, assign to it, and the isolated dates will be NaN:

>>> df['groups'] = groups
>>> df
  DateAnalyzed       Val  groups
1   2018-03-18  0.470253     1.0
2   2018-03-19  0.470253     1.0
3   2018-03-20  0.470253     1.0
4   2017-01-20  0.485949     NaN
5   2018-09-25  0.467729     2.0
6   2018-09-26  0.467729     2.0
7   2018-09-27  0.467729     2.0

If you do want to include the "lone" date, things become a bit more straightforward:

dt = df['DateAnalyzed']
day = pd.Timedelta('1d')
breaks = dt.diff() != day
groups = breaks.cumsum()
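On the 7-row example df above, this variant would give the lone 2017-01-20 row its own one-row group (a sketch of the expected output):

>>> groups
1    1
2    1
3    1
4    2
5    3
6    3
7    3
Name: DateAnalyzed, dtype: int64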

There were similar questions after this one, here and here, with more specific output requirements. Since this one is more general, I would like to contribute here as well.

We can easily assign a unique identifier to consecutive groups with one line of code:

df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()

Here, every time the difference from the previous date is not exactly one day, the cumulative sum increments and starts a new identifier; otherwise it keeps the previous value, so we end up with a unique identifier per group.
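Broken into intermediate steps, the one-liner works like this (a sketch; the intermediate names are just for illustration):

deltas = df.DateAnalyzed.diff()         # timedelta from the previous row (NaT for the first row)
day_gaps = deltas.dt.days               # gap in whole days (NaN for the first row)
starts_group = day_gaps.ne(1)           # True wherever the gap is not exactly one day
df['grp_date'] = starts_group.cumsum()  # running count of group starts -> group id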

See the output:

  DateAnalyzed       Val  grp_date
1   2018-03-18  0.470253         1
2   2018-03-19  0.470253         1
3   2018-03-20  0.470253         1
4   2018-09-25  0.467729         2
5   2018-09-26  0.467729         2
6   2018-09-27  0.467729         2

Now, it's easy to groupby "grp_date" and do whatever you want with apply or agg.


Examples:

# Sum across consecutive days (or any other method from pandas groupby)
df.groupby('grp_date').sum()

# Get the first value and last value per consecutive days
df.groupby('grp_date').apply(lambda x: x.iloc[[0, -1]])
# or df.groupby('grp_date').head(n) for first n days

# Perform custom operation across target-columns
df.groupby('grp_date').apply(lambda x: (x['col1'] + x['col2']) / x['Val'].mean())

# Multiple operations for a target-column
df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])

# and so on...
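For completeness, here is a minimal self-contained sketch that rebuilds the example frame, applies the one-liner, and runs one of the aggregations above:

import pandas as pd

df = pd.DataFrame({
    'DateAnalyzed': pd.to_datetime([
        '2018-03-18', '2018-03-19', '2018-03-20',
        '2018-09-25', '2018-09-26', '2018-09-27',
    ]),
    'Val': [0.470253, 0.470253, 0.470253, 0.467729, 0.467729, 0.467729],
})

# unique id per run of consecutive days
df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()

print(df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std']))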
