Find group of consecutive dates in Pandas DataFrame
I am trying to get the chunks of data where there are consecutive dates from a Pandas DataFrame. My df looks like this:
DateAnalyzed Val
1 2018-03-18 0.470253
2 2018-03-19 0.470253
3 2018-03-20 0.470253
4 2018-09-25 0.467729
5 2018-09-26 0.467729
6 2018-09-27 0.467729
In this df, I want to get the first 3 rows, do some processing, and then get the last 3 rows and do processing on those.
I calculated the difference with a lag of 1 by applying the following code:
df['Delta'] = df['DateAnalyzed'] - df['DateAnalyzed'].shift(1)
But after that I can't figure out how to get the groups of consecutive rows without iterating.
It seems like you need two boolean masks: one to determine the breaks between groups, and one to determine which dates are in a group in the first place.
There's also one tricky part that can be fleshed out by example. Notice that the df below contains an added row that doesn't have any consecutive dates before or after it.
>>> df
DateAnalyzed Val
1 2018-03-18 0.470253
2 2018-03-19 0.470253
3 2018-03-20 0.470253
4 2017-01-20 0.485949 # < watch out for this
5 2018-09-25 0.467729
6 2018-09-26 0.467729
7 2018-09-27 0.467729
>>> df.dtypes
DateAnalyzed datetime64[ns]
Val float64
dtype: object
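The dtypes matter here: everything below relies on DateAnalyzed being datetime64, as shown. If your column arrives as plain strings instead, converting it first should do the trick, e.g.:

>>> df['DateAnalyzed'] = pd.to_datetime(df['DateAnalyzed'])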
The answer below assumes that you want to ignore 2017-01-20 completely, without processing it. (See the end of the answer for a solution if you do want to process this date.)
First:
>>> dt = df['DateAnalyzed']
>>> day = pd.Timedelta('1d')
>>> in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
>>> in_block
1 True
2 True
3 True
4 False
5 True
6 True
7 True
Name: DateAnalyzed, dtype: bool
Now, in_block will tell you which dates are in a "consecutive" block, but it won't tell you which group each date belongs to.
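For intuition, the two masks in that expression can be inspected separately; a row survives when either its previous or its next date is exactly one day away (an illustrative breakdown, reusing the dt and day defined above; the variable names here are just for exposition):

>>> prev_adjacent = dt.diff() == day                  # previous row is exactly 1 day earlier
>>> next_adjacent = (dt - dt.shift(-1)).abs() == day  # next row is exactly 1 day later
>>> (prev_adjacent | next_adjacent).equals(in_block)
True

The lone 2017-01-20 row is False in both masks, which is why it drops out.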
The next step is to derive the groupings themselves:
>>> filt = df.loc[in_block]
>>> breaks = filt['DateAnalyzed'].diff() != day
>>> groups = breaks.cumsum()
>>> groups
1 1
2 1
3 1
5 2
6 2
7 2
Name: DateAnalyzed, dtype: int64
Then you can call filt.groupby(groups) with your operation of choice:
>>> for _, frame in filt.groupby(groups):
... print(frame, end='\n\n')
...
DateAnalyzed Val
1 2018-03-18 0.470253
2 2018-03-19 0.470253
3 2018-03-20 0.470253
DateAnalyzed Val
5 2018-09-25 0.467729
6 2018-09-26 0.467729
7 2018-09-27 0.467729
To incorporate this back into df, assign to it, and the isolated dates will be NaN:
>>> df['groups'] = groups
>>> df
DateAnalyzed Val groups
1 2018-03-18 0.470253 1.0
2 2018-03-19 0.470253 1.0
3 2018-03-20 0.470253 1.0
4 2017-01-20 0.485949 NaN
5 2018-09-25 0.467729 2.0
6 2018-09-26 0.467729 2.0
7 2018-09-27 0.467729 2.0
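Since groupby drops NaN keys by default, you can iterate over the blocks straight from this column and the lone date is skipped automatically. A minimal sketch:

>>> for g, frame in df.groupby('groups'):
...     print(g, len(frame))
...
1.0 3
2.0 3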
If you do want to include the "lone" date, things become a bit more straightforward:
dt = df['DateAnalyzed']
day = pd.Timedelta('1d')
breaks = dt.diff() != day
groups = breaks.cumsum()
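There is no filtering step in this variant, so the lone date simply becomes a block of size one; on the seven-row df above, groups comes out as:

>>> groups
1    1
2    1
3    1
4    2
5    3
6    3
7    3
Name: DateAnalyzed, dtype: int64

with 2017-01-20 alone in group 2.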
There were similar questions asked after this one here and here, with more specific output requirements. Since this one is more general, I would like to contribute here as well.
We can easily assign a unique identifier to consecutive groups with one line of code:
df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()
Here, every time the difference from the previous date is not exactly one day (including the NaN produced on the first row), the cumulative sum increments; otherwise the identifier keeps its previous value, so we end up with a unique identifier per group of consecutive dates.
See the output:
DateAnalyzed Val grp_date
1 2018-03-18 0.470253 1
2 2018-03-19 0.470253 1
3 2018-03-20 0.470253 1
4 2018-09-25 0.467729 2
5 2018-09-26 0.467729 2
6 2018-09-27 0.467729 2
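For intuition, the chain can be unpacked step by step: the diff of the first row is NaT, .dt.days turns that into NaN, and .ne(1) maps both the NaN and the 189-day jump to True, so the cumulative sum increments exactly at each break. A sketch of the intermediate values for the df above:

d = df.DateAnalyzed.diff()    # NaT, 1 days, 1 days, 189 days, 1 days, 1 days
days = d.dt.days              # NaN, 1.0, 1.0, 189.0, 1.0, 1.0
is_break = days.ne(1)         # True, False, False, True, False, False
grp_date = is_break.cumsum()  # 1, 1, 1, 2, 2, 2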
Now it's easy to groupby "grp_date" and do whatever you want with apply or agg.
Examples:
# Sum Val across consecutive days (or use any other pandas groupby method)
df.groupby('grp_date')['Val'].sum()

# Get the first and last row per block of consecutive days
df.groupby('grp_date').apply(lambda x: x.iloc[[0, -1]])
# or df.groupby('grp_date').head(n) for the first n days

# Perform a custom operation across target columns
# (col1 and col2 are placeholders for columns in your own data)
df.groupby('grp_date').apply(lambda x: (x['col1'] + x['col2']) / x['Val'].mean())

# Multiple operations for a target column
df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])

# and so on...
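Since Val is constant within each block of the sample df, the last aggregation, for example, would come out along these lines (std of identical values being 0):

df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])
#                min       max      mean  std
# grp_date
# 1         0.470253  0.470253  0.470253  0.0
# 2         0.467729  0.467729  0.467729  0.0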