How to group entries by consecutive dates in Python/Pandas

I have a pandas Series called hot_days that looks like the following:

0     1980-06-04
1     1981-08-05
2     1982-06-04
3     1982-06-05
4     1982-07-08
         ...    
294   2019-07-25
295   2019-08-24
296   2019-08-25
297   2019-08-26
298   2019-08-27

It is a list of dates on which the temperature at a given location is above a threshold. I want to detect and record when a heatwave occurs, which is when the temperature is over this threshold for three or more consecutive days. I want to end up with a DataFrame containing the date each heatwave started and its length. By applying:

new_series = (hot_days == hot_days.shift(2)+pd.Timedelta("2 days")) * (hot_days.groupby((hot_days == hot_days.shift(2)+pd.Timedelta("2 days")).cumsum()).cumcount()+1)

I get the series:

1      0
2      0
3      0
4      0
      ..
294    1
295    0
296    0
297    1
298    1

This has a 1 for dates during a heatwave and a 0 for dates that are not in a heatwave, which I believe is a step in the right direction. However, since I'm new to pandas, I'm not sure how to get from here to my goal. I know I can use loops, but I understand that is considered 'un-Pythonic' because loops are slow in Python, so I'd rather find a more elegant solution (although the dataset is small enough that loops would run in a reasonable amount of time).

Let's call s the initial Series.

Identify the heatwave days. Note that this flags only the third and later days of each run of consecutive dates:

waves = s.eq(s.shift(1)+pd.DateOffset(days=1)) & s.eq(s.shift(2)+pd.DateOffset(days=2))

Create a DataFrame with the wave flag and the wave groups:

df = pd.concat({'date': s,
                'wave': waves,
                'group': waves.diff(1).ne(0).cumsum()
                }, axis=1)

List the waves and their duration:

pd.DataFrame({gid: pd.Series({'start': g.iloc[0]['date'],
                              'end': g.iloc[-1]['date'],
                              'duration': len(g)})
              for gid, g in df[df['wave']].groupby('group')
              }).T

Output:

       start        end duration
2 2019-08-26 2019-08-27        2
         

N.B. My results are slightly different because I worked from an incomplete dataset.
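Because the wave flag above is true only from the third day of a run, the reported start and duration come out two days short of the full run. A minimal sketch of one alternative, assuming the dates are sorted and unique: label each run of consecutive dates directly, then keep the runs of three or more days. The sample dates are just the ones visible in the question, and the names run_id, runs and heatwaves are made up for illustration.

```python
import pandas as pd

# Illustrative sample: only the dates visible in the question (the full data is elided there).
hot_days = pd.Series(pd.to_datetime([
    "1980-06-04", "1981-08-05", "1982-06-04", "1982-06-05", "1982-07-08",
    "2019-07-25", "2019-08-24", "2019-08-25", "2019-08-26", "2019-08-27",
]))

# A new run starts wherever the gap to the previous date is not exactly one day.
# The first row has no previous date (NaT), so it also starts a run.
run_id = hot_days.diff().ne(pd.Timedelta(days=1)).cumsum()

# One row per run: first date, last date, and number of days in the run.
runs = hot_days.groupby(run_id).agg(start="first", end="last", duration="count")

# Keep runs of three or more consecutive days (the heatwave definition in the question).
heatwaves = runs[runs["duration"] >= 3].reset_index(drop=True)
print(heatwaves)
```

On this sample the only run that qualifies is the one covering 2019-08-24 through 2019-08-27.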

Edit: here is how waves.diff(1).ne(0).cumsum() works:

    bool   diff  diff_int  diff_not  diff_not_int  diff_not_cumsum
0   True    NaN       NaN      True             1                1
1  False   True      -1.0      True             1                2
2  False  False       0.0     False             0                2
3   True   True       1.0      True             1                3
4   True  False       0.0     False             0                3
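The labelling in the table can be reproduced on a toy boolean Series. This is a sketch rather than the answer's exact expression: it uses ne(shift()) instead of diff(1).ne(0), which flags the same positions (those where the value differs from the previous row). The names flags, changed and group are illustrative.

```python
import pandas as pd

# Same boolean column as in the table above.
flags = pd.Series([True, False, False, True, True])

# True wherever the value differs from the previous row.
# The first row is compared with NaN, so it counts as a change.
changed = flags.ne(flags.shift())

# The running total of changes gives one integer label per run of equal values.
group = changed.cumsum()
print(pd.DataFrame({"bool": flags, "changed": changed, "group": group}))
```

Rows that share a label belong to the same run, so grouping on it keeps each stretch of True (or False) values together.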

We can use shifting to build a running count within each run of consecutive equal values, which you can then use to pick out runs of a given length.

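import pandas as pd
s = pd.Series([0,1,1,0,0,0,1,1,1,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,0,0])
# (s != s.shift()).cumsum() labels each run; cumcount() + 1 counts positions within the run
s * (s.groupby((s != s.shift()).cumsum()).cumcount() + 1)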

To get:

0     0
1     1
2     2
3     0
4     0
5     0
6     1
7     2
8     3
9     4
10    5
11    0
12    0
13    1
14    2
15    0
16    0
17    0
18    0
19    1
20    2
21    0
22    0
23    1
24    0
25    0

Or we can loop over a groupby to get a separate list for each run.

df = pd.DataFrame({"a": s})
# Group on the run label itself (not a one-element list) so each key i is a plain integer
for i, g in df.groupby((df.a != df.a.shift()).cumsum()):
    print(i, end="")
    print(g)
    print(g.a.tolist())
    print("--")

To get:

1   a
0  0
[0]
--
2   a
1  1
2  1
[1, 1]
--
3   a
3  0
4  0
5  0
[0, 0, 0]
--
4    a
6   1
7   1
8   1
9   1
10  1
[1, 1, 1, 1, 1]
--
5    a
11  0
12  0
[0, 0]
--
6    a
13  1
14  1
[1, 1]
--
7    a
15  0
16  0
17  0
18  0
[0, 0, 0, 0]
--
8    a
19  1
20  1
[1, 1]
--
9    a
21  0
22  0
[0, 0]
--
10    a
23  1
[1]
--
11    a
24  0
25  0
[0, 0]
--
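If the goal is a summary table rather than printed groups, the same run label can feed an aggregation instead of a loop. A minimal sketch, assuming the 0/1 Series from above; the column names value, pos, start and length and the threshold of 3 are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([0,1,1,0,0,0,1,1,1,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,0,0])

# Label each run of consecutive equal values.
run_id = (s != s.shift()).cumsum()

# Carry the original position alongside the value so it survives the groupby.
frame = pd.DataFrame({"value": s, "pos": s.index, "run": run_id})

# One row per run: which value it holds, where it starts, and how long it is.
runs = frame.groupby("run").agg(
    value=("value", "first"),
    start=("pos", "first"),
    length=("pos", "size"),
)

# Keep only runs of 1s that last at least three positions.
long_runs = runs[(runs["value"] == 1) & (runs["length"] >= 3)]
print(long_runs)
```

For this Series the only qualifying run starts at position 6 and has length 5.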
