
How to group Pandas DataFrame dates into custom date range bins using groupby/cut

I am trying to group dates into custom ranges using groupby and cut, with no success so far. From the error message being returned, I wonder whether cut is trying to process my dates as numbers.

I want to group df1['date'] by custom date ranges and then sum the df1['HDD'] values. The custom ranges are found in df2:

import pandas as pd
df1 = pd.DataFrame({'date': ['2/1/2015', '3/2/2015', '3/3/2015', '3/4/2015', '4/17/2015', '5/12/2015'],
                    'HDD': ['7.5', '8', '5', '23', '11', '55']})
    HDD  date
0   7.5 2/1/2015
1   8   3/2/2015
2   5   3/3/2015
3   23  3/4/2015
4   11  4/17/2015
5   55  5/12/2015

df2 has the custom date ranges:

df2 = pd.DataFrame( {'Period': ['One','Two','Three','Four'],
                     'Start Dates': ['1/1/2015','2/15/2015','3/14/2015','4/14/2015'],
                     'End Dates' : ['2/14/2015','3/13/2015','4/13/2015','5/10/2015']})

    Period  Start Dates End Dates
0   One     1/1/2015    2/14/2015
1   Two     2/15/2015   3/13/2015
2   Three   3/14/2015   4/13/2015
3   Four    4/14/2015   5/10/2015

My desired output is df1 grouped by the custom date ranges, with the HDD values aggregated for each Period. It should look something like this:

   Period    HDD
0  One       7.5
1  Two       36
2  Three     0
3  Four      11

Here is one example of what I have tried for custom grouping:

df3 = df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

...and here is the error I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-103-55ea779bcd73> in <module>()
----> 1 df3 = df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

/opt/conda/lib/python3.5/site-packages/pandas/tools/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest)
    112     else:
    113         bins = np.asarray(bins)
--> 114         if (np.diff(bins) < 0).any():
    115             raise ValueError('bins must increase monotonically.')
    116 

/opt/conda/lib/python3.5/site-packages/numpy/lib/function_base.py in diff(a, n, axis)
   1576         return diff(a[slice1]-a[slice2], n-1, axis=axis)
   1577     else:
-> 1578         return a[slice1]-a[slice2]
   1579 
   1580 

TypeError: unsupported operand type(s) for -: 'str' and 'str'
  • Is cut trying to process my date ranges as numbers?
  • Do I need to explicitly convert my dates to datetime objects (I tried this, but maybe I was going about it incorrectly)?

Thanks for any suggestions offered!

This works if you convert all your dates from dtype string to datetime.

df1['date'] = pd.to_datetime(df1['date'])

df2['End Dates'] = pd.to_datetime(df2['End Dates'])

df2['Start Dates'] = pd.to_datetime(df2['Start Dates'])

df1['HDD'] = df1['HDD'].astype(float)

df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

Output:

                           HDD
date                          
(2015-01-01, 2015-02-15]   7.5
(2015-02-15, 2015-03-14]  36.0
(2015-03-14, 2015-04-14]   NaN
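
Note that the bins above are built only from Start Dates, so there are three intervals for four periods, and pd.cut closes each interval on the right by default. A minimal sketch of what that means for a date that falls exactly on a start date (the variable names are illustrative, not from the question):

```python
import pandas as pd

# Bin edges taken from the first three start dates in df2
edges = pd.to_datetime(['2015-01-01', '2015-02-15', '2015-03-14'])

# A single date that sits exactly on the second start date
dates = pd.Series(pd.to_datetime(['2015-02-15']))

# Default right=True: intervals are (left, right], so 2015-02-15
# lands in the interval that ENDS on that date (period One's bin)
right_closed = pd.cut(dates, edges)

# right=False: intervals are [left, right), so 2015-02-15
# lands in the interval that STARTS on that date (period Two's bin)
left_closed = pd.cut(dates, edges, right=False)

print(right_closed.iloc[0])
print(left_closed.iloc[0])
```

If the start dates should belong to the period they open, right=False is probably the setting to use.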

Adding labels:

df1.groupby(pd.cut(df1['date'], df2['Start Dates'], labels=df2['Period'].iloc[:-1])).agg({'HDD': sum})

Output:

        HDD
date       
One     7.5
Two    36.0
Three   NaN
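
The output above still omits period Four, because only the start dates were used as bin edges. One possible way to get all four periods, sketched here under the assumption that the periods are contiguous (each one starts the day after the previous one ends, as they do in df2), is to append the day after the last end date as a final edge and use left-closed bins. The names edges, period and result are illustrative and not part of the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['2/1/2015', '3/2/2015', '3/3/2015', '3/4/2015', '4/17/2015', '5/12/2015'],
    'HDD': ['7.5', '8', '5', '23', '11', '55'],
})
df2 = pd.DataFrame({
    'Period': ['One', 'Two', 'Three', 'Four'],
    'Start Dates': ['1/1/2015', '2/15/2015', '3/14/2015', '4/14/2015'],
    'End Dates': ['2/14/2015', '3/13/2015', '4/13/2015', '5/10/2015'],
})

# Convert the strings to real datetimes and floats before binning
df1['date'] = pd.to_datetime(df1['date'], format='%m/%d/%Y')
df1['HDD'] = df1['HDD'].astype(float)
starts = pd.to_datetime(df2['Start Dates'], format='%m/%d/%Y')
ends = pd.to_datetime(df2['End Dates'], format='%m/%d/%Y')

# Assumption: periods are contiguous, so every start date is an edge and
# the day after the final end date closes the last bin
edges = pd.DatetimeIndex(starts.tolist() + [ends.iloc[-1] + pd.Timedelta(days=1)])

# right=False gives [start, next start) bins, so a start date belongs to
# the period it opens; dates outside every bin become NaN
period = pd.cut(df1['date'], bins=edges, right=False,
                labels=df2['Period'].tolist())

# observed=False keeps periods that contain no rows (here: Three)
result = (
    df1.groupby(period, observed=False)['HDD']
       .sum()
       .rename_axis('Period')
       .reset_index()
)
print(result)
```

With a recent pandas version the empty period should sum to 0 rather than NaN, and the 5/12/2015 row is dropped because it lies after the last end date. If the periods had gaps between them, this edge-based approach would assign dates in a gap to the preceding period, so it would need adjusting.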
