How to group Pandas DataFrame dates into custom date range bins using groupby/cut
I am trying to group dates with a custom range using groupby and cut, with no success so far. From the error message being returned, I wonder if cut is trying to process my dates as numbers.
I want to group df1['date'] by custom date ranges and then sum the df1['HDD'] values. The custom ranges are found in df2:
import pandas as pd
df1 = pd.DataFrame( {'date': ['2/1/2015', '3/2/2015', '3/3/2015', '3/4/2015','4/17/2015','5/12/2015'],
'HDD' : ['7.5','8','5','23','11','55']})
HDD date
0 7.5 2/1/2015
1 8 3/2/2015
2 5 3/3/2015
3 23 3/4/2015
4 11 4/17/2015
5 55 5/12/2015
df2 has the custom date ranges:
df2 = pd.DataFrame( {'Period': ['One','Two','Three','Four'],
'Start Dates': ['1/1/2015','2/15/2015','3/14/2015','4/14/2015'],
'End Dates' : ['2/14/2015','3/13/2015','4/13/2015','5/10/2015']})
Period Start Dates End Dates
0 One 1/1/2015 2/14/2015
1 Two 2/15/2015 3/13/2015
2 Three 3/14/2015 4/13/2015
3 Four 4/14/2015 5/10/2015
My desired output is to group df1 by the custom date ranges and aggregate the HDD values for each Period. It should look something like this:
Period HDD
0 One 7.5
1 Two 36
2 Three 0
3 Four 11
Here is one example of what I have tried for custom grouping:
df3 = df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})
...and here is the error I get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-103-55ea779bcd73> in <module>()
----> 1 df3 = df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})
/opt/conda/lib/python3.5/site-packages/pandas/tools/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest)
112 else:
113 bins = np.asarray(bins)
--> 114 if (np.diff(bins) < 0).any():
115 raise ValueError('bins must increase monotonically.')
116
/opt/conda/lib/python3.5/site-packages/numpy/lib/function_base.py in diff(a, n, axis)
1576 return diff(a[slice1]-a[slice2], n-1, axis=axis)
1577 else:
-> 1578 return a[slice1]-a[slice2]
1579
1580
TypeError: unsupported operand type(s) for -: 'str' and 'str'
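Thanks for any suggestions offered!
This works if you convert all your dates from dtype string to datetime. The traceback shows cut calling np.diff on the bins, which cannot subtract strings. The HDD column also has to be numeric before it can be summed.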
df1['date'] = pd.to_datetime(df1['date'])
df2['End Dates'] = pd.to_datetime(df2['End Dates'])
df2['Start Dates'] = pd.to_datetime(df2['Start Dates'])
df1['HDD'] = df1['HDD'].astype(float)
df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})
Output:
HDD
date
(2015-01-01, 2015-02-15] 7.5
(2015-02-15, 2015-03-14] 36.0
(2015-03-14, 2015-04-14] NaN
Adding labels:
df1.groupby(pd.cut(df1['date'], df2['Start Dates'], labels=df2['Period'].iloc[:-1])).agg({'HDD': sum})
Output:
HDD
date
One 7.5
Two 36.0
Three NaN
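Note that the approach above uses only the start dates as bin edges, so the 4/17/2015 row falls outside the last bin and Period Four never appears, and each bin excludes its own start date. A minimal sketch of one way to get the full desired table, assuming the periods in df2 do not overlap and that both the start and end dates should be inclusive (an assumption, not something the question states), is to build a pd.IntervalIndex from both date columns and look each date up in it. The names intervals, pos, in_range, periods and result are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['2/1/2015', '3/2/2015', '3/3/2015', '3/4/2015', '4/17/2015', '5/12/2015'],
    'HDD': ['7.5', '8', '5', '23', '11', '55'],
})
df2 = pd.DataFrame({
    'Period': ['One', 'Two', 'Three', 'Four'],
    'Start Dates': ['1/1/2015', '2/15/2015', '3/14/2015', '4/14/2015'],
    'End Dates': ['2/14/2015', '3/13/2015', '4/13/2015', '5/10/2015'],
})

# Convert the string columns to real datetimes and floats first.
df1['date'] = pd.to_datetime(df1['date'], format='%m/%d/%Y')
df1['HDD'] = df1['HDD'].astype(float)
starts = pd.to_datetime(df2['Start Dates'], format='%m/%d/%Y')
ends = pd.to_datetime(df2['End Dates'], format='%m/%d/%Y')

# One interval per period; closed='both' is an assumption about the
# intended boundaries (start and end date both belong to the period).
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')

# Position of the matching interval for each date, -1 if no period matches.
pos = intervals.get_indexer(df1['date'])
in_range = pos >= 0
periods = df2['Period'].to_numpy()[pos[in_range]]

# Sum HDD per period, then restore df2's order and fill empty periods with 0.
result = (
    df1.loc[in_range, 'HDD']
    .groupby(periods)
    .sum()
    .reindex(df2['Period'].tolist(), fill_value=0.0)
    .rename_axis('Period')
    .reset_index()
)
print(result)
```

With the sample data, the 5/12/2015 row lies after the last end date and is left out, and the totals for One through Four come out as 7.5, 36, 0 and 11, matching the desired output in the question.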