Pandas time series resampling
I have a list of voyages, each with a start date, an end date, and the earnings for that voyage. I want to calculate the monthly earnings, but I am not sure how to do that with Pandas:
'2016-02-28 07:30:00', '2016-04-30 00:00:00', '600000'
'2016-05-18 10:30:00', '2016-07-12 02:19:00', '700000'
The way I do this manually is to calculate how many days of the voyage fall in each month and multiply that by earnings / total length of the voyage.
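The manual calculation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `split_by_month` is hypothetical, and it prorates by exact elapsed time (to the second) rather than by whole days.

```python
from datetime import datetime

def split_by_month(start, end, earnings):
    """Allocate earnings to calendar months in proportion to elapsed time."""
    total = (end - start).total_seconds()
    shares = {}
    cursor = start
    while cursor < end:
        # first instant of the month after the one `cursor` is in
        if cursor.month == 12:
            next_month = datetime(cursor.year + 1, 1, 1)
        else:
            next_month = datetime(cursor.year, cursor.month + 1, 1)
        stop = min(next_month, end)
        # this month's share is its fraction of the whole voyage
        shares[(cursor.year, cursor.month)] = (
            earnings * (stop - cursor).total_seconds() / total
        )
        cursor = stop
    return shares

# first sample voyage from the question
shares = split_by_month(datetime(2016, 2, 28, 7, 30), datetime(2016, 4, 30), 600000)
for (year, month), amount in shares.items():
    print(year, month, round(amount, 2))
```

Keying the result by (year, month) keeps voyages that span more than one year from merging the same month of different years.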
You need to count how many hours fall in each month for the date range in each row. So use DataFrame.apply with a custom function that builds a date_range, groups it by month with groupby, and aggregates with size.
print (df)
start end price
0 2016-02-28 07:30:00 2016-04-30 00:00:00 600000
1 2016-05-18 10:30:00 2016-07-12 02:19:00 700000
print (df.dtypes)
start datetime64[ns]
end datetime64[ns]
price int64
dtype: object
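The thread does not show how df was created. A minimal sketch, assuming the two sample rows from the question are parsed with pd.to_datetime and the columns are named as in the printout above:

```python
import pandas as pd

# sample voyages from the question; column names follow the printed frame
df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})

print(df)
print(df.dtypes)
```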
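def f(x):
    # one timestamp per hour, stepping from start up to (at most) end
    rng = pd.date_range(x.start, x.end, freq='h')
    # count the hourly timestamps that fall in each calendar month
    return rng.to_series().groupby([rng.month]).size()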
df1 = df.apply(f, axis=1)
print (df1)
2 3 4 5 6 7
0 41.0 744.0 696.0 NaN NaN NaN
1 NaN NaN NaN 326.0 720.0 266.0
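Grouping by rng.month alone labels columns by month number, so a voyage longer than a year would merge, say, February 2016 with February 2017. A possible variation, not part of the original answer, labels each hour with a year-month string instead. The helper name `hours_per_year_month` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})

def hours_per_year_month(row):
    # hourly timestamps stepping from the voyage start up to (at most) its end
    rng = pd.date_range(row['start'], row['end'], freq='h')
    # label every hour as 'YYYY-MM' and count the labels
    labels = pd.Series(rng.strftime('%Y-%m'))
    return labels.value_counts().sort_index()

hours = df.apply(hours_per_year_month, axis=1).sort_index(axis=1)
print(hours)
```

The counts match the month-number version for this sample, since both voyages stay within 2016.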
Then get price_per_hour by dividing the price column by the sum of all hours in each row:
price_per_hour = df.price / df1.sum(axis=1)
print (price_per_hour)
0 405.131668
1 533.536585
dtype: float64
Finally, multiply the hours in each month by price_per_hour using mul:
print (df1.mul(price_per_hour, axis=0))
2 3 4 5 6 \
0 16610.398379 301417.960837 281971.640783 NaN NaN
1 NaN NaN NaN 173932.926829 384146.341463
7
0 NaN
1 141920.731707
# check: each row sums back to the original price
print (df1.mul(price_per_hour, axis=0).sum(axis=1))
0 600000.0
1 700000.0
dtype: float64
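The three steps above (count hours per month, derive a per-hour rate, multiply back) can be collected into one function. This is a sketch only: the name `monthly_earnings` and its `freq` parameter are assumptions for illustration, and the final column sum is one possible way to get a total per month across all voyages.

```python
import pandas as pd

def monthly_earnings(df, freq='h'):
    """Spread each row's price over months in proportion to counted time steps."""
    def count_steps(row):
        rng = pd.date_range(row['start'], row['end'], freq=freq)
        return rng.to_series().groupby([rng.month]).size()

    counts = df.apply(count_steps, axis=1)      # steps per month, one row per voyage
    rate = df['price'] / counts.sum(axis=1)     # price per step for each voyage
    return counts.mul(rate, axis=0)             # earnings per month per voyage

df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})

allocated = monthly_earnings(df)
per_month = allocated.sum()   # NaN entries are skipped, giving one total per month
print(per_month)
```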
You can also allocate prices per day instead of per hour by changing the hourly frequency to freq='D', but I think it is less accurate:
def f(x):
    # one timestamp per 24-hour step, starting at the voyage start time
    rng = pd.date_range(x.start, x.end, freq='D')
    return rng.to_series().groupby([rng.month]).size()
df1 = df.apply(f, axis=1)
print (df1)
2 3 4 5 6 7
0 2.0 31.0 29.0 NaN NaN NaN
1 NaN NaN NaN 14.0 30.0 11.0
price_per_hour = df.price / df1.sum(axis=1)
print (price_per_hour)
0 9677.419355
1 12727.272727
dtype: float64
print (df1.mul(price_per_hour, axis=0))
2 3 4 5 6 7
0 19354.83871 300000.0 280645.16129 NaN NaN NaN
1 NaN NaN NaN 178181.818182 381818.181818 140000.0
print (df1.mul(price_per_hour, axis=0).sum(axis=1))
0 600000.0
1 700000.0
dtype: float64
Another solution reshapes the data with melt, then uses groupby and resample. It also needs a groupby by month that aggregates with size:
df['count'] = df.index
df1 = pd.melt(df, id_vars=['price', 'count'], value_name='dates')
print (df1)
price count variable dates
0 600000 0 start 2016-02-28 07:30:00
1 700000 1 start 2016-05-18 10:30:00
2 600000 0 end 2016-04-30 00:00:00
3 700000 1 end 2016-07-12 02:19:00
df2 = df1.set_index('dates').groupby('count').resample('D').size()
print (df2)
count dates
0 2016-02-28 1
2016-02-29 0
2016-03-01 0
2016-03-02 0
2016-03-03 0
2016-03-04 0
2016-03-05 0
2016-03-06 0
2016-03-07 0
2016-03-08 0
2016-03-09 0
2016-03-10 0
2016-03-11 0
2016-03-12 0
...
...
print (df2.index.get_level_values('dates').month)
[2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5
5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7
7 7 7 7 7 7 7 7]
df3 = df2.groupby([df2.index.get_level_values('count'),
                   df2.index.get_level_values('dates').month]).size().unstack()
print (df3)
2 3 4 5 6 7
count
0 2.0 31.0 30.0 NaN NaN NaN
1 NaN NaN NaN 14.0 30.0 12.0
price_per_hour = df.price / df3.sum(axis=1)
print (price_per_hour)
0 9523.809524
1 12500.000000
dtype: float64
print (df3.mul(price_per_hour, axis=0))
2 3 4 5 6 \
count
0 19047.619048 295238.095238 285714.285714 NaN NaN
1 NaN NaN NaN 175000.0 375000.0
7
count
0 NaN
1 150000.0
print (df3.mul(price_per_hour, axis=0).sum(axis=1))
count
0 600000.0
1 700000.0
dtype: float64
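The day counts here (30 for April, 12 for July) differ from the freq='D' version with apply (29 and 11). A likely explanation is that date_range steps in 24-hour increments from the start time and stops before the end, while resample('D') bins by calendar day and includes the end day. A small sketch for the first voyage, with variable names chosen for illustration:

```python
import pandas as pd

start = pd.Timestamp('2016-02-28 07:30:00')
end = pd.Timestamp('2016-04-30 00:00:00')

# 24-hour steps from 07:30; the last step that fits is 2016-04-29 07:30
stepped = pd.date_range(start, end, freq='D')

# whole calendar days from the start date through the end date
calendar_days = pd.date_range(start.normalize(), end.normalize(), freq='D')

print(len(stepped))        # 62
print(len(calendar_days))  # 63
```

Neither count is exact: the voyage starts partway through 28 February and ends at the very start of 30 April, which is why the hourly version gives a finer split.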