Pandas time series resampling

Question

I have a list of voyages, each with a start date, an end date, and the earnings for that voyage. I want to calculate the monthly earnings, but I am not sure how to do that using Pandas:

'2016-02-28 07:30:00', '2016-04-30 00:00:00', '600000'
'2016-05-18 10:30:00', '2016-07-12 02:19:00', '700000'

The way I do this manually is to calculate how many days of the voyage fall in each month and multiply that by earnings / total length of the voyage.
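
The manual method described above can be sketched in plain Python before turning to pandas. This is only an illustration under my own assumptions: the function name monthly_shares is hypothetical, and it prorates by exact elapsed seconds rather than whole days.

```python
from datetime import datetime

def monthly_shares(start, end, earnings):
    """Split earnings across calendar months in proportion to elapsed time."""
    total_seconds = (end - start).total_seconds()
    shares = {}
    cursor = start
    while cursor < end:
        # First instant of the month after the one containing `cursor`.
        if cursor.month == 12:
            next_month = datetime(cursor.year + 1, 1, 1)
        else:
            next_month = datetime(cursor.year, cursor.month + 1, 1)
        # The current segment stops at the month boundary or at the voyage end.
        segment_end = min(next_month, end)
        seconds = (segment_end - cursor).total_seconds()
        shares[(cursor.year, cursor.month)] = earnings * seconds / total_seconds
        cursor = segment_end
    return shares

voyage = monthly_shares(datetime(2016, 2, 28, 7, 30), datetime(2016, 4, 30), 600000)
for (year, month), amount in voyage.items():
    print(year, month, round(amount, 2))
```

The shares are keyed by (year, month), so a voyage that crosses a year boundary keeps its months separate.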

Answer 1

You need to count how many hours of each date range (that is, of each row) fall in each month. Use DataFrame.apply with a custom function that builds an hourly date_range, groups it by month with groupby, and aggregates with size.

print (df)
                start                 end   price
0 2016-02-28 07:30:00 2016-04-30 00:00:00  600000
1 2016-05-18 10:30:00 2016-07-12 02:19:00  700000

print (df.dtypes)
start    datetime64[ns]
end      datetime64[ns]
price             int64
dtype: object
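
The question gives the values as quoted strings, so the frame printed above has to be built with real datetime columns first. A minimal sketch, assuming the column names start, end and price shown in the output:

```python
import pandas as pd

# Column names follow the printed frame; the values come from the question.
df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})
print(df.dtypes)
```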

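import pandas as pd

def f(x):
    # one timestamp per hour between the start and end of the voyage
    rng = pd.date_range(x.start, x.end, freq='h')
    # count how many of those hourly timestamps fall in each month
    return rng.to_series().groupby([rng.month]).size()

df1 = df.apply(f, axis=1)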
print (df1)
      2      3      4      5      6      7
0  41.0  744.0  696.0    NaN    NaN    NaN
1   NaN    NaN    NaN  326.0  720.0  266.0

Then get price_per_hour by dividing the price column by the total number of hours in each row:

price_per_hour = df.price / df1.sum(axis=1)
print (price_per_hour)
0    405.131668
1    533.536585
dtype: float64

Finally, use mul to multiply the hours in each month by price_per_hour:

print (df1.mul(price_per_hour, axis=0))
              2              3              4              5              6  \
0  16610.398379  301417.960837  281971.640783            NaN            NaN   
1           NaN            NaN            NaN  173932.926829  384146.341463   

               7  
0            NaN  
1  141920.731707  

# check: each row sums back to the original price
print (df1.mul(price_per_hour, axis=0).sum(axis=1))
0    600000.0
1    700000.0
dtype: float64
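
The table above has one row per voyage, while the question asks for earnings per month. One possible way to collapse it is sketched below. Grouping by calendar period (to_period('M')) instead of month number is my own choice, made so that the same month in different years is not merged. The variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})

per_voyage = []
for row in df.itertuples(index=False):
    # Hourly timestamps of this voyage, counted per calendar month.
    rng = pd.date_range(row.start, row.end, freq='h')
    hours = rng.to_series().groupby(rng.to_period('M')).size()
    # Prorate the voyage price by the share of hours in each month.
    per_voyage.append(hours / hours.sum() * row.price)

# Stack all voyages and add up the amounts that land in the same month.
monthly = pd.concat(per_voyage).groupby(level=0).sum()
print(monthly.round(2))
```

If two voyages overlap in the same month, their shares are added together in monthly.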

You can also count by days instead of hours by changing freq='h' to freq='D', but I think that is less accurate:

def f(x):
    rng = pd.date_range(x.start, x.end, freq='D')
    return rng.to_series().groupby([rng.month]).size()

df1 = df.apply(f, axis=1)
print (df1)
     2     3     4     5     6     7
0  2.0  31.0  29.0   NaN   NaN   NaN
1  NaN   NaN   NaN  14.0  30.0  11.0

price_per_hour = df.price / df1.sum(axis=1)
print (price_per_hour)
0     9677.419355
1    12727.272727
dtype: float64

print (df1.mul(price_per_hour, axis=0))
             2         3             4              5              6         7
0  19354.83871  300000.0  280645.16129            NaN            NaN       NaN
1          NaN       NaN           NaN  178181.818182  381818.181818  140000.0

print (df1.mul(price_per_hour, axis=0).sum(axis=1))
0    600000.0
1    700000.0
dtype: float64
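
To put a number on the accuracy difference, the February share of the first voyage can be compared under three approaches. This is a rough sketch using only the standard library. The counts 41 of 1481 hours and 2 of 62 days are taken from the tables above, and the exact figure prorates by elapsed time, which is my own reference point rather than part of the answer.

```python
from datetime import datetime

start = datetime(2016, 2, 28, 7, 30)
end = datetime(2016, 4, 30)
price = 600000

# Exact share: time elapsed before 1 March divided by the whole voyage.
exact = price * ((datetime(2016, 3, 1) - start) / (end - start))
# Hourly counting: 41 of 1481 hourly timestamps fall in February.
hourly = price * 41 / 1481
# Daily counting: 2 of 62 daily timestamps fall in February.
daily = price * 2 / 62

print(round(exact, 2), round(hourly, 2), round(daily, 2))
```

The hourly figure lands much closer to the exact one than the daily figure does.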

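Another solution reshapes the data with melt and then uses groupby and resample. It also needs a groupby by month, aggregated with size. Note that this variant counts every calendar day from start to end inclusive, so the day counts differ slightly from the date_range version (for example, 30 rather than 29 days in April for the first voyage):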

df['count'] = df.index
df1 = pd.melt(df, id_vars=['price', 'count'], value_name='dates')
print (df1)
    price  count variable               dates
0  600000      0    start 2016-02-28 07:30:00
1  700000      1    start 2016-05-18 10:30:00
2  600000      0      end 2016-04-30 00:00:00
3  700000      1      end 2016-07-12 02:19:00

df2 = df1.set_index('dates').groupby('count').resample('D').size()
print (df2)
count  dates     
0      2016-02-28    1
       2016-02-29    0
       2016-03-01    0
       2016-03-02    0
       2016-03-03    0
       2016-03-04    0
       2016-03-05    0
       2016-03-06    0
       2016-03-07    0
       2016-03-08    0
       2016-03-09    0
       2016-03-10    0
       2016-03-11    0
       2016-03-12    0
...
...

print (df2.index.get_level_values('dates').month)
[2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5
 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7
 7 7 7 7 7 7 7 7]

df3 = df2.groupby([df2.index.get_level_values('count'), 
                   df2.index.get_level_values('dates').month]).size().unstack()
print (df3)
         2     3     4     5     6     7
count                                   
0      2.0  31.0  30.0   NaN   NaN   NaN
1      NaN   NaN   NaN  14.0  30.0  12.0

price_per_hour = df.price / df3.sum(axis=1)
print (price_per_hour)
0     9523.809524
1    12500.000000
dtype: float64

print (df3.mul(price_per_hour, axis=0))
                  2              3              4         5         6  \
count                                                                   
0      19047.619048  295238.095238  285714.285714       NaN       NaN   
1               NaN            NaN            NaN  175000.0  375000.0   
              7  
count            
0           NaN  
1      150000.0  

print (df3.mul(price_per_hour, axis=0).sum(axis=1))
count
0    600000.0
1    700000.0
dtype: float64
