
Python - how to multiply daily values in a dataframe with hourly percentages in a dictionary to get a dataframe with hourly values

I have a dataframe of daily transit ridership data for each station of a city, and I also have a dictionary with the hourly ridership distribution in percentages.

I would like to create a dataframe of hourly transit ridership for each station by multiplying the daily ridership values in the dataframe by the hourly percentages in the dictionary.

For instance, the dataframe looks as follows:

            Austin-Forest Park  Harlem-Lake
date
2018-11-01              2248.0       4021.0
2018-11-02              1983.0       3850.0
2018-11-03               837.0       2308.0
2018-11-04               604.0       1443.0

And the hourly ridership distribution looks like this, with each key/value pair being an hour of the day and that hour's share of daily ridership (expressed as a fraction):

hourly_distribution = {0:0.017, 1:0.017, 2:0.008, 3:0.008, 4:0.004, 
                          5:0.004, 6:0.008, 7:0.021, 8:0.051, 9:0.042,
                          10:0.042, 11:0.038, 12:0.034, 13:0.038, 14:0.051, 
                          15:0.068, 16:0.084, 17:0.11, 18:0.101, 19:0.084,
                          20:0.059, 21:0.051, 22:0.034, 23:0.025}


hourly_distribution_weekend_days = {0:0.015, 1:0.015, 2:0.008, 3:0.008,4:0.008, 5:0.008, 
                         6:0.015, 7:0.023, 8:0.038, 9:0.046, 10:0.054, 
                         11:0.077, 12:0.092, 13:0.092, 14:0.092, 15:0.092,
                         16:0.062, 17:0.054, 18:0.054, 19:0.054, 20:0.031, 
                         21:0.031, 22:0.015, 23:0.015}

My expected outcome for Austin-Forest Park on 2018-11-01 would then be:

                    Austin-Forest Park
Date
2018-11-01 00:00:00 38.2
2018-11-01 01:00:00 38.2
2018-11-01 02:00:00 18.0
2018-11-01 03:00:00 18.0
2018-11-01 04:00:00 9.0
2018-11-01 05:00:00 9.0
2018-11-01 06:00:00 18.0
2018-11-01 07:00:00 47.2
2018-11-01 08:00:00 114.6
2018-11-01 09:00:00 94.4
2018-11-01 10:00:00 94.4
2018-11-01 11:00:00 85.4
2018-11-01 12:00:00 76.4
2018-11-01 13:00:00 85.4
2018-11-01 14:00:00 114.6
2018-11-01 15:00:00 152.9
2018-11-01 16:00:00 188.8
2018-11-01 17:00:00 247.3
2018-11-01 18:00:00 227.0
2018-11-01 19:00:00 188.8
2018-11-01 20:00:00 132.6
2018-11-01 21:00:00 114.6
2018-11-01 22:00:00 76.4
2018-11-01 23:00:00 56.2

From this small sample, the expected shape of the new dataframe would then be (96, 2): 2 columns and 4 days x 24 hours of hourly ridership values.

Would anyone have an idea how to write this in Python?

Thank you!

You can use numpy.outer for the product and a list comprehension with pandas.to_datetime to build the new datetime index, as follows:

import pandas as pd
import numpy as np
import datetime

idx = pd.to_datetime(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04'])
df_daily = pd.DataFrame({'Austin-Forest Park': [2248.0, 1983.0, 837.0, 604.0],
                         'Harlem-Lake': [4021.0, 3850.0, 2308.0, 1443.0]},
                         index=idx)
df_daily.index.name = 'date'


hourly_distribution = {0:0.017, 1:0.017, 2:0.008, 3:0.008, 4:0.004,
                          5:0.004, 6:0.008, 7:0.021, 8:0.051, 9:0.042,
                          10:0.042, 11:0.038, 12:0.034, 13:0.038, 14:0.051,
                          15:0.068, 16:0.084, 17:0.11, 18:0.101, 19:0.084,
                          20:0.059, 21:0.051, 22:0.034, 23:0.025}

# Hourly shares in key order (hours 0..23)
distrib = [hourly_distribution[key] for key in hourly_distribution]

# One timestamp per (day, hour) pair, in the same order as the flattened outer product
datetime_idx = pd.to_datetime([datetime.datetime(i.year, i.month, i.day, key) for i in idx for key in hourly_distribution])
# The outer product has shape (n_days, 24); ravel() flattens it day by day
data = np.outer(df_daily['Austin-Forest Park'], distrib).ravel()

df = pd.DataFrame({'Austin-Forest Park': data}, index=datetime_idx)
df.index.name = 'date'

which outputs:

                     Austin-Forest Park
date                                   
2018-11-01 00:00:00              38.216
2018-11-01 01:00:00              38.216
2018-11-01 02:00:00              17.984
2018-11-01 03:00:00              17.984
2018-11-01 04:00:00               8.992
...                                 ...
2018-11-04 19:00:00              50.736
2018-11-04 20:00:00              35.636
2018-11-04 21:00:00              30.804
2018-11-04 22:00:00              20.536
2018-11-04 23:00:00              15.100

[96 rows x 1 columns]
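
The snippet above handles a single column and applies one distribution to every day. As a rough sketch (not part of the original answer), the same idea can be broadcast over all station columns to get the (96, 2) shape from the question. It assumes that hourly_distribution is meant for Monday to Friday and hourly_distribution_weekend_days for Saturday and Sunday, which the question implies but does not state; the names shares, hourly_idx and df_hourly are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04'])
df_daily = pd.DataFrame({'Austin-Forest Park': [2248.0, 1983.0, 837.0, 604.0],
                         'Harlem-Lake': [4021.0, 3850.0, 2308.0, 1443.0]},
                        index=idx)

hourly_distribution = {0: 0.017, 1: 0.017, 2: 0.008, 3: 0.008, 4: 0.004,
                       5: 0.004, 6: 0.008, 7: 0.021, 8: 0.051, 9: 0.042,
                       10: 0.042, 11: 0.038, 12: 0.034, 13: 0.038, 14: 0.051,
                       15: 0.068, 16: 0.084, 17: 0.11, 18: 0.101, 19: 0.084,
                       20: 0.059, 21: 0.051, 22: 0.034, 23: 0.025}
hourly_distribution_weekend_days = {0: 0.015, 1: 0.015, 2: 0.008, 3: 0.008,
                                    4: 0.008, 5: 0.008, 6: 0.015, 7: 0.023,
                                    8: 0.038, 9: 0.046, 10: 0.054, 11: 0.077,
                                    12: 0.092, 13: 0.092, 14: 0.092, 15: 0.092,
                                    16: 0.062, 17: 0.054, 18: 0.054, 19: 0.054,
                                    20: 0.031, 21: 0.031, 22: 0.015, 23: 0.015}

# Shares as arrays ordered by hour 0..23
weekday_share = np.array([hourly_distribution[h] for h in range(24)])
weekend_share = np.array([hourly_distribution_weekend_days[h] for h in range(24)])

# Assumption: dayofweek 5 and 6 (Saturday, Sunday) use the weekend dictionary
is_weekend = df_daily.index.dayofweek.to_numpy() >= 5
shares = np.where(is_weekend[:, None], weekend_share, weekday_share)  # (n_days, 24)

# Each date repeated 24 times, plus the hour offset 0..23
hours = np.tile(np.arange(24), len(df_daily))
hourly_idx = df_daily.index.repeat(24) + pd.to_timedelta(hours, unit='h')

# (n_days, 24, 1) * (n_days, 1, n_cols) -> (n_days, 24, n_cols), then stack days
values = shares[:, :, None] * df_daily.to_numpy()[:, None, :]
df_hourly = pd.DataFrame(values.reshape(-1, df_daily.shape[1]),
                         index=hourly_idx, columns=df_daily.columns)
df_hourly.index.name = 'date'

print(df_hourly.shape)  # (96, 2)
```

Both dictionaries sum to 0.999 rather than 1, so the hourly values for a day add up to slightly less than the daily total.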

Suppose you have the following definition of the daily data:

import pandas as pd

df_daily = pd.Series([2248, 1983, 837, 604], index=pd.date_range(start='2018-11-01', end='2018-11-04'))

You can do:

df_daily = (
    df_daily
        .resample('H', closed='right').ffill()
        .to_frame(name='park')
        .assign(hour=lambda df: df.index.hour)
        .apply(lambda x: hourly_distribution[x['hour']]*x['park'], axis=1)
)

df_daily


Explanation:解释:

  • First, you upsample your data to an hourly basis using resample. .ffill will use the day value for all hours.
  • Then you create a column named hour.
  • You use the hour column to look up the matching percentage in the hourly_distribution dictionary and multiply it by the day's total visitors, which I just called park.
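
One thing to watch with this approach: resample only spans the range from the first to the last timestamp of the daily index, so the final day (2018-11-04) gets a single row at 00:00 and the result has 73 rows instead of 96. A possible variant, sketched here under the assumption that hourly_distribution is the dictionary from the question, builds the hourly index explicitly through 23:00 of the last day and replaces the row-wise apply with map. The names daily, hourly_index, shares and hourly are illustrative only.

```python
import pandas as pd

hourly_distribution = {0: 0.017, 1: 0.017, 2: 0.008, 3: 0.008, 4: 0.004,
                       5: 0.004, 6: 0.008, 7: 0.021, 8: 0.051, 9: 0.042,
                       10: 0.042, 11: 0.038, 12: 0.034, 13: 0.038, 14: 0.051,
                       15: 0.068, 16: 0.084, 17: 0.11, 18: 0.101, 19: 0.084,
                       20: 0.059, 21: 0.051, 22: 0.034, 23: 0.025}

daily = pd.Series([2248, 1983, 837, 604],
                  index=pd.date_range(start='2018-11-01', end='2018-11-04'))

# Hourly index that runs through 23:00 of the last day
hourly_index = pd.date_range(start=daily.index[0],
                             end=daily.index[-1] + pd.Timedelta(hours=23),
                             freq=pd.Timedelta(hours=1))

# Forward-fill each day's total across its 24 hours
daily_repeated = daily.reindex(hourly_index, method='ffill')

# Look up the share for each hour of the day, then multiply elementwise
shares = pd.Series(hourly_index.hour, index=hourly_index).map(hourly_distribution)
hourly = daily_repeated * shares

print(len(hourly))  # 96
```

The elementwise multiplication avoids calling a Python function once per row, which usually matters once the data covers many stations or a long date range.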
