
Time series: Mean per hour per day per Id number

I am a fairly new programmer learning Python (and pandas), and I hope I can explain this well enough. I have a large time series pandas DataFrame of over 3 million rows, initially with 12 columns, spanning a number of years. It covers people taking a ticket at different locations, denoted by Id numbers (350 of them). Each row is one instance (one ticket taken). I have searched many questions, such as counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable. I want the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), and per station.
I have the following, with the datetime set as the index:

    Id          Start_date  Count  Day_name_no
    149 2011-12-31 21:30:00      1            5  
    150 2011-12-31 20:51:00      1            0  
    259 2011-12-31 20:48:00      1            1  
    3015 2011-12-31 19:38:00     1            4  
    28 2011-12-31 19:37:00       1            4  

Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.

My alternative approach is to split the hour out of the date, which gives the following:

    Id  Count  Day_name_no  Trip_hour
    149      1            2         5
    150      1            4         10
    153      1            2         15
    1867     1            4         11
    2387     1            2         7

I then get the count first with:

    Count_Item = TestFreq.groupby(['Id', 'Day_name_no', 'Trip_hour']).count().reset_index()

     Id Day_name_no Trip_hour   Count
     1  0           7          24
     1  0           8          48
     1  0           9          31
     1  0           10         28
     1  0           11         26
     1  0           12         25

Then I use groupby and mean:

    Mean_Count = Count_Item.groupby(['Id', 'Day_name_no', 'Trip_hour']).mean().reset_index()

However, this does not give the desired result, as the mean values are incorrect. I hope I have explained the issue clearly. I am looking for the mean per hour per day per Id because I plan to cluster my dataset into groups before applying a predictive model to those groups.

Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.

Thanks in advance.

I have edited this to try to make it a little clearer. Writing a question on too little sleep is probably not advisable. A toy dataset that I start with:

    Date        Id     Dow Hour Count
    12/12/2014  1234    0   9   1
    12/12/2014  1234    0   9   1
    12/12/2014  1234    0   9   1
    12/12/2014  1234    0   9   1
    12/12/2014  1234    0   9   1
    19/12/2014  1234    0   9   1
    19/12/2014  1234    0   9   1
    19/12/2014  1234    0   9   1
    26/12/2014  1234    0   10  1
    27/12/2014  1234    1   11  1
    27/12/2014  1234    1   11  1
    27/12/2014  1234    1   11  1
    27/12/2014  1234    1   11  1
    04/01/2015  1234    1   11  1

I now realise I would have to use the date first and get something like:

    Date         Id    Dow Hour Count
    12/12/2014  1234    0   9   5
    19/12/2014  1234    0   9   3
    26/12/2014  1234    0   10  1
    27/12/2014  1234    1   11  4
    04/01/2015  1234    1   11  1

Then I would calculate the mean per Id, per Dow, per hour, to get this:

    Id    Dow  Hour  Mean
    1234    0     9  4
    1234    0    10  1
    1234    1    11  2.5

I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows and contains 350 Id numbers.

Your question is not very clear, but I hope this helps:

    # move Start_date from the index back into a regular column
    df = df.reset_index()
    # helper columns with date, hour and day of week
    df['date'] = df['Start_date'].dt.date
    df['hour'] = df['Start_date'].dt.hour
    df['dow'] = df['Start_date'].dt.dayofweek
    # total tickets per Id, calendar date and hour (sum only the Count column)
    daily = df.groupby(['Id', 'date', 'dow', 'hour'])['Count'].sum()
    # mean of those daily totals over all dates
    result = daily.reset_index().groupby(['Id', 'dow', 'hour'])['Count'].mean()
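As a check, the same two-step idea can be run on the toy dataset from the question. This is only a sketch: the frame name `toy` and the intermediate names `per_day` and `mean_per_hour` are made up for illustration, and the `Dow` and `Hour` columns are taken from the question's table rather than derived from the dates.

```python
import pandas as pd

# Toy data from the question: one row per ticket taken.
toy = pd.DataFrame({
    "Date": ["12/12/2014"] * 5 + ["19/12/2014"] * 3 + ["26/12/2014"]
            + ["27/12/2014"] * 4 + ["04/01/2015"],
    "Id": [1234] * 14,
    "Dow": [0] * 9 + [1] * 5,
    "Hour": [9] * 8 + [10] + [11] * 5,
    "Count": [1] * 14,
})

# Step 1: total tickets per Id, date, day of week and hour.
per_day = toy.groupby(["Id", "Date", "Dow", "Hour"], as_index=False)["Count"].sum()

# Step 2: average those daily totals over all dates.
mean_per_hour = (
    per_day.groupby(["Id", "Dow", "Hour"], as_index=False)["Count"]
    .mean()
    .rename(columns={"Count": "Mean"})
)
print(mean_per_hour)
```

This gives means of 4, 1 and 2.5, matching the table the question asks for. One caveat: an hour with no tickets on a given date has no row at all, so the mean is taken only over dates that had at least one ticket in that hour.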

You can use the groupby function on the 'Id' column and then use the resample function with how="sum".
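The `how=` argument has since been removed from pandas' `resample`, so here is a rough sketch of what this suggestion might look like with the method-style call instead. The frame `tickets` and its values are invented for illustration, and the lowercase `"h"` hourly alias is assumed (older pandas releases spell it `"H"`).

```python
import pandas as pd

# Hypothetical sample: one row per ticket, as in the question.
tickets = pd.DataFrame({
    "Id": [149, 149, 149, 150],
    "Start_date": pd.to_datetime([
        "2011-12-31 19:05", "2011-12-31 19:40",
        "2011-12-31 21:30", "2011-12-31 20:51",
    ]),
    "Count": 1,
})

# Group by Id, then resample each group's Count into hourly sums.
hourly = (
    tickets.set_index("Start_date")
    .groupby("Id")["Count"]
    .resample("h")
    .sum()
)
print(hourly)
```

Unlike a plain groupby on date and hour, resampling also produces a zero for every empty hour between an Id's first and last ticket, so a mean taken over this result would include those zero hours.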

