How to group a dataframe by hour using Pandas timestamps

Question

I have the following dataframe structure, indexed by a Unix timestamp:

              neg    neu     norm       pol    pos  date
time
1520353341  0.000  1.000   0.0000  0.000000  0.000
1520353342  0.121  0.879  -0.2960  0.347851  0.000
1520353342  0.217  0.783  -0.6124  0.465833  0.000

I create a date from the timestamp:

import datetime

data_frame['date'] = [datetime.datetime.fromtimestamp(d) for d in data_frame.time]

Result:

              neg    neu     norm       pol    pos                 date
time
1520353341  0.000  1.000   0.0000  0.000000  0.000  2018-03-06 10:22:21
1520353342  0.121  0.879  -0.2960  0.347851  0.000  2018-03-06 10:22:22
1520353342  0.217  0.783  -0.6124  0.465833  0.000  2018-03-06 10:22:22

I want to group by hour and get the mean of every column. The exception is the timestamp, which should be the start of the hour that each group covers. This is the result I want to achieve:

                 neg       neu      norm       pol       pos
time
1520352000  0.027989  0.893233  0.122535  0.221079  0.078779
1520355600  0.028861  0.899321  0.103698  0.209353  0.071811

The closest I have gotten so far is with this answer:

data = data.groupby(data.date.dt.hour).mean()

Results:

           neg       neu      norm       pol       pos
date
0     0.027989  0.893233  0.122535  0.221079  0.078779
1     0.028861  0.899321  0.103698  0.209353  0.071811

But I can't figure out how to keep a timestamp that reflects the hour where each group started.

Answer 1

I came across this gem, pd.DataFrame.resample, after I posted my round-to-hour solution.

import pandas as pd

# Construct example dataframe with a DatetimeIndex
times = pd.date_range('1/1/2018', periods=5, freq='25min')
values = [4,8,3,4,1]
df = pd.DataFrame({'val':values}, index=times)

# Resample by hour and calculate medians
# (recent pandas releases prefer the lowercase alias 'h' over 'H')
df.resample('H').median()

Or you can use groupby with Grouper if you don't want times as index:

df = pd.DataFrame({'val':values, 'times':times})
# 'times' is a column here, so pass key= rather than level=
df.groupby(pd.Grouper(key='times', freq='H')).median()
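
Applied to a frame shaped like the one in the question, the resample approach might look like the sketch below. The sample values are made up for illustration, only two of the columns are included, and the conversion back to epoch seconds is an assumption about the output the asker wants. The rule '60min' is used in place of the hourly alias so the same string works across pandas versions.

```python
import pandas as pd

# Hypothetical sample shaped like the question's frame:
# epoch seconds as the index, numeric scores as columns.
data_frame = pd.DataFrame(
    {"neg": [0.0, 0.2, 0.4], "pos": [1.0, 0.5, 0.0]},
    index=pd.Index([1520353341, 1520353342, 1520356000], name="time"),
)

# Convert the epoch-second index to datetimes so resample can bin it.
data_frame.index = pd.to_datetime(data_frame.index, unit="s")

# Average every column within each hourly bin; each bin is labelled by its start.
hourly = data_frame.resample("60min").mean()

# Convert the bin labels back to epoch seconds.
hourly.index = (hourly.index - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
hourly.index.name = "time"

print(hourly)
```

Note that resample also emits a row for every empty hour between the first and last timestamp, and mean() fills those rows with NaN, so a dropna() may be needed on sparse data.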

Answer 2

You can round the timestamp column down to the start of its hour:

import math
df.time = [math.floor(t/3600) * 3600 for t in df.time]

Or even simpler, using integer division:

df.time = [(t//3600) * 3600 for t in df.time]

You can then group by this column and thus preserve the timestamp.
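
A minimal sketch of the full round trip, assuming time is an ordinary column of epoch seconds (in the question it is the index, in which case df.index would take the place of df["time"]). The sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample: epoch seconds in a regular "time" column.
df = pd.DataFrame({
    "time": [1520353341, 1520353342, 1520356000],
    "neg": [0.0, 0.2, 0.4],
    "pos": [1.0, 0.5, 0.0],
})

# Floor each timestamp to the start of its hour (vectorized integer division).
hour_start = (df["time"] // 3600) * 3600

# Group the remaining columns on the floored values;
# the group keys become the new "time" index.
hourly = df.drop(columns="time").groupby(hour_start).mean()

print(hourly)
```

Unlike resample, this only produces rows for hours that actually contain data.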

Answer 3

Did you try creating an hour column like this:

data_frame['hour'] = data_frame.date.dt.hour

Then grouping by hour like:

data = data_frame.groupby(data_frame.hour).mean()
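
One caveat with grouping on dt.hour: the keys are clock hours from 0 to 23, so rows from the same hour on different days fall into one group and the starting timestamp is lost. The sketch below contrasts that with grouping on the date floored to the hour. It swaps in dt.floor, which this answer does not use, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same clock hour on two different days.
data_frame = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-03-06 10:22:21",
        "2018-03-06 10:22:22",
        "2018-03-07 10:05:00",
    ]),
    "neg": [0.0, 0.2, 0.4],
})

# Grouping by clock hour merges both days into the single key 10.
by_clock_hour = data_frame.groupby(data_frame["date"].dt.hour)["neg"].mean()

# Grouping by the floored datetime keeps one group per actual hour,
# keyed by the moment that hour started.
by_hour_start = data_frame.groupby(data_frame["date"].dt.floor("60min"))["neg"].mean()

print(by_clock_hour)
print(by_hour_start)
```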
