Pandas: apply a function to data grouped by day

Question

I have a dataset that looks like this:

date,value1,value2
2016-01-01 00:00:00,3,0
2016-01-01 01:00:00,0,0
2016-01-01 02:00:00,0,0
2016-01-01 03:00:00,0,0
2016-01-01 04:00:00,0,0
2016-01-01 05:00:00,0,0
2016-01-01 06:00:00,0,0
2016-01-01 07:00:00,0,2
2016-01-01 08:00:00,3,11
2016-01-01 09:00:00,14,14
2016-01-01 10:00:00,12,13
2016-01-01 11:00:00,11,13
2016-01-01 12:00:00,11,9
2016-01-01 13:00:00,17,21
2016-01-01 14:00:00,9,22
2016-01-01 15:00:00,10,9
2016-01-01 16:00:00,11,9
2016-01-01 17:00:00,8,8
2016-01-01 18:00:00,4,2
2016-01-01 19:00:00,5,7
2016-01-01 20:00:00,5,5
2016-01-01 21:00:00,3,4
2016-01-01 22:00:00,2,4
2016-01-01 23:00:00,2,4
2016-01-02 00:00:00,0,0
2016-01-02 01:00:00,0,0
2016-01-02 02:00:00,0,0
2016-01-02 03:00:00,0,0
2016-01-02 04:00:00,0,0
2016-01-02 05:00:00,0,0
2016-01-02 06:00:00,1,0
2016-01-02 07:00:00,0,0
2016-01-02 08:00:00,0,0
2016-01-02 09:00:00,0,0
2016-01-02 10:00:00,0,0
2016-01-02 11:00:00,0,0
2016-01-02 12:00:00,0,0
2016-01-02 13:00:00,1,0
2016-01-02 14:00:00,0,0
2016-01-02 15:00:00,0,0
2016-01-02 16:00:00,0,0
2016-01-02 17:00:00,0,0
2016-01-02 18:00:00,0,0
2016-01-02 19:00:00,0,0
2016-01-02 20:00:00,1,0
2016-01-02 21:00:00,0,0
2016-01-02 22:00:00,0,0
2016-01-02 23:00:00,0,0

What I want to do is calculate the RMSE between value1 and value2 per day. So basically, I want to run the function 31 times (once per day), and the input would be the 24 entries of the day (one every hour). I tried using

rmse(df.groupby([df.index.day]).mean().value1, 
    df.groupby([df.index.day]).mean().value2)

but it gave me a single value, and what I want is a list with the RMSE of each day, such as

daily_rmse = [rmse01_01, rmse01_02, ..., rmse01_31]

Answer 1

You do not need to keep redoing the groupby, and you need to compute the RMSE on each group, not on the sequence of means:

gb = df.groupby(df.index.date)
mean_by_day = gb.mean()
rmse_by_day = gb.std(ddof=0)

I suspect that the RMSE formula you are applying is exactly equivalent to the standard deviation normalized by the number of elements (not the number of elements - 1, as is the default in Pandas).
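To make that relationship concrete, here is a minimal sketch on made-up values (the data and the names pop_std_by_day, rms_dev_value1, and rmse_between are illustrative assumptions, not taken from the question). It suggests that std(ddof=0) matches the root-mean-square deviation of a column from its own daily mean, which is not necessarily the same number as the RMSE between value1 and value2.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: three hourly readings on each of two days.
idx = pd.to_datetime([
    "2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00",
    "2016-01-02 00:00", "2016-01-02 01:00", "2016-01-02 02:00",
])
df = pd.DataFrame({"value1": [3, 0, 6, 1, 1, 4],
                   "value2": [3, 0, 6, 2, 0, 4]}, index=idx)

gb = df.groupby(df.index.date)

# Population standard deviation (divide by N) of each column, per day.
pop_std_by_day = gb.std(ddof=0)

# The same quantity written out by hand for value1:
# RMS deviation of the column from its own daily mean.
rms_dev_value1 = gb["value1"].apply(
    lambda s: np.sqrt(((s - s.mean()) ** 2).mean()))

# RMSE between the two columns, per day. On the first day the columns
# are identical, so this is 0 even though the standard deviation is not.
sq_err = (df["value1"] - df["value2"]) ** 2
rmse_between = sq_err.groupby(df.index.date).mean() ** 0.5

print(pop_std_by_day)
print(rmse_between)
```

If the quantity you are after is the error between the two columns, the last expression is probably the one to use.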

You should now be able to access mean_by_day.value1 and rmse_by_day.value1 to get the values that you want.

The value I get for mean_by_day is

              value1    value2
2016-01-01  5.416667  6.541667
2016-01-02  0.125000  0.000000

Similarly, for rmse_by_day I get

              value1    value2
2016-01-01  5.139039  6.422481
2016-01-02  0.330719  0.000000

Note that the date field of the index is used rather than day, which could be repeated if your data went on for multiple months.
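A small sketch of the difference, using a made-up series that covers the 1st of January and the 1st of February (the values and the names by_day and by_date are illustrative assumptions):

```python
import pandas as pd

# Hypothetical index spanning two months.
idx = pd.to_datetime(["2016-01-01 00:00", "2016-01-01 12:00",
                      "2016-02-01 00:00", "2016-02-01 12:00"])
s = pd.Series([1.0, 3.0, 10.0, 30.0], index=idx)

# Grouping by day-of-month merges January 1st and February 1st.
by_day = s.groupby(s.index.day).mean()

# Grouping by calendar date keeps them apart.
by_date = s.groupby(s.index.date).mean()

print(len(by_day), len(by_date))
```

With data that stays within a single month, both keys should produce the same groups.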

Answer 2

Use sklearn's mean_squared_error (this assumes date is a datetime column rather than the index):

from sklearn.metrics import mean_squared_error

df.groupby(df.date.dt.date).apply(
    lambda x: mean_squared_error(x.value1, x.value2) ** .5)

date
2016-01-01    3.494043
2016-01-02    0.377964
dtype: float64
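If you would rather not depend on sklearn, the same per-day RMSE can be sketched with pandas alone. The rows below are a made-up four-line sample, and sq_err and daily_rmse_list are illustrative names; as above, this assumes date is parsed as a datetime column.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the question's data.
csv = io.StringIO("""date,value1,value2
2016-01-01 00:00:00,3,0
2016-01-01 01:00:00,14,14
2016-01-02 00:00:00,1,0
2016-01-02 01:00:00,0,0
""")
df = pd.read_csv(csv, parse_dates=["date"])

# Squared error per row, then mean per calendar date, then square root.
sq_err = (df["value1"] - df["value2"]) ** 2
daily_rmse = sq_err.groupby(df["date"].dt.date).mean() ** 0.5

# A plain list with one RMSE per day, as asked for in the question.
daily_rmse_list = daily_rmse.tolist()
print(daily_rmse_list)
```

This avoids apply with a Python lambda, which may matter on larger datasets.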


Answer 1: 1 2017-04-19 18:50:31
Answer 2: 1 Accepted 2017-04-19 19:49:04
