R - Calculating averages using a time condition and other conditions on different columns

Question

I have data with a timestamp, a category, and a data value, as shown below (the real data has more than 2000 rows).

Timestamp        category  data
7/16/2017 18:04  x         4.9
7/16/2017 18:18  y         4.7
7/16/2017 18:32  x         8.2
7/16/2017 18:46  x         2.2
7/16/2017 19:00  y         2.7
7/16/2017 19:14  y         3.8
7/16/2017 19:28  x         8.0
7/16/2017 19:42  x         7.3
7/16/2017 19:56  z         10.1
7/16/2017 20:10  z         5.4
7/16/2017 20:42  x         17.5
7/16/2017 20:56  x         6.3
7/16/2017 21:10  z         5.8
7/16/2017 21:24  x         0.6
7/16/2017 21:38  z         2.2
7/16/2017 21:52  z         2.9
7/16/2017 22:06  y         0.5
7/16/2017 22:20  x         5.1
7/16/2017 22:34  z         8.0
7/16/2017 22:48  z         3.6

I want to calculate the average and sd of my data subject to two conditions. The average and sd have to be calculated for every 2 hours, and they have to be calculated separately for the x, y, and z categories.

The final data is supposed to look something like this:

Timestamp        category  data_avg  data_sd
7/16/2017 18:00  x
7/16/2017 20:00  x
7/16/2017 22:00  x
7/17/2017 0:00   x

Timestamp        category  data_avg  data_sd
7/16/2017 18:00  y
7/16/2017 20:00  y
7/16/2017 22:00  y
7/17/2017 0:00   y

Timestamp        category  data_avg  data_sd
7/16/2017 18:00  z
7/16/2017 20:00  z
7/16/2017 22:00  z
7/17/2017 0:00   z

I tried filtering and aggregating using the following command:

df<- aggregate(list(avgdata = df$data), 
                   list(hourofday = cut(df$Timestamp, "1 hour")), 
                   mean)

But it's not working. It is missing many data points, and it doesn't give the mean and sd in the same df.

Please help.

Answer 1

Your Timestamp column is in a format that is not easy to work with in R, so I first convert it to a date-time variable with as.POSIXlt.

df$Timestamp <- as.POSIXlt(df$Timestamp, format = "%m/%d/%Y %H:%M")

head(df)
#             Timestamp category data
# 1 2017-07-16 18:04:00        x  4.9
# 2 2017-07-16 18:18:00        y  4.7
# 3 2017-07-16 18:32:00        x  8.2
# 4 2017-07-16 18:46:00        x  2.2
# 5 2017-07-16 19:00:00        y  2.7
# 6 2017-07-16 19:14:00        y  3.8
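
As a side note on the conversion step, here is a minimal sketch of the same parsing with as.POSIXct, which is the class usually stored in data frame columns. The three sample strings, the variable names, and tz = "UTC" are assumptions for illustration, not part of the original data handling.

```r
# Small hypothetical sample in the same format as the question's data
ts_raw <- c("7/16/2017 18:04", "7/16/2017 19:56", "7/16/2017 20:10")

# Parse month/day/year hour:minute strings; tz = "UTC" is an assumption
# that keeps the result independent of the local time zone
ts <- as.POSIXct(ts_raw, format = "%m/%d/%Y %H:%M", tz = "UTC")

# A string that does not match the format is parsed as NA rather than
# raising an error, so it is worth checking before aggregating
bad <- as.POSIXct("2017-07-16 18:04", format = "%m/%d/%Y %H:%M", tz = "UTC")
is.na(bad)

# Stop early if any of the real timestamps failed to parse
stopifnot(!anyNA(ts))
```

If rows seem to go missing after aggregation, unparsed (NA) timestamps are one thing worth ruling out first.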

After this, the aggregate function works with the appropriate arguments. I changed the cut interval to 2 hours, added category to the list of variables to group by, and modified the FUN argument to calculate both mean and sd.

# data_sd is NA for groups that contain a single observation
aggregate(list(avgdata = df$data), 
          list(hourofday = cut(df$Timestamp, "2 hour"), 
               category = df$category), 
          FUN = function(x) c(data_avg = mean(x), data_sd = sd(x)))

#             hourofday category avgdata.data_avg avgdata.data_sd
# 1 2017-07-16 18:00:00        x         6.120000        2.554799
# 2 2017-07-16 20:00:00        x         8.133333        8.597868
# 3 2017-07-16 22:00:00        x         5.100000              NA
# 4 2017-07-16 18:00:00        y         3.733333        1.001665
# 5 2017-07-16 22:00:00        y         0.500000              NA
# 6 2017-07-16 18:00:00        z        10.100000              NA
# 7 2017-07-16 20:00:00        z         4.075000        1.791415
# 8 2017-07-16 22:00:00        z         5.800000        3.111270
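
One thing to be aware of: when FUN returns a vector, aggregate stores the results as a single matrix column, even though they print like two separate columns. The sketch below rebuilds the question's sample data and expands that matrix into ordinary columns with do.call(data.frame, ...). The names res, flat, and stats, and the use of as.POSIXct with tz = "UTC", are assumptions made for this illustration.

```r
# Rebuild the sample data from the question
df <- data.frame(
  Timestamp = as.POSIXct(
    c("7/16/2017 18:04", "7/16/2017 18:18", "7/16/2017 18:32", "7/16/2017 18:46",
      "7/16/2017 19:00", "7/16/2017 19:14", "7/16/2017 19:28", "7/16/2017 19:42",
      "7/16/2017 19:56", "7/16/2017 20:10", "7/16/2017 20:42", "7/16/2017 20:56",
      "7/16/2017 21:10", "7/16/2017 21:24", "7/16/2017 21:38", "7/16/2017 21:52",
      "7/16/2017 22:06", "7/16/2017 22:20", "7/16/2017 22:34", "7/16/2017 22:48"),
    format = "%m/%d/%Y %H:%M", tz = "UTC"),
  category = c("x", "y", "x", "x", "y", "y", "x", "x", "z", "z",
               "x", "x", "z", "x", "z", "z", "y", "x", "z", "z"),
  data = c(4.9, 4.7, 8.2, 2.2, 2.7, 3.8, 8, 7.3, 10.1, 5.4,
           17.5, 6.3, 5.8, 0.6, 2.2, 2.9, 0.5, 5.1, 8, 3.6)
)

# Mean and sd per 2-hour bin and category
res <- aggregate(list(stats = df$data),
                 list(hourofday = cut(df$Timestamp, "2 hours"),
                      category = df$category),
                 FUN = function(x) c(data_avg = mean(x), data_sd = sd(x)))

# res has three columns; res$stats is a matrix with two named columns
is.matrix(res$stats)

# Expand the matrix column into regular data frame columns
flat <- do.call(data.frame, res)
names(flat)
```

After flattening, the statistics are available as flat$stats.data_avg and flat$stats.data_sd, which is easier to filter, sort, or write to a file.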

Answer 2

library(dplyr)
library(lubridate)

df = structure(list(Timestamp = c("7/16/2017 18:04", "7/16/2017 18:18", 
"7/16/2017 18:32", "7/16/2017 18:46", "7/16/2017 19:00", "7/16/2017 19:14", 
"7/16/2017 19:28", "7/16/2017 19:42", "7/16/2017 19:56", "7/16/2017 20:10", 
"7/16/2017 20:42", "7/16/2017 20:56", "7/16/2017 21:10", "7/16/2017 21:24", 
"7/16/2017 21:38", "7/16/2017 21:52", "7/16/2017 22:06", "7/16/2017 22:20", 
"7/16/2017 22:34", "7/16/2017 22:48"), Category = c("x", "y", 
"x", "x", "y", "y", "x", "x", "z", "z", "x", "x", "z", "x", "z", 
"z", "y", "x", "z", "z"), data = c(4.9, 4.7, 8.2, 2.2, 2.7, 3.8, 
8, 7.3, 10.1, 5.4, 17.5, 6.3, 5.8, 0.6, 2.2, 2.9, 0.5, 5.1, 8, 
3.6)), .Names = c("Timestamp", "Category", "data"), class = "data.frame", row.names = c(NA, -20L))


df %>%
  mutate(Timestamp = mdy_hm(Timestamp),                   # update to a datetime variable (if needed)
         TimeDiff = difftime(Timestamp, min(Timestamp), units = "hours"),  # get the distance from the first timestamp of the dataset (in hours)
         TimeGroup = as.numeric(TimeDiff) %/% 2) %>%      # create a grouping variable based on the distance
  group_by(TimeGroup, Category) %>%                       # for each group and category
  summarise(Category_MinTime = min(Timestamp),            # get the first time stamp for this category in this group
            data_avg = mean(data),                        # get average
            data_sd = sd(data),                           # get sd
            NumObs = n()) %>%                             # get number of observations (might be useful)
  mutate(TimeGroup_MinTime = min(Category_MinTime)) %>%   # get first time stamp of that time group
  ungroup() %>%                                           # forget the grouping
  select(TimeGroup, TimeGroup_MinTime, everything())      # re arrange columns


# # A tibble: 8 x 7
#   TimeGroup   TimeGroup_MinTime Category    Category_MinTime  data_avg  data_sd NumObs
#       <dbl>              <dttm>    <chr>              <dttm>     <dbl>    <dbl>  <int>
# 1         0 2017-07-16 18:04:00        x 2017-07-16 18:04:00  6.120000 2.554799      5
# 2         0 2017-07-16 18:04:00        y 2017-07-16 18:18:00  3.733333 1.001665      3
# 3         0 2017-07-16 18:04:00        z 2017-07-16 19:56:00 10.100000      NaN      1
# 4         1 2017-07-16 20:10:00        x 2017-07-16 20:42:00  8.133333 8.597868      3
# 5         1 2017-07-16 20:10:00        z 2017-07-16 20:10:00  4.075000 1.791415      4
# 6         2 2017-07-16 22:06:00        x 2017-07-16 22:20:00  5.100000      NaN      1
# 7         2 2017-07-16 22:06:00        y 2017-07-16 22:06:00  0.500000      NaN      1
# 8         2 2017-07-16 22:06:00        z 2017-07-16 22:34:00  5.800000 3.111270      2
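
Note that the grouping above measures time from the first observation, so the groups start at 18:04 rather than at the clock-aligned 18:00 shown in the question's desired output. A minimal base R sketch of clock-aligned bin labels follows. bin_start is a hypothetical helper name, and the approach assumes UTC timestamps, where 2-hour boundaries counted from the epoch fall on even clock hours.

```r
# Hypothetical helper: floor a POSIXct vector to the start of its bin.
# Assumes UTC; in other time zones the boundaries may not land on the
# clock hours you expect.
bin_start <- function(ts, width_secs = 2 * 60 * 60) {
  as.POSIXct(floor(as.numeric(ts) / width_secs) * width_secs,
             origin = "1970-01-01", tz = "UTC")
}

# Three timestamps taken from the sample data
ts <- as.POSIXct(c("7/16/2017 18:04", "7/16/2017 19:56", "7/16/2017 20:10"),
                 format = "%m/%d/%Y %H:%M", tz = "UTC")

labels <- format(bin_start(ts), "%Y-%m-%d %H:%M")
labels
# "2017-07-16 18:00" "2017-07-16 18:00" "2017-07-16 20:00"
```

A column built this way could replace TimeGroup as the grouping variable if clock-aligned labels are preferred.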

Answer 1: score 2, accepted, 2017-12-12 11:54:10
Answer 2: score 1, 2017-12-12 11:58:54
