R - Calculate average using time conditions and also other conditions on different column

I have data with a timestamp, a category and a data value, as shown below (the real data has more than 2000 rows).

Timestamp   category    data  
7/16/2017 18:04 x   4.9  
7/16/2017 18:18 y   4.7  
7/16/2017 18:32 x   8.2  
7/16/2017 18:46 x   2.2  
7/16/2017 19:00 y   2.7  
7/16/2017 19:14 y   3.8  
7/16/2017 19:28 x   8.0  
7/16/2017 19:42 x   7.3  
7/16/2017 19:56 z   10.1  
7/16/2017 20:10 z   5.4  
7/16/2017 20:42 x   17.5  
7/16/2017 20:56 x   6.3  
7/16/2017 21:10 z   5.8  
7/16/2017 21:24 x   0.6  
7/16/2017 21:38 z   2.2  
7/16/2017 21:52 z   2.9  
7/16/2017 22:06 y   0.5  
7/16/2017 22:20 x   5.1  
7/16/2017 22:34 z   8.0  
7/16/2017 22:48 z   3.6  

I want to calculate the average and sd of my data subject to two conditions. The average and sd have to be calculated for every 2 hours, and they have to be calculated separately for the x, y and z categories.

The final data is supposed to look something like this:

Timestamp   category    data_avg    data_sd  
7/16/2017 18:00 x         
7/16/2017 20:00 x         
7/16/2017 22:00 x         
7/17/2017 0:00  x 

Timestamp   category    data_avg    data_sd  
7/16/2017 18:00 y       
7/16/2017 20:00 y       
7/16/2017 22:00 y         
7/17/2017 0:00  y     

Timestamp   category    data_avg    data_sd  
7/16/2017 18:00 z         
7/16/2017 20:00 z         
7/16/2017 22:00 z         
7/17/2017 0:00  z       

I tried filtering and aggregating using the following command:

df<- aggregate(list(avgdata = df$data), 
                   list(hourofday = cut(df$Timestamp, "1 hour")), 
                   mean)  

But it's not working. It is missing many data points, and it doesn't give the mean and sd in the same df.

Please help.

Your Timestamp column is in a format that is not easy to work with in R, so I first convert it to a date-time variable with as.POSIXlt.

df$Timestamp <- as.POSIXlt(df$Timestamp, format = "%m/%d/%Y %H:%M")

head(df)
#             Timestamp category data
# 1 2017-07-16 18:04:00        x  4.9
# 2 2017-07-16 18:18:00        y  4.7
# 3 2017-07-16 18:32:00        x  8.2
# 4 2017-07-16 18:46:00        x  2.2
# 5 2017-07-16 19:00:00        y  2.7
# 6 2017-07-16 19:14:00        y  3.8
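
As a side note, here is a minimal sketch of the same conversion using as.POSIXct, which stores each timestamp as a single number and is often easier to keep in a data frame column than POSIXlt. The object name ts_demo and the tz = "UTC" argument are assumptions for illustration; use whatever time zone your data was recorded in.

```r
# Hypothetical sample of the question's timestamps (names are illustrative)
ts_demo <- data.frame(
  Timestamp = c("7/16/2017 18:04", "7/16/2017 18:18", "7/16/2017 20:10"),
  stringsAsFactors = FALSE
)

# Parse month/day/year hour:minute; tz = "UTC" is an assumption
ts_demo$Timestamp <- as.POSIXct(ts_demo$Timestamp,
                                format = "%m/%d/%Y %H:%M", tz = "UTC")

# Strings that fail to parse become NA, so this is a quick sanity check
n_failed <- sum(is.na(ts_demo$Timestamp))

# cut() assigns each timestamp to a 2-hour bin, starting at the first full hour
bins <- cut(ts_demo$Timestamp, "2 hours")
table(bins)
```

If n_failed is greater than zero, the format string probably does not match some of the rows, which is worth checking before aggregating.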

After this, the aggregate function works with the appropriate arguments. I changed the interval to 2 hours, added category to the list of variables to group by, and modified the FUN argument to calculate both mean and sd.

aggregate(list(avgdata = df$data), 
          list(hourofday = cut(df$Timestamp, "2 hour"), 
               category = df$category), 
          FUN = function(x) c(data_avg = mean(x), data_sd = sd(x)))

# sd() returns NA for groups that contain a single observation
#             hourofday category avgdata.data_avg avgdata.data_sd
# 1 2017-07-16 18:00:00        x         6.120000        2.554799
# 2 2017-07-16 20:00:00        x         8.133333        8.597868
# 3 2017-07-16 22:00:00        x         5.100000              NA
# 4 2017-07-16 18:00:00        y         3.733333        1.001665
# 5 2017-07-16 22:00:00        y         0.500000              NA
# 6 2017-07-16 18:00:00        z        10.100000              NA
# 7 2017-07-16 20:00:00        z         4.075000        1.791415
# 8 2017-07-16 22:00:00        z         5.800000        3.111270
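
When FUN returns more than one value, aggregate() stores the results in a single matrix column (avgdata here), which is why the printed names look like avgdata.data_avg. Below is a minimal sketch of one way to turn that matrix into ordinary columns with do.call(data.frame, ...). The objects demo, res and flat are hypothetical and use only the first six rows of the question's data, with tz = "UTC" assumed.

```r
# Hypothetical subset of the question's data (first six rows)
demo <- data.frame(
  Timestamp = as.POSIXct(c("2017-07-16 18:04", "2017-07-16 18:18",
                           "2017-07-16 18:32", "2017-07-16 18:46",
                           "2017-07-16 19:00", "2017-07-16 19:14"),
                         tz = "UTC"),
  category = c("x", "y", "x", "x", "y", "y"),
  data = c(4.9, 4.7, 8.2, 2.2, 2.7, 3.8)
)

# Same call shape as above: mean and sd per 2-hour bin and category
res <- aggregate(list(avgdata = demo$data),
                 list(hourofday = cut(demo$Timestamp, "2 hours"),
                      category = demo$category),
                 FUN = function(x) c(data_avg = mean(x), data_sd = sd(x)))

# avgdata is one matrix column holding both statistics
is.matrix(res$avgdata)

# Expand the matrix into two regular columns
flat <- do.call(data.frame, res)
names(flat)
```

After flattening, flat$avgdata.data_avg and flat$avgdata.data_sd can be used like any other numeric column, for example when filtering or plotting.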

Another option is to use dplyr and lubridate:

library(dplyr)
library(lubridate)

df = structure(list(Timestamp = c("7/16/2017 18:04", "7/16/2017 18:18", 
"7/16/2017 18:32", "7/16/2017 18:46", "7/16/2017 19:00", "7/16/2017 19:14", 
"7/16/2017 19:28", "7/16/2017 19:42", "7/16/2017 19:56", "7/16/2017 20:10", 
"7/16/2017 20:42", "7/16/2017 20:56", "7/16/2017 21:10", "7/16/2017 21:24", 
"7/16/2017 21:38", "7/16/2017 21:52", "7/16/2017 22:06", "7/16/2017 22:20", 
"7/16/2017 22:34", "7/16/2017 22:48"), Category = c("x", "y", 
"x", "x", "y", "y", "x", "x", "z", "z", "x", "x", "z", "x", "z", 
"z", "y", "x", "z", "z"), data = c(4.9, 4.7, 8.2, 2.2, 2.7, 3.8, 
8, 7.3, 10.1, 5.4, 17.5, 6.3, 5.8, 0.6, 2.2, 2.9, 0.5, 5.1, 8, 
3.6)), .Names = c("Timestamp", "Category", "data"), class = "data.frame", row.names = c(NA, -20L))


df %>%
  mutate(Timestamp = mdy_hm(Timestamp),                   # update to a datetime variable (if needed)
         TimeDiff = difftime(Timestamp, min(Timestamp), units = "hours"),  # get the distance from the first timestamp of the dataset (in hours)
         TimeGroup = as.numeric(TimeDiff) %/% 2) %>%      # create a grouping variable based on the distance
  group_by(TimeGroup, Category) %>%                       # for each group and category
  summarise(Category_MinTime = min(Timestamp),            # get the first time stamp for this category in this group
            data_avg = mean(data),                        # get average
            data_sd = sd(data),                           # get sd
            NumObs = n()) %>%                             # get number of observations (might be useful)
  mutate(TimeGroup_MinTime = min(Category_MinTime)) %>%   # get first time stamp of that time group
  ungroup() %>%                                           # forget the grouping
  select(TimeGroup, TimeGroup_MinTime, everything())      # re arrange columns


# # A tibble: 8 x 7
#   TimeGroup   TimeGroup_MinTime Category    Category_MinTime  data_avg  data_sd NumObs
#       <dbl>              <dttm>    <chr>              <dttm>     <dbl>    <dbl>  <int>
# 1         0 2017-07-16 18:04:00        x 2017-07-16 18:04:00  6.120000 2.554799      5
# 2         0 2017-07-16 18:04:00        y 2017-07-16 18:18:00  3.733333 1.001665      3
# 3         0 2017-07-16 18:04:00        z 2017-07-16 19:56:00 10.100000      NaN      1
# 4         1 2017-07-16 20:10:00        x 2017-07-16 20:42:00  8.133333 8.597868      3
# 5         1 2017-07-16 20:10:00        z 2017-07-16 20:10:00  4.075000 1.791415      4
# 6         2 2017-07-16 22:06:00        x 2017-07-16 22:20:00  5.100000      NaN      1
# 7         2 2017-07-16 22:06:00        y 2017-07-16 22:06:00  0.500000      NaN      1
# 8         2 2017-07-16 22:06:00        z 2017-07-16 22:34:00  5.800000 3.111270      2
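
Note that TimeGroup in the code above is measured from the first timestamp in the data (18:04), so the windows start at 18:04, 20:04 and so on. If the windows should instead line up with clock hours (18:00, 20:00, ...), as in the desired output, one possible sketch is to round each timestamp down with lubridate's floor_date(). The objects demo and binned and the column TimeBin are hypothetical names, the data is a small subset of the question's values, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(lubridate)

# Hypothetical subset of the question's data; column names follow the code above
demo <- data.frame(
  Timestamp = c("7/16/2017 18:04", "7/16/2017 19:56", "7/16/2017 20:10",
                "7/16/2017 21:10", "7/16/2017 22:34", "7/16/2017 22:48"),
  Category = c("x", "z", "z", "z", "z", "z"),
  data = c(4.9, 10.1, 5.4, 5.8, 8.0, 3.6),
  stringsAsFactors = FALSE
)

binned <- demo %>%
  mutate(Timestamp = mdy_hm(Timestamp),                   # parse as UTC date-times
         TimeBin = floor_date(Timestamp, "2 hours")) %>%  # round down to 18:00, 20:00, ...
  group_by(Category, TimeBin) %>%                         # one group per category and bin
  summarise(data_avg = mean(data),                        # average per group
            data_sd = sd(data),                           # NA when a group has one row
            NumObs = n(),                                 # number of observations
            .groups = "drop")                             # return an ungrouped tibble

binned
```

Bins that contain no rows for a category simply do not appear in the result; they would have to be added separately if empty windows are needed.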
