R - Calculate average using time conditions and also other conditions on a different column
I have data with a timestamp, a category, and a data value, as shown below (the real data set has more than 2000 rows).
Timestamp category data
7/16/2017 18:04 x 4.9
7/16/2017 18:18 y 4.7
7/16/2017 18:32 x 8.2
7/16/2017 18:46 x 2.2
7/16/2017 19:00 y 2.7
7/16/2017 19:14 y 3.8
7/16/2017 19:28 x 8.0
7/16/2017 19:42 x 7.3
7/16/2017 19:56 z 10.1
7/16/2017 20:10 z 5.4
7/16/2017 20:42 x 17.5
7/16/2017 20:56 x 6.3
7/16/2017 21:10 z 5.8
7/16/2017 21:24 x 0.6
7/16/2017 21:38 z 2.2
7/16/2017 21:52 z 2.9
7/16/2017 22:06 y 0.5
7/16/2017 22:20 x 5.1
7/16/2017 22:34 z 8.0
7/16/2017 22:48 z 3.6
I want to calculate the average and sd of my data by applying two conditions. The average and sd have to be calculated for every 2 hours. The average and sd have to be calculated separately for the x, y, and z categories.
The final data is supposed to look something like this:
Timestamp category data_avg data_sd
7/16/2017 18:00 x
7/16/2017 20:00 x
7/16/2017 22:00 x
7/17/2017 0:00 x
Timestamp category data_avg data_sd
7/16/2017 18:00 y
7/16/2017 20:00 y
7/16/2017 22:00 y
7/17/2017 0:00 y
Timestamp category data_avg data_sd
7/16/2017 18:00 z
7/16/2017 20:00 z
7/16/2017 22:00 z
7/17/2017 0:00 z
I tried filtering and aggregating using the following command:
df<- aggregate(list(avgdata = df$data),
list(hourofday = cut(df$Timestamp, "1 hour")),
mean)
But it's not working. The result is missing many data points, and it doesn't give the mean and sd in the same df.
Please help.
Your Timestamp column is in a format that is not easy to work with in R. Therefore I first turn it into a datetime variable with as.POSIXlt.
df$Timestamp <- as.POSIXlt(df$Timestamp, format = "%m/%d/%Y %H:%M")
head(df)
# Timestamp category data
# 1 2017-07-16 18:04:00 x 4.9
# 2 2017-07-16 18:18:00 y 4.7
# 3 2017-07-16 18:32:00 x 8.2
# 4 2017-07-16 18:46:00 x 2.2
# 5 2017-07-16 19:00:00 y 2.7
# 6 2017-07-16 19:14:00 y 3.8
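To see why the conversion matters for the grouping step, here is a minimal sketch of how cut() assigns parsed timestamps to two-hour bins. The three sample timestamps are taken from the question; the variable names ts and bins, the use of as.POSIXct, and the tz = "UTC" setting are assumptions made for illustration only.

```r
# Three timestamps from the question, parsed with the same format string.
# tz = "UTC" is an assumption so the result does not depend on the local time zone.
ts <- as.POSIXct(c("7/16/2017 18:04", "7/16/2017 19:56", "7/16/2017 20:10"),
                 format = "%m/%d/%Y %H:%M", tz = "UTC")

# cut() on a date-time vector returns a factor; with "2 hours" the bins start
# at the beginning of the hour that contains the earliest timestamp.
bins <- cut(ts, "2 hours")

# Integer codes of the bins: the first two timestamps share a bin, the third
# falls into the next one.
as.integer(bins)
levels(bins)
```

Because the bins are anchored at the hour of the earliest timestamp, a data set starting at an odd hour would get odd-hour bin boundaries.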
After this, the aggregate function works with the appropriate arguments. I added category to the list of variables to group by and modified the FUN argument to calculate both mean and sd.
aggregate(list(avgdata = df$data),
          list(hourofday = cut(df$Timestamp, "2 hour"),
               category = df$category),
          # return both statistics per group; sd() is NA when a group has a single value
          FUN = function(x) c(data_avg = mean(x), data_sd = sd(x)))
#             hourofday category avgdata.data_avg avgdata.data_sd
# 1 2017-07-16 18:00:00        x         6.120000        2.554799
# 2 2017-07-16 20:00:00        x         8.133333        8.597868
# 3 2017-07-16 22:00:00        x         5.100000              NA
# 4 2017-07-16 18:00:00        y         3.733333        1.001665
# 5 2017-07-16 22:00:00        y         0.500000              NA
# 6 2017-07-16 18:00:00        z        10.100000              NA
# 7 2017-07-16 20:00:00        z         4.075000        1.791415
# 8 2017-07-16 22:00:00        z         5.800000        3.111270
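One thing to be aware of: when FUN returns more than one value, aggregate stores the results in a single matrix column rather than in separate columns. The sketch below shows one common way to flatten that with do.call(data.frame, ...). The three-row df and the names res, flat, and stats are made up for illustration and are not part of the original data or answer.

```r
# Small hypothetical sample (a subset of the question's x rows).
df <- data.frame(
  Timestamp = as.POSIXct(c("7/16/2017 18:04", "7/16/2017 18:32", "7/16/2017 20:42"),
                         format = "%m/%d/%Y %H:%M", tz = "UTC"),
  category = c("x", "x", "x"),
  data = c(4.9, 8.2, 17.5)
)

# FUN returns a named vector of length 2, so the "stats" column is a matrix.
res <- aggregate(list(stats = df$data),
                 list(hourofday = cut(df$Timestamp, "2 hours"),
                      category = df$category),
                 FUN = function(x) c(data_avg = mean(x), data_sd = sd(x)))

# Rebuilding the data frame expands the matrix into ordinary columns.
flat <- do.call(data.frame, res)

ncol(res)    # the matrix counts as one column
names(flat)  # the two statistics are now separate columns
```

After flattening, the statistics can be accessed as flat$stats.data_avg and flat$stats.data_sd like any other column.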
library(dplyr)
library(lubridate)
df = structure(list(Timestamp = c("7/16/2017 18:04", "7/16/2017 18:18",
"7/16/2017 18:32", "7/16/2017 18:46", "7/16/2017 19:00", "7/16/2017 19:14",
"7/16/2017 19:28", "7/16/2017 19:42", "7/16/2017 19:56", "7/16/2017 20:10",
"7/16/2017 20:42", "7/16/2017 20:56", "7/16/2017 21:10", "7/16/2017 21:24",
"7/16/2017 21:38", "7/16/2017 21:52", "7/16/2017 22:06", "7/16/2017 22:20",
"7/16/2017 22:34", "7/16/2017 22:48"), Category = c("x", "y",
"x", "x", "y", "y", "x", "x", "z", "z", "x", "x", "z", "x", "z",
"z", "y", "x", "z", "z"), data = c(4.9, 4.7, 8.2, 2.2, 2.7, 3.8,
8, 7.3, 10.1, 5.4, 17.5, 6.3, 5.8, 0.6, 2.2, 2.9, 0.5, 5.1, 8,
3.6)), .Names = c("Timestamp", "Category", "data"), class = "data.frame", row.names = c(NA, -20L))
df %>%
  mutate(Timestamp = mdy_hm(Timestamp),                                    # update to a datetime variable (if needed)
         TimeDiff = difftime(Timestamp, min(Timestamp), units = "hours"),  # get the distance from the first timestamp of the dataset (in hours)
         TimeGroup = as.numeric(TimeDiff) %/% 2) %>%                       # create a grouping variable based on the distance
  group_by(TimeGroup, Category) %>%                                        # for each group and category
  summarise(Category_MinTime = min(Timestamp),                             # get the first time stamp for this category in this group
            data_avg = mean(data),                                         # get average
            data_sd = sd(data),                                            # get sd
            NumObs = n()) %>%                                              # get number of observations (might be useful)
  mutate(TimeGroup_MinTime = min(Category_MinTime)) %>%                    # get first time stamp of that time group
  ungroup() %>%                                                            # forget the grouping
  select(TimeGroup, TimeGroup_MinTime, everything())                       # re arrange columns
# # A tibble: 8 x 7
# TimeGroup TimeGroup_MinTime Category Category_MinTime data_avg data_sd NumObs
# <dbl> <dttm> <chr> <dttm> <dbl> <dbl> <int>
# 1 0 2017-07-16 18:04:00 x 2017-07-16 18:04:00 6.120000 2.554799 5
# 2 0 2017-07-16 18:04:00 y 2017-07-16 18:18:00 3.733333 1.001665 3
# 3 0 2017-07-16 18:04:00 z 2017-07-16 19:56:00 10.100000 NaN 1
# 4 1 2017-07-16 20:10:00 x 2017-07-16 20:42:00 8.133333 8.597868 3
# 5 1 2017-07-16 20:10:00 z 2017-07-16 20:10:00 4.075000 1.791415 4
# 6 2 2017-07-16 22:06:00 x 2017-07-16 22:20:00 5.100000 NaN 1
# 7 2 2017-07-16 22:06:00 y 2017-07-16 22:06:00 0.500000 NaN 1
# 8 2 2017-07-16 22:06:00 z 2017-07-16 22:34:00 5.800000 3.111270 2
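Note that the time groups above start at the first timestamp in the data (18:04), while the question's desired output uses even clock hours (18:00, 20:00, ...). If clock-aligned bins are wanted, one possible variation is to round each timestamp down with lubridate's floor_date. This is a sketch, not part of the original answer: the three-row df and the name out are hypothetical, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(lubridate)

# Small hypothetical sample (a subset of the question's x rows).
df <- data.frame(
  Timestamp = c("7/16/2017 18:04", "7/16/2017 19:42", "7/16/2017 20:42"),
  Category = c("x", "x", "x"),
  data = c(4.9, 7.3, 17.5)
)

out <- df %>%
  # parse the text, then round down to the nearest even clock hour
  mutate(Timestamp = floor_date(mdy_hm(Timestamp), "2 hours")) %>%
  group_by(Category, Timestamp) %>%
  summarise(data_avg = mean(data),   # average per category and 2-hour bin
            data_sd = sd(data),      # NA when a bin holds a single value
            .groups = "drop")

out
```

Bins with no observations (such as 0:00 in the desired output) do not appear in the result; they would have to be added separately if they are needed.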