How to calculate moving sums of rows of unique observations in R (example provided)?
I have a data.frame called df, and I am trying to count unique devices on a moving basis within each hour, using windows that span 4 minutes (minute m through minute m + 4), as shown in expected_df. I have provided an example below. I did not find an existing question that meets my requirements, hence this...
df <- data.frame(customer = rep("xyz", 19),
                 device = c("a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","c"),
                 local_date = rep("2020-06-04", 19),
                 local_hour = rep(4, 19),
                 local_minute = c(1:6, 4:9, 6:12))
expected_df <- data.frame(customer = rep("xyz", 8),
                          local_hour = rep(4, 8),
                          local_date = rep("2020-06-04", 8),
                          local_minute_bucket = c("1_5","2_6","3_7","4_8","5_9","6_10","7_11","8_12"),
                          unique_devices = c(2,3,3,3,3,3,2,2))
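To make the target concrete, here is a minimal brute-force sketch in base R. It assumes each bucket covers minutes m through m + 4 inclusive, which is what the labels in expected_df suggest, and it only handles a single customer, date and hour. The names starts and counts are illustrative and not part of the question.

```r
# Same example data as df in the question
df <- data.frame(customer = rep("xyz", 19),
                 device = rep(c("a", "b", "c"), times = c(6, 6, 7)),
                 local_date = rep("2020-06-04", 19),
                 local_hour = rep(4, 19),
                 local_minute = c(1:6, 4:9, 6:12))

# Assumed window starts: every minute whose window still ends within the data
starts <- min(df$local_minute):(max(df$local_minute) - 4)

# For each start minute m, count the distinct devices seen in minutes m..m+4
counts <- sapply(starts, function(m) {
  in_window <- df$local_minute >= m & df$local_minute <= m + 4
  length(unique(df$device[in_window]))
})

data.frame(local_minute_bucket = paste(starts, starts + 4, sep = "_"),
           unique_devices = counts)
```

On the example data this gives the counts 2, 3, 3, 3, 3, 3, 2, 2 for the buckets 1_5 through 8_12.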
Here's an approach with dplyr, lubridate and purrr:
First, we build a proper date-time column so that windows can cross hours and days. Second, we arrange by time so that findInterval can work, since it expects a sorted vector. Next, we create the bucket label. Then we use findInterval to find the last row that falls within each bucket, and map2 to count the unique devices in the rows from the current one through that last row.
We then drop the time and device columns and use unique to get rid of duplicates.
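library(dplyr)
library(purrr)
library(lubridate)

df %>%
  # build a date-time so that windows can cross hours and days
  mutate(time = ymd_hm(paste0(local_date, " ", local_hour, ":", local_minute))) %>%
  # findInterval() needs the times sorted
  arrange(time) %>%
  mutate(local_minute_bucket = paste0(minute(time), "_", minute(time + minutes(4))),
         # count distinct devices from the current row through the last row
         # whose time is within 4 minutes; map2_int() keeps the count an integer
         unique_devices = map2_int(row_number(time),
                                   findInterval(time + minutes(4), time),
                                   ~ length(unique(device[.x:.y])))) %>%
  dplyr::select(-time, -device) %>%
  unique()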
   customer local_date local_hour local_minute local_minute_bucket unique_devices
1       xyz 2020-06-04          4            1                 1_5              2
2       xyz 2020-06-04          4            2                 2_6              3
3       xyz 2020-06-04          4            3                 3_7              3
4       xyz 2020-06-04          4            4                 4_8              3
6       xyz 2020-06-04          4            5                 5_9              3
8       xyz 2020-06-04          4            6                6_10              3
9       xyz 2020-06-04          4            6                6_10              2
11      xyz 2020-06-04          4            7                7_11              2
13      xyz 2020-06-04          4            8                8_12              2
15      xyz 2020-06-04          4            9                9_13              2
16      xyz 2020-06-04          4            9                9_13              1
17      xyz 2020-06-04          4           10               10_14              1
18      xyz 2020-06-04          4           11               11_15              1
19      xyz 2020-06-04          4           12               12_16              1
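In the output above, the buckets 6_10 and 9_13 each appear twice with different counts. Several rows share the same minute, and each window starts at the current row rather than at the first row of that minute, so later rows of a tied minute see fewer devices. One possible way around this, sketched in base R under the simplifying assumption of a single customer, date and hour (so the minute alone orders the rows), is to start every window at the first row of its minute. The names first_row, last_row, n_devices and res are illustrative.

```r
# Example data reduced to the two columns the sketch needs
df <- data.frame(device = rep(c("a", "b", "c"), times = c(6, 6, 7)),
                 local_minute = c(1:6, 4:9, 6:12))
df <- df[order(df$local_minute), ]

# First row of each minute (handles ties) and last row within minute + 4
first_row <- match(df$local_minute, df$local_minute)
last_row  <- findInterval(df$local_minute + 4, df$local_minute)

# Distinct devices between those two row positions
n_devices <- mapply(function(i, j) length(unique(df$device[i:j])),
                    first_row, last_row)

# Rows of a tied minute now agree, so unique() leaves one row per minute
res <- unique(data.frame(local_minute = df$local_minute,
                         unique_devices = n_devices))
res
```

This leaves one row per starting minute, with the counts 2, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1 for minutes 1 through 12.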
Here is a data.table approach:
library( data.table )
library( lubridate )
#make df a data.table
setDT( df )
#create a proper timestamp
df[, timestamp := as.POSIXct( paste0( local_date, "T", local_hour, ":", local_minute ),
                              format = "%Y-%m-%dT%H:%M" )]
#create a data.table with intervals, one starting at every minute
dt.intervals <- data.table( start = seq( min( df$timestamp, na.rm = TRUE ),
                                         max( df$timestamp, na.rm = TRUE ),
                                         by = "1 min" ) )
dt.intervals[, end := start %m+% minutes(4)]
dt.intervals[, local_minute_bucket := paste( format( start, "%M"), format( end, "%M"), sep = "_" )]
#non-equi join: match every row of df to each interval that contains its timestamp
ans <- dt.intervals[ df, on = .( start <= timestamp, end >= timestamp ), allow.cartesian = TRUE ]
#and summarise
ans[, .(unique_devices = uniqueN( device )), by = .(customer, local_hour, local_date, local_minute_bucket )]
#     customer local_hour local_date local_minute_bucket unique_devices
# 1:      xyz          4 2020-06-04               01_05              2
# 2:      xyz          4 2020-06-04               02_06              3
# 3:      xyz          4 2020-06-04               03_07              3
# 4:      xyz          4 2020-06-04               04_08              3
# 5:      xyz          4 2020-06-04               05_09              3
# 6:      xyz          4 2020-06-04               06_10              3
# 7:      xyz          4 2020-06-04               07_11              2
# 8:      xyz          4 2020-06-04               08_12              2
# 9:      xyz          4 2020-06-04               09_13              2
#10:      xyz          4 2020-06-04               10_14              1
#11:      xyz          4 2020-06-04               11_15              1
#12:      xyz          4 2020-06-04               12_16              1
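The result above has 12 buckets because an interval starts at every minute up to the last timestamp, while expected_df stops at 8_12. If only windows that end within the observed data are wanted (an assumption about the intent), the interval starts can stop 4 minutes before the last timestamp. A minimal base R sketch, with the first and last timestamps hard-coded and the time zone fixed to UTC purely for illustration:

```r
# First and last observed timestamps of the example (hard-coded assumption)
first_obs <- as.POSIXct("2020-06-04 04:01", tz = "UTC")
last_obs  <- as.POSIXct("2020-06-04 04:12", tz = "UTC")

# Start a window every minute, but stop 4 minutes (240 seconds) before the end
start <- seq(first_obs, last_obs - 4 * 60, by = "1 min")
end   <- start + 4 * 60

# Same label format as local_minute_bucket in the data.table code
bucket <- paste(format(start, "%M", tz = "UTC"),
                format(end, "%M", tz = "UTC"), sep = "_")
bucket
```

This produces the eight labels 01_05 through 08_12. In the data.table code, the same effect should be achievable by subtracting 4 minutes from the upper bound passed to seq.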
Can't come up with anything better than this.
A function that returns the summary for a particular time interval:
library(dplyr)
library(tidyr)
library(purrr)

Get_data <- function(data, minute) {
  data %>%
    filter(local_minute >= minute & local_minute <= minute + 4) %>%
    summarise(local_minute_bucket = paste(min(local_minute),
                                          max(local_minute), sep = '-'),
              unique_devices = n_distinct(device))
}
Split the data by date and hour, and apply this function for every starting minute up to max - 4 minutes.
df %>%
  group_split(local_date, local_hour) %>%
  map_df(function(data) data %>%
           group_by(local_date, local_hour) %>%
           summarise(new = list(map(seq_len(max(local_minute) - 4),
                                    Get_data, data = data)))) %>%
  unnest(new) %>%
  unnest_wider(new)
#  local_date local_hour local_minute_bucket unique_devices
#  <chr>           <dbl> <chr>                        <int>
#1 2020-06-04          4 1-5                              2
#2 2020-06-04          4 2-6                              3
#3 2020-06-04          4 3-7                              3
#4 2020-06-04          4 4-8                              3
#5 2020-06-04          4 5-9                              3
#6 2020-06-04          4 6-10                             3
#7 2020-06-04          4 7-11                             2
#8 2020-06-04          4 8-12                             2