How to calculate moving sums of rows of unique observation in R (example provided)?

I have a data.frame called df, and I am trying to count unique devices within each hour over a moving window whose start and end are 4 minutes apart, as shown in expected_df. I have provided an example below. I did not find any existing question meeting my requirements, hence this...

Given data frame

df<-data.frame(customer=rep("xyz",19),
           device=c("a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","c"),
           local_date=rep("2020-06-04",19),
           local_hour=rep(4,19),
           local_minute=c(1:6,4:9,6:12))

Expected data frame

expected_df<-data.frame(customer=rep("xyz",8),local_hour=rep(4,8),
                    local_date=rep("2020-06-04",8),
                    local_minute_bucket=c("1_5","2_6","3_7","4_8","5_9","6_10","7_11","8_12"),
                    unique_devices=c(2,3,3,3,3,3,2,2))

Here's an approach with dplyr, lubridate and purrr:

First, we build a date-time column so that windows can cross hours and days. Second, we arrange by time, since findInterval requires a sorted vector. Next, we create the bucket label. Then we use findInterval to find the last row that falls within each bucket, and map2 to count the unique devices in those rows.

We then drop the time and device columns and use unique to get rid of duplicates.
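To make the findInterval step concrete, here is a minimal base R sketch on a toy vector of minutes (the vector mins is made up for illustration and is not the question's data). For each element, findInterval(mins + 4, mins) gives the position of the last element that is no more than 4 minutes later.

```r
# Toy, already sorted vector of minutes (hypothetical values)
mins <- c(1, 2, 3, 4, 4, 5, 6, 6, 6, 7)

# For each element, index of the last element with value <= element + 4
last_in_window <- findInterval(mins + 4, mins)
print(last_in_window)
# 6 9 10 10 10 10 10 10 10 10
```

So the window that starts at minute 1 covers positions 1 to 6, the one that starts at minute 2 covers positions 2 to 9, and so on. The pipeline below applies the same idea to date-times.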

library(dplyr)
library(purrr)
library(lubridate)
df %>% 
  # build a proper date-time so that windows can cross hours and days
  mutate(time = ymd_hm(paste0(local_date," ",local_hour,":",local_minute))) %>%
  # findInterval needs the times to be sorted
  arrange(time) %>%
  mutate(local_minute_bucket = paste0(minute(time),"_",minute(time+minutes(4))),
         # map2_int rather than map2_chr, since length() returns an integer
         unique_devices = map2_int(row_number(time),
                                   findInterval(time + minutes(4),time),
                                   ~length(unique(device[.x:.y])))) %>%
  dplyr::select(-time,-device) %>%
  unique()
   customer local_date local_hour local_minute local_minute_bucket unique_devices
1       xyz 2020-06-04          4            1                 1_5              2
2       xyz 2020-06-04          4            2                 2_6              3
3       xyz 2020-06-04          4            3                 3_7              3
4       xyz 2020-06-04          4            4                 4_8              3
6       xyz 2020-06-04          4            5                 5_9              3
8       xyz 2020-06-04          4            6                6_10              3
9       xyz 2020-06-04          4            6                6_10              2
11      xyz 2020-06-04          4            7                7_11              2
13      xyz 2020-06-04          4            8                8_12              2
15      xyz 2020-06-04          4            9                9_13              2
16      xyz 2020-06-04          4            9                9_13              1
17      xyz 2020-06-04          4           10               10_14              1
18      xyz 2020-06-04          4           11               11_15              1
19      xyz 2020-06-04          4           12               12_16              1
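
In the output above, buckets 6_10 and 9_13 each appear twice with different counts. This seems to come from rows that share a timestamp: each row's window starts at its own position, so a later tied row no longer sees the devices of the earlier tied rows. A minimal base R sketch of one possible tweak, which is not part of the original answer, starts every window at the first row with that timestamp by using match(). The helper names secs, first_row and last_row are made up for illustration.

```r
# Same data as in the question
df <- data.frame(customer = rep("xyz", 19),
                 device = c(rep("a", 6), rep("b", 6), rep("c", 7)),
                 local_date = rep("2020-06-04", 19),
                 local_hour = rep(4, 19),
                 local_minute = c(1:6, 4:9, 6:12))

# Date-time in seconds, so that windows could cross hours and days
time <- as.POSIXct(paste0(df$local_date, " ", df$local_hour, ":", df$local_minute),
                   format = "%Y-%m-%d %H:%M", tz = "UTC")
ord <- order(time)
secs <- as.numeric(time[ord])
device <- df$device[ord]

# Window start: first row sharing the timestamp (this handles ties)
first_row <- match(secs, secs)
# Window end: last row at most 4 minutes later
last_row <- findInterval(secs + 4 * 60, secs)

unique_devices <- mapply(function(i, j) length(unique(device[i:j])),
                         first_row, last_row)

res <- unique(data.frame(local_minute = df$local_minute[ord],
                         unique_devices = unique_devices))
print(res$unique_devices)
# 2 3 3 3 3 3 2 2 2 1 1 1  (one value per starting minute 1..12)
```

With this change each starting minute yields a single row, and the counts agree with the data.table result further down.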

Here is a data.table approach:

#load the required packages
library( data.table )
library( lubridate )

#make df a data.table
setDT( df )
#create a proper timestamp
df[, timestamp := as.POSIXct( paste0( local_date, "T", local_hour, ":", local_minute ),
                              format = "%Y-%m-%dT%H:%M" )]
#create a data.table with intervals
dt.intervals <- data.table( start = seq( min( df$timestamp, na.rm = TRUE ),
                                         max( df$timestamp, na.rm = TRUE ),
                                         by = "1 min" ) )
dt.intervals[, end := start %m+% minutes(4)]
dt.intervals[, local_minute_bucket := paste( format( start, "%M"), format( end, "%M"), sep = "_" )]
#join
ans <- dt.intervals[ df, on = .( start <= timestamp, end >= timestamp ), allow.cartesian = TRUE ]
#and summarise
ans[, .(unique_devices = uniqueN( device )), by = .(customer, local_hour, local_date, local_minute_bucket )]


#    customer local_hour local_date local_minute_bucket unique_devices
# 1:      xyz          4 2020-06-04               01_05              2
# 2:      xyz          4 2020-06-04               02_06              3
# 3:      xyz          4 2020-06-04               03_07              3
# 4:      xyz          4 2020-06-04               04_08              3
# 5:      xyz          4 2020-06-04               05_09              3
# 6:      xyz          4 2020-06-04               06_10              3
# 7:      xyz          4 2020-06-04               07_11              2
# 8:      xyz          4 2020-06-04               08_12              2
# 9:      xyz          4 2020-06-04               09_13              2
#10:      xyz          4 2020-06-04               10_14              1
#11:      xyz          4 2020-06-04               11_15              1
#12:      xyz          4 2020-06-04               12_16              1

Can't come up with anything better than this.

A function which gives the data for a particular time interval:

library(dplyr)
library(tidyr)
library(purrr)

Get_data <- function(data, minute) {

  data %>%
    filter(local_minute >= minute & local_minute <= minute + 4) %>%
    summarise(local_minute_bucket = paste(min(local_minute), 
                                          max(local_minute), sep = '-'), 
              unique_devices = n_distinct(device))
}
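
As a quick check of what the function returns for a single window, here is a small usage sketch. It repeats the question's data and the Get_data definition so that it runs on its own; the variable name res is made up for illustration.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(customer = rep("xyz", 19),
                 device = c(rep("a", 6), rep("b", 6), rep("c", 7)),
                 local_date = rep("2020-06-04", 19),
                 local_hour = rep(4, 19),
                 local_minute = c(1:6, 4:9, 6:12))

# Get_data as defined above
Get_data <- function(data, minute) {
  data %>%
    filter(local_minute >= minute & local_minute <= minute + 4) %>%
    summarise(local_minute_bucket = paste(min(local_minute),
                                          max(local_minute), sep = '-'),
              unique_devices = n_distinct(device))
}

# Window starting at minute 1 covers minutes 1 to 5: devices a and b
res <- Get_data(df, 1)
print(res)
# bucket "1-5" with 2 unique devices
```

Note that the bucket label is built from the smallest and largest minutes actually observed in the window, so it matches the nominal window only when both ends have data.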

Split the data by date and hour, then apply this function for every starting minute up to max - 4.

df %>%
  group_split(local_date, local_hour) %>%
  map_df(function(data) data %>% 
                          group_by(local_date, local_hour) %>%
                          summarise(new = list(map(seq_len(max(local_minute) - 4),
                                    Get_data, data = data)))) %>%
   unnest(new) %>%
   unnest_wider(new)


#  local_date local_hour local_minute_bucket unique_devices
#  <chr>           <dbl> <chr>                        <int>
#1 2020-06-04          4 1-5                              2
#2 2020-06-04          4 2-6                              3
#3 2020-06-04          4 3-7                              3
#4 2020-06-04          4 4-8                              3
#5 2020-06-04          4 5-9                              3
#6 2020-06-04          4 6-10                             3
#7 2020-06-04          4 7-11                             2
#8 2020-06-04          4 8-12                             2
