简体   繁体   中英

R: Calculating the number of occurrences within a specific time period in the past for each unique individual in a dataset in R

I'm attempting to tally the number of times an event occurred for a given individual within a specific period of past time. In this particular case, I need to know, for each new observation (which reflects a single scheduling request), how many times the individual has scheduled a trip during the preceding 60 days (trip_scheduled). Eventually I will need to tally the number of times that person cancelled on the same day as the scheduled trip for the preceding 60 days. But I'm starting with just the tally in the "moving" 60-day period.

I found some elegant answers to a similar but slightly different problem in this post: R: calculate the number of occurrences of a specific event in a specified time future

My situation differs in a few ways: First, I'm trying to look at a previous time period, and I don't know if that will change my approach, and, two, I need to do the analysis for more than 40,000 individuals, which I've been trying to accomplish through a mix of the code I found in the other answer, a for loop (which I know is frowned upon) and dplyr grouping. It isn't working at all.

Would anyone be able to help point me in the right direction? I'd love to stick to dplyr and base. I just don't know much about data.table.

This is the code and test data I've been trying to noodle on:

test_set2 <- structure(list(tripID = c("20180112-100037-674-101", "20180112-100037-674-201", 
                                       "20180112-100037-674-301", "20180113-100037-676-101", "20180113-100037-676-201", 
                                       "20180115-100037-675-101", "20180115-100037-675-201", "20180116-100037-677-101", 
                                       "20180116-100037-677-201", "20180131-100037-678-101", "20180101-100146-707-101", 
                                       "20180101-100146-707-201", "20180102-100146-708-101", "20180102-100146-708-201", 
                                       "20180103-100146-709-101", "20180103-100146-709-201", "20180104-100146-710-101", 
                                       "20180104-100146-710-201", "20180105-100146-711-101", "20180105-100146-711-201", 
                                       "20180403-100532-223-101", "20180403-100532-223-201", "20180620-100532-224-101", 
                                       "20180620-100532-224-201", "20180704-100532-225-101", "20180704-100532-225-201", 
                                       "20180926-100532-228-101", "20180926-100532-228-201", "20180927-100532-226-101", 
                                       "20180927-100532-226-201"), CUSTOMER_ID = c(100037L, 100037L, 
                                                                                   100037L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L, 
                                                                                   100037L, 100146L, 100146L, 100146L, 100146L, 100146L, 100146L, 
                                                                                   100146L, 100146L, 100146L, 100146L, 100532L, 100532L, 100532L, 
                                                                                   100532L, 100532L, 100532L, 100532L, 100532L, 100532L, 100532L
                                       ), trip_date = structure(c(17543, 17543, 17543, 17544, 17544, 
                                                                  17546, 17546, 17547, 17547, 17562, 17532, 17532, 17533, 17533, 
                                                                  17534, 17534, 17535, 17535, 17536, 17536, 17624, 17624, 17702, 
                                                                  17702, 17716, 17716, 17800, 17800, 17801, 17801), class = "Date"), 
                            trip_scheduled = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                                               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), same_day_cancel = c(1, 
                                                                                                                       1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                                                                                                                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -30L), groups = structure(list(
                                                                                                                         CUSTOMER_ID = c(100037L, 100146L, 100532L), .rows = list(
                                                                                                                           1:10, 11:20, 21:30)), row.names = c(NA, -3L), class = c("tbl_df", 
                                                                                                                                                                                   "tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
                                                                                                                                                                                                                                  "tbl_df", "tbl", "data.frame"))

running_frame <- test_set2[1,]

unique_customers <- unique(test_set2$CUSTOMER_ID)

for (cust in unique_customers){
  temp_events <- test_set2 %>% filter(CUSTOMER_ID == i)
  cs = cumsum(temp_events$trip_scheduled) # cumulative number of trips of individual
  output_temp <- data.frame(temp_events, 
                            trips_minus_60 = cs[findInterval(temp_events$trip_date - 60, temp_events$trip_date, left.open = TRUE)] - cs)
  new_table <- rbind(new_table,output_temp)

}

This is the error I generated most recently:

Error in data.frame(temp_events, trips_minus_60 = cs[findInterval(temp_events$trip_date - : arguments imply differing number of rows: 10, 0

I'm not sure this meets your needs, but this is based on @Axeman's tidyverse solution you linked to. After group_by your CUSTOMER_ID you can sum all rows with trip_scheduled is 1 and dates fall between current date and 60 days prior. I would expect you could do something similar for same_day_cancel as well.

library(tidyverse)

test_set2 %>% 
  group_by(CUSTOMER_ID) %>%
    mutate(schedule_60 = unlist(map(trip_date, ~sum(trip_scheduled == 1 & between(trip_date, . - 60, .))))) %>%
  print(n=30)

# A tibble: 30 x 6
# Groups:   CUSTOMER_ID [3]
   tripID                  CUSTOMER_ID trip_date  trip_scheduled same_day_cancel schedule_60
   <chr>                         <int> <date>              <dbl>           <dbl>       <int>
 1 20180112-100037-674-101      100037 2018-01-12              1               1           3
 2 20180112-100037-674-201      100037 2018-01-12              1               1           3
 3 20180112-100037-674-301      100037 2018-01-12              1               1           3
 4 20180113-100037-676-101      100037 2018-01-13              1               0           5
 5 20180113-100037-676-201      100037 2018-01-13              1               0           5
 6 20180115-100037-675-101      100037 2018-01-15              1               1           7
 7 20180115-100037-675-201      100037 2018-01-15              1               1           7
 8 20180116-100037-677-101      100037 2018-01-16              1               0           9
 9 20180116-100037-677-201      100037 2018-01-16              1               0           9
10 20180131-100037-678-101      100037 2018-01-31              1               0          10
11 20180101-100146-707-101      100146 2018-01-01              1               1           2
12 20180101-100146-707-201      100146 2018-01-01              1               1           2
13 20180102-100146-708-101      100146 2018-01-02              1               1           4
14 20180102-100146-708-201      100146 2018-01-02              1               1           4
15 20180103-100146-709-101      100146 2018-01-03              1               1           6
16 20180103-100146-709-201      100146 2018-01-03              1               1           6
17 20180104-100146-710-101      100146 2018-01-04              1               1           8
18 20180104-100146-710-201      100146 2018-01-04              1               1           8
19 20180105-100146-711-101      100146 2018-01-05              1               1          10
20 20180105-100146-711-201      100146 2018-01-05              1               1          10
21 20180403-100532-223-101      100532 2018-04-03              1               0           2
22 20180403-100532-223-201      100532 2018-04-03              1               0           2
23 20180620-100532-224-101      100532 2018-06-20              1               0           2
24 20180620-100532-224-201      100532 2018-06-20              1               0           2
25 20180704-100532-225-101      100532 2018-07-04              1               0           4
26 20180704-100532-225-201      100532 2018-07-04              1               0           4
27 20180926-100532-228-101      100532 2018-09-26              1               0           2
28 20180926-100532-228-201      100532 2018-09-26              1               0           2
29 20180927-100532-226-101      100532 2018-09-27              1               0           4
30 20180927-100532-226-201      100532 2018-09-27              1               0           4

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM