How can I implement a dynamic count within R without using a for loop?

I want to count the distinct customers who purchased from the company between each SKU's first and last purchase dates. I have already computed, in SQL, the distinct number of customers for each SKU, along with each SKU's first and last purchase date.

I have code that solves this problem; however, it uses a for loop and takes far too long because there are tens of thousands of SKUs. This is a short example of what my SKU table looks like:

SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')

SKUCount <- data.frame(SKUID, NumberOfCustomers, 
                       SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers', 
                        'FirstPurchase', 'LastPurchase')

Then I have another table, which I call OrderTable, that is about 6 million rows long: a select distinct of the sales date and the CustomerID. I can't summarize the distinct count on a day-to-day basis and sum the daily counts together, because this would double count customers who have purchased on separate days. I have to recalculate the distinct count for every FirstPurchase/LastPurchase combination that appears in my SKUCount table. From there, I use the following code to calculate the distinct number of customers in the given time frame:

library(dplyr)

for (i in 1:nrow(SKUCount))
{
  SKUCount[i, c('DateCustomers')] <-
    sapply(OrderTable %>%
              filter(Date >= SKUCount[i,'FirstPurchase'],
                     Date <= SKUCount[i,'LastPurchase']) %>%
              select(CustomerID),
           function(x) length(unique(x)))
}
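
To make the double-counting point concrete, here is a minimal sketch on made-up data (the `orders` table below is hypothetical and is not the real OrderTable). Customer 121 buys on two different days, so summing the per-day distinct counts overstates the distinct count for the whole window:

```r
# Hypothetical toy data: customer "121" purchases on two separate days
orders <- data.frame(
  Date = as.Date(c("2014-06-02", "2014-06-02", "2014-06-03", "2014-06-03")),
  CustomerID = c("121", "212", "121", "124")
)

# Distinct customers per day, then summed across days
daily <- tapply(orders$CustomerID, orders$Date, function(x) length(unique(x)))
summed_daily <- sum(daily)

# Distinct customers over the whole window
overall <- length(unique(orders$CustomerID))

summed_daily  # 4: customer "121" is counted once per day
overall       # 3
```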

As I previously noted, this piece of code DOES work, but it's very slow (about 0.5 seconds per row). Is there a quicker way to calculate the distinct counts, or is there a more clever solution to my problem?

Try this one:

    library("purrrlyr")
    library("dplyr")

    # First, create the datasets, including OrderTable (please correct me if I got it wrong!):
    SKUID <- c('123', '456', '789')
    NumberOfCustomers <- c(204543, 92703, 305727)
    SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
    SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')

    SKUCount <- data.frame(SKUID, NumberOfCustomers, 
                           SKUFirstPurchase, SKULastPurchase)
    colnames(SKUCount) <- c('SKU', 'NumberOfCustomers', 
                            'FirstPurchase', 'LastPurchase')

    OrderTable <- data.frame(Date=c('2014-06-02', '2014-08-02', '2015-02-03', '2017-05-13'
    ,'2015-05-02', '2014-06-03', '2016-07-13', '2017-09-30', '2018-07-01', '2019-01-09'),
    CustomerID=c('121','212','3434','24232','121','124','212','131','412','3634'))

    # Convert the date columns (factors or strings) to Date
    SKUCount$FirstPurchase<-as.Date(SKUCount$FirstPurchase,format = "%Y-%m-%d")
    SKUCount$LastPurchase<-as.Date(SKUCount$LastPurchase,format = "%Y-%m-%d")
    OrderTable$Date<-as.Date(OrderTable$Date,format = "%Y-%m-%d")

    # Define a function, FUN, that limits OrderTable to the rows whose Date falls
    # between the two date arguments (FirstPurchase and LastPurchase) and returns
    # the distinct count of CustomerIDs:
    FUN <- function(FirstPurchase, LastPurchase) {
      Rtrn <- OrderTable %>%
        filter(Date >= FirstPurchase,
               Date <= LastPurchase) %>%
        summarize(n_distinct(CustomerID))
      as.numeric(Rtrn)
    }

Next you want to take your dataset, SKUCount, and create a variable called DateCustomers by applying the function, FUN, to every row of it:

    SKUCount %>%
      rowwise() %>%
      mutate(DateCustomers = FUN(FirstPurchase, LastPurchase))
    # Source: local data frame [3 x 5]
    # Groups: <by row>
    #
    # # A tibble: 3 x 5
    #   SKU   NumberOfCustomers FirstPurchase LastPurchase DateCustomers
    #   <fct>             <dbl> <date>        <date>               <dbl>
    # 1 123              204543 2014-05-02    2017-09-30               6
    # 2 456               92703 2014-02-03    2018-07-01               7
    # 3 789              305727 2016-05-13    2019-01-09               5
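
Note that `rowwise()` still evaluates the function once per row, so each call filters all of OrderTable again. The following is a base R sketch of one possible way to reduce that work; it is not part of the original answer and has not been benchmarked on the real data. It assumes the date columns are already of class Date. OrderTable is sorted by date once, `findInterval()` locates each window as a contiguous index range, and each distinct FirstPurchase/LastPurchase pair is computed only once. The helper name `count_window` and the intermediate variables are illustrative.

```r
# Toy data mirroring the tables above (date columns already converted to Date)
SKUCount <- data.frame(
  SKU = c("123", "456", "789"),
  FirstPurchase = as.Date(c("2014-05-02", "2014-02-03", "2016-05-13")),
  LastPurchase  = as.Date(c("2017-09-30", "2018-07-01", "2019-01-09"))
)
OrderTable <- data.frame(
  Date = as.Date(c("2014-06-02", "2014-08-02", "2015-02-03", "2017-05-13",
                   "2015-05-02", "2014-06-03", "2016-07-13", "2017-09-30",
                   "2018-07-01", "2019-01-09")),
  CustomerID = c("121", "212", "3434", "24232", "121",
                 "124", "212", "131", "412", "3634")
)

# Sort the orders by date once, so every window is a contiguous slice
ord   <- OrderTable[order(OrderTable$Date), ]
dates <- as.numeric(ord$Date)
ids   <- ord$CustomerID

# Hypothetical helper: distinct customers with first <= Date <= last
count_window <- function(first, last) {
  # index of the first order on or after `first`
  lo <- findInterval(as.numeric(first), dates, left.open = TRUE) + 1L
  # index of the last order on or before `last`
  hi <- findInterval(as.numeric(last), dates)
  if (hi < lo) return(0L)  # no orders fall inside the window
  length(unique(ids[lo:hi]))
}

# Compute each distinct (FirstPurchase, LastPurchase) pair only once
key       <- paste(SKUCount$FirstPurchase, SKUCount$LastPurchase)
ukey      <- unique(key)
first_idx <- match(ukey, key)
counts    <- mapply(count_window,
                    SKUCount$FirstPurchase[first_idx],
                    SKUCount$LastPurchase[first_idx])

# Map the per-window counts back onto every SKU row
SKUCount$DateCustomers <- counts[match(key, ukey)]
SKUCount$DateCustomers  # 6 7 5, matching the rowwise() result above
```

Each window still calls `unique()` on its slice, so very wide windows remain expensive. The saving comes from avoiding a full-table filter per SKU and from skipping repeated date pairs.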
