I want to count the distinct customers who purchased from the company between each SKU's first and last purchase dates. I have already computed, in SQL, the distinct number of customers for each SKU, along with its first and last purchase dates.
I have code that solves this problem; however, it uses a for loop and takes far too long because there are tens of thousands of SKUs. Here is a short example of what my SKU table looks like:
SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')
SKUCount <- data.frame(SKUID, NumberOfCustomers,
SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
'FirstPurchase', 'LastPurchase')
Then I have another table, about 6 million rows long, which is a SELECT DISTINCT of the sales date and the CustomerID; I call it OrderTable. I can't compute the distinct count per day and sum the daily counts, because that would double count customers who purchased on separate days. I have to recalculate the distinct count for every FirstPurchase/LastPurchase combination that appears in my SKUCount table. I use the following code to calculate the distinct number of customers in each time frame:
library(dplyr)
for (i in 1:nrow(SKUCount))
{
SKUCount[i, c('DateCustomers')] <-
sapply(OrderTable %>%
filter(Date >= SKUCount[i,'FirstPurchase'],
Date <= SKUCount[i,'LastPurchase']) %>%
select(CustomerID),
function(x) length(unique(x)))
}
As I previously noted, this piece of code DOES work, but it's very slow (~0.5 second for each row). Is there a quicker way to calculate the distinct counts, or is there a more clever solution to my problem?
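To illustrate the double-counting concern from the question, here is a minimal sketch in base R. The `orders` data frame and its values are hypothetical toy data, not taken from the real OrderTable; it only shows why summing per-day distinct counts overstates the distinct count over a date range.

```r
# Hypothetical toy data: customer "A" buys on two different days.
orders <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02")),
  CustomerID = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Distinct customers per day, then summed: "A" is counted once per day.
daily <- tapply(orders$CustomerID, orders$Date,
                function(x) length(unique(x)))
sum_of_daily <- sum(daily)

# Distinct customers over the whole range: "A" is counted only once.
range_distinct <- length(unique(orders$CustomerID))

print(c(sum_of_daily = sum_of_daily, range_distinct = range_distinct))
```

The two numbers differ whenever a customer appears on more than one day inside the window, which is why the count has to be recomputed per date window.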
Try this one:
library("dplyr")  # rowwise(), mutate(), filter() and n_distinct() all come from dplyr
# First, create the datasets, including OrderTable (please correct me if I got it wrong!):
SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')
SKUCount <- data.frame(SKUID, NumberOfCustomers,
SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
'FirstPurchase', 'LastPurchase')
OrderTable <- data.frame(Date=c('2014-06-02', '2014-08-02', '2015-02-03', '2017-05-13'
,'2015-05-02', '2014-06-03', '2016-07-13', '2017-09-30', '2018-07-01', '2019-01-09'),
CustomerID=c('121','212','3434','24232','121','124','212','131','412','3634'))
# Convert the date columns (factors or character strings) to Date
SKUCount$FirstPurchase<-as.Date(SKUCount$FirstPurchase,format = "%Y-%m-%d")
SKUCount$LastPurchase<-as.Date(SKUCount$LastPurchase,format = "%Y-%m-%d")
OrderTable$Date<-as.Date(OrderTable$Date,format = "%Y-%m-%d")
# Define a function, named FUN, which restricts OrderTable to the rows whose
# Date lies between the two date arguments (FirstPurchase and LastPurchase,
# inclusive) and returns the distinct count of CustomerIDs in that range:
FUN <- function(FirstPurchase,LastPurchase){
Rtrn<-OrderTable %>%
filter(Date >= FirstPurchase,
Date <= LastPurchase) %>%
summarize(n_distinct(CustomerID))
as.numeric(Rtrn)
}
SKUCount %>%
rowwise() %>%
mutate(DateCustomers= FUN(FirstPurchase,LastPurchase))
# Source: local data frame [3 x 5]
# Groups: <by row>
#
# # A tibble: 3 x 5
# SKU NumberOfCustomers FirstPurchase LastPurchase DateCustomers
# <fct> <dbl> <date> <date> <dbl>
# 1 123 204543 2014-05-02 2017-09-30 6
# 2 456 92703 2014-02-03 2018-07-01 7
# 3 789 305727 2016-05-13 2019-01-09 5
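Note that rowwise() still evaluates FUN once per SKU, so the amount of filtering work is similar to the original loop. One possible way to cut the work, sketched below in base R, is to compute the count once per unique FirstPurchase/LastPurchase pair and then map the result back to every SKU. This is only a sketch under an assumption: it helps only if many SKUs share the same date window, which may or may not hold for the real data. The SKU '999' is a hypothetical row added here so that two SKUs share a window; the other values are the toy data from the answer.

```r
# Toy data from the answer, plus a hypothetical SKU '999' that shares
# its date window with SKU '123'.
SKUCount <- data.frame(
  SKU = c("123", "456", "789", "999"),
  FirstPurchase = as.Date(c("2014-05-02", "2014-02-03", "2016-05-13", "2014-05-02")),
  LastPurchase  = as.Date(c("2017-09-30", "2018-07-01", "2019-01-09", "2017-09-30")),
  stringsAsFactors = FALSE
)
OrderTable <- data.frame(
  Date = as.Date(c("2014-06-02", "2014-08-02", "2015-02-03", "2017-05-13",
                   "2015-05-02", "2014-06-03", "2016-07-13", "2017-09-30",
                   "2018-07-01", "2019-01-09")),
  CustomerID = c("121", "212", "3434", "24232", "121", "124", "212",
                 "131", "412", "3634"),
  stringsAsFactors = FALSE
)

# Keep each distinct date window only once.
windows <- unique(SKUCount[, c("FirstPurchase", "LastPurchase")])

# Count distinct customers once per window (dates are inclusive).
windows$DateCustomers <- vapply(seq_len(nrow(windows)), function(i) {
  in_range <- OrderTable$Date >= windows$FirstPurchase[i] &
              OrderTable$Date <= windows$LastPurchase[i]
  length(unique(OrderTable$CustomerID[in_range]))
}, integer(1))

# Map the per-window counts back to every SKU, keeping the row order.
sku_key    <- paste(SKUCount$FirstPurchase, SKUCount$LastPurchase)
window_key <- paste(windows$FirstPurchase, windows$LastPurchase)
SKUCount$DateCustomers <- windows$DateCustomers[match(sku_key, window_key)]

print(SKUCount)
```

With four SKUs but only three distinct windows, the expensive filter runs three times instead of four; the saving grows with the number of SKUs that share a window.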