Generating dummy webshop data in R: Incorporating parameters when randomly generating transactions

For a course I am currently taking, I am trying to build dummy transaction, customer and product datasets to showcase a machine learning use case in a webshop environment, as well as a financial dashboard. Unfortunately, we have not been given dummy data. I figured this would be a nice way to improve my R knowledge, but I am having serious difficulty getting it to work.

The idea is that I specify some parameters/rules (arbitrary and fictitious, but suitable for demonstrating a certain clustering algorithm). I'm basically trying to hide a pattern and then recover it using machine learning (not part of this question). The pattern I'm hiding is based on the product adoption life cycle, to show how identifying different customer types could be used for targeted marketing.

I'll demonstrate what I'm looking for. I'd like to keep it as realistic as possible. I attempted to do so by drawing the number of transactions per customer and other characteristics from normal distributions; I am completely open to other ways of doing this.

This is how far I have come. First, build a table of customers:

# Define customer types and their respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15)   # Probability of being in each group.

set.seed(1)   # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000),
  CustomerType = sample(CustomerTypes, size=10000,
                                  replace=TRUE, prob=PropCustTypes),
  NumBought = rnorm(10000,3,2))   # Number of transactions to generate, open to alternative solutions?
Customers$NumBought[Customers$NumBought<0] <- 0   # Floor NumBought at 0
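
One alternative to clamping a normal draw, sketched under the assumption that a count distribution with mean 3 is an acceptable stand-in for `rnorm(10000, 3, 2)`: `rpois()` returns non-negative whole numbers directly, so no clean-up step is needed. The names `NumBoughtAlt` and `NumBoughtWide` are made up for this illustration.

```r
set.seed(1)
n_customers <- 10000

# Poisson draws are whole numbers >= 0, so they can be used as counts as-is
NumBoughtAlt <- rpois(n_customers, lambda = 3)

# If more spread than a Poisson allows is wanted, a negative binomial
# with the same mean is one option (variance = mu + mu^2 / size)
NumBoughtWide <- rnbinom(n_customers, mu = 3, size = 2)

summary(NumBoughtAlt)
```

Either vector could replace the `NumBought` column; which one is more realistic depends on how heavy the tail of repeat buyers should be.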

Next, generate a table of products to choose from:

Products <- data.frame(
  DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
  SuggestedPrice = rnorm(50, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice<10] <- 10   # Floor product price at $10
Products$DateReleased[Products$DateReleased<as.Date("2013-04-10")] <- as.Date("2013-04-10")   # No release date earlier than 1 year ago

Now I would like to generate n transactions (the number is in the customer table above), based on the following parameters for each variable that is currently relevant:

Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"),
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10)) # Average Discount incurred when purchasing.

   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4   Dealseekers            0.6             0.05          0.35         12     0.10

The idea is that 'EarlyAdopters' would have (on average, normally distributed) 10% of transactions labelled 'BySearchEngine', 60% 'ByDirectCustomer' and 30% 'ByPartnerBlog'. These values need to be mutually exclusive: in the final dataset, a transaction cannot be obtained via both a PartnerBlog and a Search Engine. The options are:

ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")
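
A minimal sketch of how the mutually exclusive channel label could be drawn: one call to `sample()` per customer type, using that type's row of probabilities. The matrix `chan_prob` repeats the values from the parameter table, and `tx_type` is a made-up vector of per-transaction customer types used only for illustration.

```r
set.seed(1)
channels <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# Rows: customer types; columns: channel probabilities (each row sums to 1)
chan_prob <- rbind(
  EarlyAdopter  = c(0.10, 0.60, 0.30),
  Pragmatists   = c(0.40, 0.30, 0.30),
  Conservatives = c(0.50, 0.15, 0.35),
  Dealseekers   = c(0.60, 0.05, 0.35))
colnames(chan_prob) <- channels

# Hypothetical transactions: one customer type per row
tx_type <- sample(rownames(chan_prob), 5000, replace = TRUE,
                  prob = c(.10, .45, .30, .15))

# Draw exactly one channel per transaction, type by type
referred_by <- character(length(tx_type))
for (type in rownames(chan_prob)) {
  idx <- which(tx_type == type)
  referred_by[idx] <- sample(channels, length(idx), replace = TRUE,
                             prob = chan_prob[type, ])
}

# Observed shares per type should sit close to the specified probabilities
round(prop.table(table(tx_type, referred_by), margin = 1), 2)
```

Because each transaction receives a single label, the channels cannot overlap, and the per-type shares only match the targets on average.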

Furthermore, I'd like to generate a discount variable that is normally distributed around the means above. For simplicity, the standard deviation may be mean/5.
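
The discount rule can be sketched in vectorised form, assuming the standard deviation is mean/5 as stated and that negative draws should be floored at zero (the flooring is an assumption, not something stated above). `tx_type` is a hypothetical vector of per-transaction customer types.

```r
set.seed(1)
mean_discount <- c(EarlyAdopter = 0, Pragmatists = 0,
                   Conservatives = 0.05, Dealseekers = 0.10)
tx_type <- sample(names(mean_discount), 5000, replace = TRUE,
                  prob = c(.10, .45, .30, .15))

# Look up each transaction's mean by name; rnorm() is vectorised over mean and sd
mu <- unname(mean_discount[tx_type])
discount <- rnorm(length(mu), mean = mu, sd = mu / 5)
discount <- pmax(discount, 0)   # assumption: a discount cannot be negative

tapply(discount, tx_type, mean)
```

With a mean of 0 the standard deviation is also 0, so types without a discount always get exactly 0.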

Next, the trickiest part: I'd like to generate these transactions according to a few rules:

  • Somewhat evenly distributed over days, maybe slightly more during the weekend;
  • Spread out between 2006 and 2014;
  • Each customer's transactions spread out over the years;
  • Customers cannot buy products that haven't been released yet.
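
Two of these rules, the weekend tilt and the release-date constraint, can be sketched as follows. The 1.5x weekend weight and the small `products` table are assumptions chosen for illustration only.

```r
set.seed(1)
all_days <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")

# wday: 0 = Sunday, 6 = Saturday; weekends get an assumed 1.5x weight
is_weekend <- as.POSIXlt(all_days)$wday %in% c(0, 6)
day_weight <- ifelse(is_weekend, 1.5, 1)
purchase_date <- sample(all_days, 2000, replace = TRUE, prob = day_weight)

# Hypothetical product table with release dates inside the same period
products <- data.frame(
  ID = 1:50,
  DateReleased = sort(sample(all_days, 50)))

# For each purchase, choose only among products already released on that date
pick_product <- function(d) {
  released <- products$ID[products$DateReleased <= d]
  if (length(released) == 0) return(NA_integer_)   # nothing on sale yet
  released[sample.int(length(released), 1)]
}
product_id <- vapply(seq_along(purchase_date),
                     function(i) pick_product(purchase_date[i]),
                     integer(1))
```

Purchases made before the first release come back as `NA` and would need to be dropped or redrawn.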

Other parameters:

YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <-  1 # Same question? Likely dependent on YearlyMax

The result for CustomerID 2 would be:

Transactions <- data.frame(
    ID        = c(1,2),
    CustomerID = c(2,2), # The customer that bought the item.
    ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
    DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
    ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
    GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
    Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.    

  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00

I'm getting more and more confident in writing R code, but I'm having difficulty writing code that respects the global parameters (daily distribution of transactions, yearly maximum number of transactions per customer) while keeping the various linkages in line:

  • Timeliness: how quickly people purchase after release
  • ReferredBy: how did this customer arrive at my website?
  • Discount: how much discount the customer has had (to illustrate how sensitive one is to discounts)

This leaves me unsure whether I should write a for loop over the customer table, generating transactions per customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome as well, even though I'm eager to solve this problem in R. I'll keep this post updated as I progress.

My current pseudocode:

  • Assign customer to customer type with sample()
  • Generate Customers$NumBought transactions
  • ... Still thinking?

EDIT: I have generated the transactions table; now I 'just' need to fill it with the right data:

Tr <- data.frame(
  ID = 1:sum(Customers$NumBought),
  CustomerID = NA,
  DateOfPurchase = NA,
  ReferredBy = NA)

Very roughly, set up a database of days, and the number of visits on each day:

days <- data.frame(day=1:8000, customerRate = XtotalNumberOfVisits/8000)   # expected visits per day
# you could change the customerRate to reflect promotions, time since launch, ...
days$nVisits <- rpois(8000, days$customerRate)

Then catalogue the visits:

    visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
    visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
    visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])

Any variable with an X in front of it is a parameter of your process. You'd similarly go on to generate a transactions database by parametrising the relative likelihood amongst the objects available, according to the other columns you have. Or you can generate a visits database including a key to each product available on that day:

   productRelease <- data.frame(id=X, releaseDay=sort(X)) # ie df is sorted by releaseDay
   visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
   visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
   # number of products released so far on each day (assumes the first release is on day 1)
   days$productsAvailable <- rep(1:nrow(productRelease), times=diff(c(productRelease$releaseDay, nrow(days)+1)))
   # one row per (visit, available product) pair
   visits <- visits[rep(1:nrow(visits), times=days$productsAvailable[visits$day]),]
   visits$prodID <- with(visits, ave(id, id, FUN=seq_along))

You can then decide on a function that gives you, for each row, the probability of the customer purchasing that item (based on day, customer, product). Then fill in the purchase with `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.
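
As an illustration only, such a probability function might look like the sketch below. The functional form, the 0.3 ceiling, the 90-day scale and the `typical_lag` values (the Timeliness means converted to days) are all assumptions, not part of the original suggestion.

```r
set.seed(1)
# Hypothetical visit-by-product rows: customer type (1-4) and product age in days
visits <- data.frame(
  customerType = sample(4, 1000, replace = TRUE),
  productAge   = sample(0:720, 1000, replace = TRUE))

# Assumed typical lag (days) between release and purchase, per customer type
typical_lag <- c(30, 180, 360, 360)

# Made-up probability: highest when the product age matches the type's typical lag
purchase_prob <- function(type, age) {
  0.3 * exp(-abs(age - typical_lag[type]) / 90)
}

p <- purchase_prob(visits$customerType, visits$productAge)
visits$didTheyPurchase <- runif(nrow(visits)) < p
mean(visits$didTheyPurchase)
```

Any function returning values between 0 and 1 would work here; the shape of the function determines which pattern ends up hidden in the data.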

Sorry, there are probably typos littered throughout this as I typed it straight in, but hopefully this gives you an idea.

Following Gavin's answer, I solved the issue with the following code:

First instantiate the CustomerTypes:

CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15)   # Probability for being in each group.

Set the parameters for my customer types:

set.seed(1)   # Set seed to make reproducible
Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"),
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of choosing channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10)) # Average Discount incurred when purchasing.

Describe the number of visitors:

TotalVisits <- 20000
NumDays <- 100
StartDate <- as.Date("2009-01-04")
NumProducts <- 100
StartProductRelease <- as.Date("2007-01-04") # As products will be selected based on this, make sure
                                             # we include a few years prior, as people will buy products older than 2 years
AnnualGrowth <- 0.15

Now, as suggested, build a dataset of days. I added DaysSinceStart so it can be used to grow the business over time.

days <- data.frame(
  day            = StartDate+1:NumDays, 
  DaysSinceStart = StartDate+1:NumDays - StartDate,
  CustomerRate = TotalVisits/NumDays)

days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
  as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*1.5)
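
AnnualGrowth is defined above but not applied to the daily rate. One possible way to use it, sketched under the assumption that growth compounds daily and that the expected total over the period should still equal TotalVisits:

```r
set.seed(1)
TotalVisits    <- 20000
NumDays        <- 100
AnnualGrowth   <- 0.15
DaysSinceStart <- 1:NumDays

# Compound the annual growth rate daily, then rescale so the expected
# total over the whole period still equals TotalVisits
growth <- (1 + AnnualGrowth)^(DaysSinceStart / 365)
CustomerRate <- TotalVisits * growth / sum(growth)

nPurchases <- rpois(NumDays, CustomerRate)
```

Over only 100 days the effect is small (roughly 4% between the first and last day); it becomes visible once the simulation spans several years.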

Now build transactions from these days.

Transactions <- data.frame(
  ID           = 1:sum(days$nPurchases),
  Date         = rep(days$day, times=days$nPurchases),
  CustomerType = sample(CustomerTypes, sum(days$nPurchases), replace=TRUE, prob=PropCustTypes),
  NewCustomer  = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
  CustomerID   = NA,
  ProductID = NA,
  ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)

Transactions <- merge(Transactions,Parameters, by="CustomerType") # Append probabilities to table for use in 'sample', haven't found a better way to vlookup?
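
On the vlookup question in the comment above: `match()` is a common base R idiom for this kind of lookup. Unlike `merge()`, which by default sorts the result by the key column and drops unmatched rows, it keeps the transactions in their original order and yields NA for types missing from the lookup table. The small `tx` table below is made up for illustration.

```r
Parameters <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"),
  Timeliness   = c(1, 6, 12, 12),
  Discount     = c(0, 0, 0.05, 0.10))

# Hypothetical transactions to be enriched
tx <- data.frame(
  ID = 1:6,
  CustomerType = c("Pragmatists", "Dealseekers", "EarlyAdopter",
                   "Conservatives", "Pragmatists", "Dealseekers"))

# match() gives, for each transaction, the row of Parameters with the same type
row_in_params <- match(tx$CustomerType, Parameters$CustomerType)
tx$Timeliness <- Parameters$Timeliness[row_in_params]
tx$Discount   <- Parameters$Discount[row_in_params]
tx
```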

Initiate some customers we can choose from when the customer is not new.

Customers <- data.frame(ID=(1:100),
                        CustomerType = sample(CustomerTypes, size=100,
                                              replace=TRUE, prob=PropCustTypes))
Customers$CustomerType <- as.character(Customers$CustomerType)
# Now make a new customer if transaction is with new customer, otherwise choose one with the right type.

Make up a bunch of products to choose from, with evenly divided release dates:

ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
  ID = 1:NumProducts,
  DateReleased = seq(Lower, Upper, length.out=NumProducts),   # Evenly spaced release dates
  SuggestedPrice = rnorm(NumProducts, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice<10] <- 10   # Floor product price at $10

ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")

Now I loop over the newly created Transactions data.frame, choosing from the available products (those released around the purchase date minus the average timeliness in months * 30 days, +/- 15 days). I also assign new customers a new CustomerID, and choose from existing customers if the customer is not new. Other fields are determined by the parameters above.

Start.time <- Sys.time()
for (i in 1:length(Transactions$ID)){
  Current <- Transactions[i,]   # Row being filled in, including the merged parameters

  if (Current$NewCustomer==1){
    NewCustomerID <- max(Customers$ID, na.rm=TRUE)+1
    Customers[NewCustomerID,]$ID <- NewCustomerID
    Transactions[i,]$CustomerID <- NewCustomerID
    Customers[NewCustomerID,]$CustomerType <- Current$CustomerType
  }
  if (Current$NewCustomer==0){
    Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Current$CustomerType,]$ID, 1)
  }
  Transactions[i,]$Discount <- rnorm(1,Current$Discount,Current$Discount/20)
  Transactions[i,]$Timeliness <- rnorm(1,Current$Timeliness, Current$Timeliness/6)
  Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
                               prob=unlist(Current[,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")]))

  CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
  ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
  Candidates <- Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID
  Transactions[i,]$ProductID <- Candidates[sample.int(length(Candidates),1)]   # Safe even with a single candidate
}
Elapsed <- Sys.time()-Start.time

And it's done! Unfortunately it takes about 22 minutes on a dataset of 20,000 products sold in 100 days. Not necessarily a problem, but I'm very much interested in potential improvements.
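
Row-by-row assignment into a data.frame tends to be slow in R, so one direction for improvement is to draw whole columns at once. The sketch below is an approximation of the loop, not a drop-in replacement: it uses a cut-down parameter table and its own small `tx` and `products` tables, treats the +/- 15 day window as continuous rather than calendar-day based, and does not cover the CustomerID assignment.

```r
set.seed(1)
n <- 20000
types <- c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers")
params <- data.frame(
  CustomerType = types,
  Timeliness   = c(1, 6, 12, 12),
  Discount     = c(0, 0, 0.05, 0.10))

tx <- data.frame(
  ID   = 1:n,
  Date = as.Date("2009-01-04") + sample(1:100, n, replace = TRUE),
  CustomerType = sample(types, n, replace = TRUE, prob = c(.10, .45, .30, .15)),
  stringsAsFactors = FALSE)

# One vectorised draw per column instead of one draw per row
p <- params[match(tx$CustomerType, params$CustomerType), ]
tx$Discount   <- rnorm(n, p$Discount,   p$Discount / 20)
tx$Timeliness <- rnorm(n, p$Timeliness, p$Timeliness / 6)

# Products sorted by release date, so findInterval() can locate date windows
products <- data.frame(
  ID = 1:100,
  DateReleased = seq(as.Date("2007-01-04"), as.Date("2009-04-14"),
                     length.out = 100))
rel <- as.numeric(products$DateReleased)

# Window of +/- 15 days around (purchase date - timeliness in months * 30)
centre <- as.numeric(tx$Date) - tx$Timeliness * 30
lo <- findInterval(centre - 15, rel, left.open = TRUE) + 1  # first release >= centre - 15
hi <- findInterval(centre + 15, rel)                        # last release <= centre + 15

# Pick uniformly between lo and hi; NA where the window holds no product
has_product <- hi >= lo
tx$ProductID <- NA_integer_
offset <- floor(runif(sum(has_product)) * (hi - lo + 1)[has_product])
tx$ProductID[has_product] <- products$ID[lo[has_product] + offset]
```

The same pattern (look up parameters with `match()`, then make one vectorised call) also applies to ReferredBy by sampling once per customer type instead of once per transaction.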

