Generating dummy webshop data in R: Incorporating parameters when randomly generating transactions
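For a course I am currently taking, I am trying to build dummy transaction, customer and product datasets to showcase a machine learning use case in a webshop environment as well as a financial dashboard; unfortunately, we have not been given any dummy data. I figured this would be a nice way to improve my R knowledge, but I am running into serious difficulties implementing it.
The idea is that I specify some parameters/rules (arbitrary/fictitious, but suitable for demonstrating a certain clustering algorithm). I am basically trying to hide a pattern and then re-find it using machine learning (not part of this question). The pattern I am hiding is based on the product adoption life cycle, to show how identifying different customer types could be used for targeted marketing.
I will demonstrate what I am looking for. I would like to keep it as realistic as possible. I attempted to do so by drawing the number of transactions per customer and other characteristics from normal distributions; I am completely open to other approaches.
Here is how far I have come. First, build a table of customers: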
# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15) # Probability of being in each group.
set.seed(1) # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000),
CustomerType = sample(CustomerTypes, size=10000,
replace=TRUE, prob=PropCustTypes),
NumBought = rnorm(10000,3,2) # Number of Transactions to Generate, open to alternative solutions?
)
Customers$NumBought[Customers$NumBought < 0] <- 0 # Floor NumBought at 0 (negative draws make no sense)
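On the "open to alternative solutions?" comment above: one possible alternative, sketched here under the assumption that a count distribution is acceptable, is to draw the number of transactions from a Poisson distribution. It yields non-negative integers directly, so no flooring or rounding is needed. The mean of 3 mirrors the rnorm call above; the name NumBoughtPois is made up for this illustration.

```r
set.seed(1)
n_customers <- 10000
# Poisson draws are non-negative integers by construction
NumBoughtPois <- rpois(n_customers, lambda = 3)
summary(NumBoughtPois)
```

Note that a Poisson distribution ties the variance to the mean, so this is not a drop-in match for rnorm(10000, 3, 2); it is only a starting point.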
Next, generate a table of products to choose from:
Products <- data.frame(
ID=(1:50),
DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
SuggestedPrice = rnorm(50, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10 # Floor SuggestedPrice at $10
Products$DateReleased[Products$DateReleased < as.Date("2013-04-10")] <- as.Date("2013-04-10") # Floor release date at 1 year ago
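Now I would like to generate n transactions per customer (n is the NumBought column in the customer table above), based on the following parameters for each variable that is currently relevant.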
Parameters <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # must match CustomerTypes exactly
  BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog = c(0.30, .30, 0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
  stringsAsFactors=FALSE)
Parameters
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4   Dealseekers            0.6             0.05          0.35         12     0.10
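The idea is that 'EarlyAdopters' would have (on average, normally distributed) 10% of their transactions labelled 'BySearchEngine', 60% 'ByDirectCustomer' and 30% 'ByPartnerBlog'. These values need to be mutually exclusive: a single transaction cannot be obtained via both a partner blog and a search engine in the final dataset. The options are: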
ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")
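The mutual exclusivity described above comes for free if each transaction gets exactly one draw from sample() with the type's channel probabilities. A minimal sketch for the EarlyAdopter row only; the names probEarlyAdopter and channels are made up for this illustration.

```r
set.seed(1)
ObtainedBy <- c("SearchEngine", "DirectCustomer", "PartnerBlog")
# Channel probabilities for the EarlyAdopter type, taken from the Parameters table
probEarlyAdopter <- c(0.10, 0.60, 0.30)
# Each draw returns exactly one label, so channels are mutually exclusive per transaction
channels <- sample(ObtainedBy, size = 1000, replace = TRUE, prob = probEarlyAdopter)
# Observed shares should be close to, but not exactly, the target probabilities
prop.table(table(channels))
```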
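Furthermore, I would like to generate a discount variable that is normally distributed around the means above. For simplicity, the standard deviation may be mean/5.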
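A minimal sketch of that discount rule, using the Dealseekers mean of 0.10 from the Parameters table. Flooring at 0 is my own assumption (a negative discount would be a surcharge); meanDiscount and discounts are illustrative names.

```r
set.seed(1)
meanDiscount <- 0.10                  # Dealseekers row of the Parameters table
# Normal draws with sd = mean / 5, as described above
discounts <- rnorm(1000, mean = meanDiscount, sd = meanDiscount / 5)
discounts <- pmax(discounts, 0)       # assumption: floor at 0, no negative discounts
# For customer types with a mean discount of 0 the sd is also 0,
# and rnorm() then simply returns the mean
zeroDiscounts <- rnorm(5, mean = 0, sd = 0)
```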
Next comes the trickiest part for me: I would like to generate these transactions according to a few rules.
Other parameters:
YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <- 1 # Same question? Likely dependent on YearlyMax
The outcome for CustomerID 2 would be:
Transactions <- data.frame(
ID = c(1,2),
CustomerID = c(2,2), # The customer that bought the item.
ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.
Transactions
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00
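I am growing more confident writing R code, but I am having difficulty writing code that respects the global parameters (daily distribution of transactions, yearly maximum number of transactions per customer) while also keeping the various linkages consistent.
As a result I do not know whether I should write a for loop over the customer table, generating transactions per customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome as well, even though I am eager to solve this problem in R. I will update this post as I progress.
My current pseudocode:
EDIT: Generating the transaction table; now I "just" need to fill it with the right data: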
Tr <- data.frame(
ID = 1:sum(Customers$NumBought),
CustomerID = NA,
DateOfPurchase = NA,
ReferredBy = NA,
GrossPrice=NA,
Discount=NA)
Very roughly, set up a database of days, and the number of visits on each day:
days <- data.frame(day=1:8000, customerRate = XtotalNumberOfVisits/8000) # expected visits per day
# you could change the customerRate to reflect promotions, time since launch, ...
days$nVisits <- rpois(8000, days$customerRate)
Then catalogue the visits:
visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])
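Any variable with an X in front of it is a parameter of your process. You would continue generating the transaction database by parameterising the relative likelihood of the available objects according to the other columns you have. Alternatively, you could generate a visit database that includes a key to each of the products available on that day: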
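# X stands for your own product ids and release days (parameters, like the other X-prefixed names)
productRelease <- data.frame(id=X, releaseDay=sort(X)) # ie df is sorted by releaseDay
visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
# Number of products released so far on each day (assumes the first release is on day 1)
days$productsAvailable <- rep(1:nrow(productRelease), times=diff(c(productRelease$releaseDay, nrow(days)+1)))
# Repeat each visit once per product available on its day
visits <- visits[rep(1:nrow(visits), times=days$productsAvailable[visits$day]),]
# Number the products within each visit: 1, 2, ..., productsAvailable
visits$prodID <- ave(visits$id, visits$id, FUN=seq_along)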
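You can then decide on a function that gives you, for each row, the probability of the customer purchasing that item (based on day, customer, product), and fill in the purchases with `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.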
Sorry, there are probably typos in this as I am typing it straight in, but hopefully this gives you the idea.
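The purchase step described above can be sketched as follows. This is only an illustration: the toy visits table and the per-type probabilities in purchaseProbByType are invented values standing in for XmyProbability, which in practice could depend on day and product as well.

```r
set.seed(1)
# Toy visits table; in the answer this comes from the per-day Poisson draws
visits <- data.frame(id = 1:1000,
                     customerType = sample(4, 1000, replace = TRUE))
# Hypothetical purchase probability per customer type (illustrative values)
purchaseProbByType <- c(0.30, 0.20, 0.10, 0.05)
visits$pPurchase <- purchaseProbByType[visits$customerType]
# A uniform draw below the probability counts as a purchase (one Bernoulli trial per row)
visits$didTheyPurchase <- runif(nrow(visits)) < visits$pPurchase
# Observed purchase share per customer type
aggregate(didTheyPurchase ~ customerType, data = visits, FUN = mean)
```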
Following Gavin's suggestion, I solved the problem with the code below.
First, instantiate the CustomerTypes:
require(lubridate)
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15) # Probability for being in each group.
Set the parameters for my customer types:
set.seed(1) # Set seed to make reproducible
Parameters <- data.frame(
CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # must match CustomerTypes, or merge() drops those rows
BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability of choosing channel X
ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
ByPartnerBlog = c(0.30, .30, 0.35, 0.35),
Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
stringsAsFactors=FALSE)
Describe the number of visitors:
TotalVisits <- 20000
NumDays <- 100
StartDate <- as.Date("2009-01-04")
NumProducts <- 100
StartProductRelease <- as.Date("2007-01-04") # As products will be selected based on this, make sure
# we include a few years prior as people will buy products older than 2 years?
AnnualGrowth <- 0.15
Now, as suggested, build a dataset of days. I added DaysSinceStart so that I can use it to grow the business over time.
days <- data.frame(
day = StartDate+1:NumDays,
DaysSinceStart = StartDate+1:NumDays - StartDate,
CustomerRate = TotalVisits/NumDays)
days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*1.5)
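AnnualGrowth is defined above but not used in the code yet. One way it could enter, and this is an assumption on my part rather than something the code above does, is to scale the daily Poisson rate by compound growth over DaysSinceStart. The names BaseRate, GrowthRate and nPurchasesGrowing are made up for this sketch.

```r
set.seed(1)
TotalVisits    <- 20000
NumDays        <- 100
AnnualGrowth   <- 0.15
DaysSinceStart <- 1:NumDays
# Compound growth: the rate is multiplied by (1 + AnnualGrowth) once per 365 days
BaseRate   <- TotalVisits / NumDays
GrowthRate <- BaseRate * (1 + AnnualGrowth)^(DaysSinceStart / 365)
# rpois() is vectorised over lambda, so each day gets its own rate
nPurchasesGrowing <- rpois(NumDays, GrowthRate)
```

With this scaling the expected total is slightly above TotalVisits, because every day after the start has a rate above BaseRate.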
Now build the transactions for these days.
Transactions <- data.frame(
ID = 1:sum(days$nPurchases),
Date = rep(days$day, times=days$nPurchases),
CustomerType = sample(CustomerTypes, sum(days$nPurchases), replace=TRUE, prob=PropCustTypes),
NewCustomer = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
CustomerID = NA,
ProductID = NA,
ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)
Transactions <- merge(Transactions,Parameters, by="CustomerType") # Append probabilities to table for use in 'sample', haven't found a better way to vlookup?
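On the "better way to vlookup" question in the comment above: one common base R idiom is match(), which returns for each transaction the row of the lookup table with the same key. A small sketch with a cut-down Parameters table and a toy transactions table called Tx (both illustrative):

```r
Parameters <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"),
  Timeliness   = c(1, 6, 12, 12),
  Discount     = c(0, 0, 0.05, 0.10),
  stringsAsFactors = FALSE)
Tx <- data.frame(ID = 1:5,
                 CustomerType = c("Pragmatists", "Dealseekers", "EarlyAdopter",
                                  "Dealseekers", "Conservatives"),
                 stringsAsFactors = FALSE)
# match() gives, for each transaction, the row of Parameters with the same type
idx <- match(Tx$CustomerType, Parameters$CustomerType)
Tx$Timeliness <- Parameters$Timeliness[idx]
Tx$Discount   <- Parameters$Discount[idx]
Tx
```

Unlike merge(), this keeps the original row order, and a type with no match shows up as NA instead of silently dropping the row.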
Initialise some customers that we can choose from when a transaction is not with a new customer.
Customers <- data.frame(ID=(1:100),
CustomerType = sample(CustomerTypes, size=100,
replace=TRUE, prob=PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
# Now make a new customer if transaction is with new customer, otherwise choose one with the right type.
Make up a range of products to choose from, and spread out their release dates:
ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
ID = 1:NumProducts,
DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
as.numeric(Upper-Lower),
as.numeric(Upper-Lower)/NumProducts))),
SuggestedPrice = rnorm(NumProducts, 50, 30))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 10 # Cap ProductPrice at 10$
ReferredByOptions <- c("SearchEngine", "DirectCustomer", "PartnerBlog") # same order as the By* probability columns
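Now I loop over the newly created Transactions data.frame, choosing from the products released around the target date (purchase date minus average timeliness in months * 30 days, +/- 15 days). I also assign a new CustomerID to new customers, and pick from the existing customers otherwise. The other fields are determined by the parameters above.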
Start.time <- Sys.time()
for (i in 1:length(Transactions$ID)){
if (Transactions[i,]$NewCustomer==1){
NewCustomerID <- max(Customers$ID, na.rm=T)+1
Customers[NewCustomerID,]$ID = NewCustomerID
Transactions[i,]$CustomerID <- NewCustomerID
Customers[NewCustomerID,]$CustomerType <- Transactions[i,]$CustomerType
}
if (Transactions[i,]$NewCustomer==0){
Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
1,replace=FALSE)
}
Transactions[i,]$Discount <- rnorm(1,Transactions[i,]$Discount,Transactions[i,]$Discount/20)
Transactions[i,]$Timeliness <- rnorm(1,Transactions[i,]$Timeliness, Transactions[i,]$Timeliness/6)
Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
prob=unlist(Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])) # this row's channel probabilities
CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
Elapsed <- Sys.time()-Start.time
length(Transactions$ID)
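And it is done! Unfortunately, it takes about 22 minutes on a dataset of 20,000 products sold over 100 days. That is not necessarily a problem, but I am very interested in potential improvements.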
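One direction that might help, sketched here on a toy table and not benchmarked against the loop: the Discount, Timeliness and ReferredBy steps do not depend on earlier rows, so they can be drawn for all rows at once instead of assigning into Transactions[i,] row by row. Tx, Probs, MeanDiscount, MeanTimeliness and Options are illustrative names, and only two customer types are included for brevity. The CustomerID and ProductID steps are left out.

```r
set.seed(1)
n <- 20000
# Toy stand-in for the merged Transactions table (two types only, for brevity)
Tx <- data.frame(
  CustomerType = sample(c("EarlyAdopter", "Dealseekers"), n, replace = TRUE),
  stringsAsFactors = FALSE)
Probs <- list(EarlyAdopter = c(0.10, 0.60, 0.30),
              Dealseekers  = c(0.60, 0.05, 0.35))
MeanDiscount   <- c(EarlyAdopter = 0, Dealseekers = 0.10)
MeanTimeliness <- c(EarlyAdopter = 1, Dealseekers = 12)
Options <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# rnorm() is vectorised over mean and sd: one call replaces n loop iterations
mu <- MeanDiscount[Tx$CustomerType]
Tx$Discount <- rnorm(n, mu, mu / 20)
mt <- MeanTimeliness[Tx$CustomerType]
Tx$Timeliness <- rnorm(n, mt, mt / 6)

# Sample the referral channel once per customer type instead of once per row
Tx$ReferredBy <- NA_character_
for (type in names(Probs)) {
  rows <- which(Tx$CustomerType == type)
  Tx$ReferredBy[rows] <- sample(Options, length(rows), replace = TRUE,
                                prob = Probs[[type]])
}
```

The remaining loop runs once per customer type, not once per transaction, and the sd ratios (mean/20 and mean/6) are the ones used in the loop above.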