Generating dummy webshop data in R: Incorporating parameters when randomly generating transactions
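For a course I am currently taking, I am trying to build dummy transaction, customer, and product datasets to showcase machine learning use cases in a webshop environment as well as in a financial dashboard; unfortunately, we have not been given dummy data. I figured this would be a nice way to improve my R knowledge, but I am running into serious difficulties implementing it.
The idea is that I specify some parameters/rules (arbitrary/fictitious, but suitable for demonstrating some clustering algorithm). I am basically trying to hide a pattern and then re-find it using machine learning (not part of this question). The pattern I am hiding is based on the product adoption life cycle, to show how different customer types can be identified for targeted marketing purposes.
I will demonstrate what I am looking for. I would like to keep it as realistic as possible. I attempted to do so by drawing the number of transactions per customer and other characteristics from normal distributions; I am completely open to other approaches.
Here is what I have so far, starting with a customer table: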
# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15) # Probability of being in each group.
set.seed(1) # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000),
CustomerType = sample(CustomerTypes, size=10000,
replace=TRUE, prob=PropCustTypes),
NumBought = rnorm(10000,3,2) # Number of Transactions to Generate, open to alternative solutions?
)
Customers$NumBought[Customers$NumBought < 0] <- 0 # Floor NumBought at 0
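On the question of alternatives to `rnorm` for the number of transactions: a count distribution yields neither negative nor fractional values, so no flooring is needed. A minimal sketch, assuming a Poisson with mean 3 or a negative binomial with the same mean (the dispersion `size = 2` is an arbitrary value chosen for illustration):

```r
set.seed(1)
n_customers <- 10000

# Poisson: non-negative integers with mean = variance = 3
num_bought_pois <- rpois(n_customers, lambda = 3)

# Negative binomial: same mean, heavier right tail (a few very active customers)
num_bought_nb <- rnbinom(n_customers, mu = 3, size = 2)

summary(num_bought_pois)
summary(num_bought_nb)
```

Either vector could replace the `NumBought` column; which one looks more realistic depends on how skewed the purchase counts should be.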
Next, generate a table of products to choose from:
Products <- data.frame(
  ID = 1:50,
  DateReleased = rep(as.Date("2012-12-12"), 50) + rnorm(50, 0, 8000),
  SuggestedPrice = rnorm(50, 50, 30))
# Index the column directly so the assignment also works when no row matches
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10 # Floor ProductPrice at $10
Products$DateReleased[Products$DateReleased < as.Date("2013-04-10")] <- as.Date("2013-04-10") # Floor release date at 1 year ago
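Now I would like to generate n transactions per customer (n is the number in the customer table above), based on the following parameters for each of the currently relevant variables.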
Parameters <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # Must match CustomerTypes exactly
  BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog = c(0.30, .30, 0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
  stringsAsFactors=FALSE)
Parameters
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4   Dealseekers            0.6             0.05          0.35         12     0.10
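The idea is that 'EarlyAdopters' would have (on average, normally distributed) 10% of their transactions labelled 'BySearchEngine', 60% 'ByDirectCustomer', and 30% 'ByPartnerBlog'. These values need to be mutually exclusive: a transaction cannot be obtained through both PartnerBlog and SearchEngine in the final dataset. The options are: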
ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")
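In addition, I would like to generate a discount variable that is normally distributed around the means above. For simplicity, the standard deviation can be mean/5.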
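The channel and discount rules described here can be sketched as follows. This is only an illustration: the helper `draw_channel_and_discount` is hypothetical, the parameter table is re-declared as `params` so the snippet runs on its own, and flooring negative discounts at 0 is an added assumption.

```r
set.seed(1)

params <- data.frame(
  CustomerType     = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"),
  BySearchEngine   = c(0.10, 0.40, 0.50, 0.60),
  ByDirectCustomer = c(0.60, 0.30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, 0.30, 0.35, 0.35),
  Discount         = c(0, 0, 0.05, 0.10),
  stringsAsFactors = FALSE
)
channels <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# Hypothetical helper: draw n transactions for one customer type
draw_channel_and_discount <- function(type, n) {
  p <- params[params$CustomerType == type, ]
  probs <- unlist(p[c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
  data.frame(
    CustomerType = rep(type, n),
    # sample() returns exactly one channel per transaction,
    # so the channels are mutually exclusive by construction
    ReferredBy = sample(channels, n, replace = TRUE, prob = probs),
    # normal discount with sd = mean / 5, floored at 0 (assumption)
    Discount = pmax(rnorm(n, mean = p$Discount, sd = p$Discount / 5), 0),
    stringsAsFactors = FALSE
  )
}

early <- draw_channel_and_discount("EarlyAdopter", 1000)
round(prop.table(table(early$ReferredBy)), 2)
```

With enough draws, the observed channel shares should approach the probabilities in the parameter table.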
Next comes the trickiest part for me: I would like to generate these transactions according to some rules:
Other parameters:
YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <- 1 # Same question? Likely dependent on YearlyMax
The result for CustomerID 2 would be:
Transactions <- data.frame(
ID = c(1,2),
CustomerID = c(2,2), # The customer that bought the item.
ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.
Transactions
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00
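I am getting more confident at writing R code, but I am having difficulty writing code that keeps track of the global parameters (the daily distribution of transactions, the yearly maximum number of transactions per customer) as well as the various linkages:
This leaves me unsure whether I should write a for loop over the customer table and generate the transactions per customer, or whether I should take a different route. Any contribution is greatly appreciated. Alternative dummy datasets are welcome as well, even though I am eager to solve this problem in R. I will keep this post updated as I progress.
My current pseudocode:
EDIT: Generated the transactions table; now I just need to fill it with the right data: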
Tr <- data.frame(
ID = 1:sum(Customers$NumBought),
CustomerID = NA,
DateOfPurchase = NA,
ReferredBy = NA,
GrossPrice=NA,
Discount=NA)
Very roughly, set up a database of days and the number of visits on each day:
days <- data.frame(day = 1:8000, customerRate = XtotalNumberOfVisits / 8000) # expected visits per day
# you could change the customerRate to reflect promotions, time since launch, ...
days$nVisits <- rpois(8000, days$customerRate)
Then catalogue the visits:
visits <- data.frame(id = 1:sum(days$nVisits), day = rep(days$day, times = days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace = TRUE, prob = XmyWeights)
visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])
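Any variable with an X in front of it is a parameter of the process. Depending on which other columns you have, you can go on to generate the transaction database by parameterising the relative likelihoods between the available objects. Alternatively, you can generate a database of visits that includes a key for each product available on that day: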
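productRelease <- data.frame(id = X, releaseDay = sort(X)) # X: placeholders; df is sorted by releaseDay, assumes the first release is on day 1
visits <- data.frame(id = 1:sum(days$nVisits), day = rep(days$day, times = days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace = TRUE, prob = XmyWeights)
# number of products already released on each day
days$productsAvailable <- rep(1:nrow(productRelease), times = diff(c(productRelease$releaseDay, nrow(days) + 1)))
# repeat each visit once per product available on the day of the visit
visits <- visits[rep(1:nrow(visits), times = days$productsAvailable[visits$day]), ]
# number the products within each visit: 1, 2, ..., productsAvailable
visits$prodID <- ave(rep(1L, nrow(visits)), visits$id, FUN = cumsum)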
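You can then decide on a function that gives, for each row, the probability that the customer buys that item (based on the day, the customer, and the product). Then fill in the purchases with `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.
Sorry, there may be typos since I am typing this straight in, but hopefully it gives you the idea.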
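As an illustration of that last step, the purchase probability can be any function of the row's columns. The sketch below assumes a made-up rule: the probability peaks when the product's age matches the age preferred by the customer type and falls off as a Gaussian around it. `purchase_prob`, `preferred_age_days`, and the toy `visits` table are hypothetical and not part of the answer above.

```r
set.seed(1)

# Toy visit-by-product table: one row per (visit, available product)
visits <- data.frame(
  customerType   = sample(4, 5000, replace = TRUE, prob = c(0.10, 0.45, 0.30, 0.15)),
  productAgeDays = sample(0:720, 5000, replace = TRUE)
)

# Assumed preferred product age per customer type (1, 6, 12, 12 months)
preferred_age_days <- c(30, 180, 360, 360)

# Hypothetical rule: Gaussian bump around the preferred age, maximum probability p_max
purchase_prob <- function(type, age, width = 60, p_max = 0.3) {
  p_max * exp(-0.5 * ((age - preferred_age_days[type]) / width)^2)
}

visits$pPurchase <- purchase_prob(visits$customerType, visits$productAgeDays)
# One Bernoulli draw per row: TRUE with probability pPurchase
visits$didTheyPurchase <- runif(nrow(visits)) < visits$pPurchase

# Average age of purchased products per customer type
aggregate(productAgeDays ~ customerType,
          data = visits[visits$didTheyPurchase, ], FUN = mean)
```

Because the hidden pattern lives entirely in `purchase_prob`, changing that one function changes what a clustering algorithm can later recover.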
Following Gavin's answer, I solved the problem with the following code:
First, instantiate the CustomerTypes:
require(lubridate)
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15) # Probability for being in each group.
Set the parameters for my customer types:
set.seed(1) # Set seed to make reproducible
Parameters <- data.frame(
CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # Must match CustomerTypes exactly for the merge below
BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability of choosing channel X
ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
ByPartnerBlog = c(0.30, .30, 0.35, 0.35),
Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
stringsAsFactors=FALSE)
Describe the number of visitors:
TotalVisits <- 20000
NumDays <- 100
StartDate <- as.Date("2009-01-04")
NumProducts <- 100
StartProductRelease <- as.Date("2007-01-04") # As products will be selected based on this, make sure
# we include a few years prior as people will buy products older than 2 years?
AnnualGrowth <- 0.15
Now, as suggested, build a dataset of days. I added DaysSinceStart so that it can be used to grow the business.
days <- data.frame(
day = StartDate+1:NumDays,
DaysSinceStart = StartDate+1:NumDays - StartDate,
CustomerRate = TotalVisits/NumDays)
days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*1.5)
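In the code above, `AnnualGrowth` and `DaysSinceStart` are declared but do not yet enter the rate. One way to let the business grow, sketched here under the assumption of compound growth, is to scale the daily Poisson rate. The table `days_growth` is a hypothetical stand-in for `days`, re-declared so the snippet runs on its own.

```r
set.seed(1)
TotalVisits  <- 20000
NumDays      <- 100
AnnualGrowth <- 0.15
StartDate    <- as.Date("2009-01-04")

days_growth <- data.frame(day = StartDate + 1:NumDays)
days_growth$DaysSinceStart <- as.numeric(days_growth$day - StartDate)

# Compound growth: the rate is multiplied by (1 + AnnualGrowth) every 365 days
growth <- (1 + AnnualGrowth)^(days_growth$DaysSinceStart / 365)

# Rescale so the expected total over the period is still TotalVisits
days_growth$CustomerRate <- TotalVisits * growth / sum(growth)
days_growth$nPurchases   <- rpois(NumDays, days_growth$CustomerRate)

head(days_growth)
```

Over only 100 days the effect is small; it becomes visible once the simulated period spans several years.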
Now build the transactions for these days.
Transactions <- data.frame(
ID = 1:sum(days$nPurchases),
Date = rep(days$day, times=days$nPurchases),
CustomerType = sample(CustomerTypes, sum(days$nPurchases), replace=TRUE, prob=PropCustTypes),
NewCustomer = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
CustomerID = NA,
ProductID = NA,
ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)
Transactions <- merge(Transactions,Parameters, by="CustomerType") # Append probabilities to table for use in 'sample', haven't found a better way to vlookup?
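On the "vlookup" remark: `merge` works, but it reorders the rows by the key. A common base-R alternative is to index the parameter table with `match()`, which keeps the original row order. A small self-contained sketch with toy data (`tx` and `params` are stand-ins for `Transactions` and `Parameters`):

```r
params <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"),
  Timeliness   = c(1, 6, 12, 12),
  Discount     = c(0, 0, 0.05, 0.10),
  stringsAsFactors = FALSE
)
tx <- data.frame(
  ID = 1:5,
  CustomerType = c("Pragmatists", "Dealseekers", "EarlyAdopter",
                   "Pragmatists", "Conservatives"),
  stringsAsFactors = FALSE
)

# For every transaction, match() gives the row of params with the same CustomerType
idx <- match(tx$CustomerType, params$CustomerType)
tx <- cbind(tx, params[idx, c("Timeliness", "Discount")])
rownames(tx) <- NULL
tx
```

A type that has no counterpart in the parameter table shows up as `NA` here instead of silently disappearing, as it would with the default inner join of `merge`.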
Seed some customers that we can choose from when a transaction is not with a new customer.
Customers <- data.frame(ID=(1:100),
CustomerType = sample(CustomerTypes, size=100,
replace=TRUE, prob=PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
# Now make a new customer if transaction is with new customer, otherwise choose one with the right type.
Make up a range of products to choose from, with their release dates spread out:
ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
ID = 1:NumProducts,
DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
as.numeric(Upper-Lower),
as.numeric(Upper-Lower)/NumProducts))),
SuggestedPrice = rnorm(NumProducts, 50, 30))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 10 # Cap ProductPrice at 10$
ReferredByOptions <- c("SearchEngine", "DirectCustomer", "PartnerBlog") # Same order as the probability columns
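Now I loop over the newly created Transactions data.frame, choosing from the available products (measured as the purchase date minus the average timeliness in months * 30 days, +/- 15 days). I also assign a new CustomerID to new customers and, if the customer is not new, pick one from the existing customers. The other fields are determined by the parameters above.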
Start.time <- Sys.time()
for (i in 1:length(Transactions$ID)){
  if (Transactions[i,]$NewCustomer==1){
    NewCustomerID <- max(Customers$ID, na.rm=TRUE)+1
    Customers[NewCustomerID,]$ID = NewCustomerID
    Transactions[i,]$CustomerID <- NewCustomerID
    Customers[NewCustomerID,]$CustomerType <- Transactions[i,]$CustomerType
  }
  if (Transactions[i,]$NewCustomer==0){
    Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
                                          1,replace=FALSE)
  }
  Transactions[i,]$Discount <- rnorm(1,Transactions[i,]$Discount,Transactions[i,]$Discount/20)
  Transactions[i,]$Timeliness <- rnorm(1,Transactions[i,]$Timeliness, Transactions[i,]$Timeliness/6)
  # Channel probabilities for this row (columns appended by the merge above)
  Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
                                        prob=unlist(Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")]))
  CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
  ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
  Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
Elapsed <- Sys.time()-Start.time
length(Transactions$ID)
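And it's done! Unfortunately, it takes about 22 minutes on a dataset of 20,000 products sold over 100 days. That is not necessarily a problem, but I am very interested in potential improvements.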
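Much of the run time is likely spent in the row-by-row `Transactions[i,]$... <-` assignments, each of which modifies the whole data frame. The draws that do not depend on earlier rows can be vectorised. The sketch below is not a drop-in replacement: `tx` and `products` are toy stand-ins, the sequential customer-ID assignment is left out, and the product choice uses a swapped-in technique (nearest release date to a jittered target, found with `findInterval`) rather than uniform sampling within the +/- 15-day window.

```r
set.seed(1)
n <- 20000

# Toy stand-ins for the merged Transactions table and the Products table
tx <- data.frame(
  Date       = as.Date("2009-01-04") + sample(1:100, n, replace = TRUE),
  Timeliness = sample(c(1, 6, 12), n, replace = TRUE),
  Discount   = sample(c(0, 0.05, 0.10), n, replace = TRUE)
)
products <- data.frame(
  ID = 1:100,
  DateReleased = as.Date("2007-01-04") + round(seq(8.3, 830, length.out = 100))
)

# One vectorised call per column instead of one rnorm() call per row
tx$Discount   <- rnorm(n, tx$Discount, tx$Discount / 20)
tx$Timeliness <- rnorm(n, tx$Timeliness, tx$Timeliness / 6)

# Target release date (as day numbers), jittered by +/- 15 days
target <- as.numeric(tx$Date) - tx$Timeliness * 30 + runif(n, -15, 15)
rel    <- as.numeric(products$DateReleased)  # sorted release dates

# Index of the last product released on or before the target, clamped to a valid range
lo   <- pmin(pmax(findInterval(target, rel), 1), length(rel) - 1)
# Pick whichever neighbour (lo or lo + 1) is closer to the target
pick <- ifelse(target - rel[lo] <= rel[lo + 1] - target, lo, lo + 1)
tx$ProductID <- products$ID[pick]

head(tx)
```

Wrapping both versions in `system.time()` would show how much of the 22 minutes this removes on a given machine.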