
Cluster Analysis using R for large data sample

I am just starting out with using R to segment a customer database that I have for an ecommerce retail business. I am looking for guidance on the best approach for this exercise. I have searched the topics already posted here and tried the suggestions myself, such as dist() and hclust(). However, I keep running into one issue or another and cannot get past them, since I am new to R. Here is a brief description of my problem. I have approximately 480K records of customers who have bought so far. The data contains the following columns:

  • email id
  • gender
  • city
  • total transactions so far
  • average basket value
  • average basket size (number of items purchased in one transaction)
  • average discount claimed per transaction
  • number of days since the user first purchased
  • average duration between two purchases
  • number of days since last transaction

The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance on how to do this successfully without running into problems such as the size of the sample or the data types of the columns?

Read up on how to subset data frames. When you try to define d, it looks like you're providing way too much data, which might be fixed by subsetting your table first. If not, you might want to take a random sample of your data instead of using all of it. Suppose you know that columns 4 through 10 of your data frame, called cust_data, contain numerical data; then you might try this:

cust_data2 <- cust_data[, 4:10]
d <- dist(cust_data2)

For large values, you may want to log transform them; just experiment and see what makes sense. I really am not sure about this, and that's just a suggestion. Maybe choosing a more appropriate clustering method or distance metric would be better.
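As a rough illustration of the log-transform suggestion, here is a minimal sketch on synthetic data. The data frame cust_num, its column names and the distributions are all assumptions made up for the example, not the asker's real table. It also standardises the columns, which is a common companion step before computing Euclidean distances, though not something the answer itself prescribes.

```r
# Synthetic stand-in for a few numeric columns (names and distributions are assumptions).
set.seed(1)
cust_num <- data.frame(
  total_transactions = rpois(1000, 3) + 1,
  avg_basket_value   = rlnorm(1000, meanlog = 4, sdlog = 1),
  days_since_last    = sample(0:720, 1000, replace = TRUE)
)

# log1p(x) = log(1 + x) stays defined at zero, which helps with count-like columns.
cust_log <- cust_num
cust_log$avg_basket_value   <- log1p(cust_log$avg_basket_value)
cust_log$total_transactions <- log1p(cust_log$total_transactions)

# Centre each column to mean 0 and scale it to standard deviation 1,
# so that no single column dominates the distance calculation.
cust_scaled <- scale(cust_log)

summary(cust_scaled)
```

Whether the transform helps depends on how skewed each column is, so comparing histograms before and after is a reasonable first check.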

Finally, when you run hclust, you need to pass in the distance object d, not the original data set.

h <- hclust(d, method = "average")
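The random-sample idea can be sketched end to end as follows. dist() stores n(n-1)/2 pairwise distances, so for 480K rows that is about 1.15e11 doubles, roughly 900 GB, which is why the full table will not fit. The synthetic cust_data2, the sample size of 2000 and the choice of k = 5 clusters are all assumptions for illustration only.

```r
# Synthetic stand-in for the numeric customer columns (an assumption, not real data).
set.seed(42)
n <- 20000
cust_data2 <- data.frame(
  total_transactions = rpois(n, 3) + 1,
  avg_basket_value   = rlnorm(n, meanlog = 4, sdlog = 1),
  days_since_last    = sample(0:720, n, replace = TRUE)
)

# Draw a random sample of rows without replacement; 2000 is an arbitrary size
# that keeps the distance object small (about 2 million entries).
sample_size <- 2000
idx <- sample(nrow(cust_data2), sample_size)
cust_sample <- scale(cust_data2[idx, ])

d <- dist(cust_sample)                # Euclidean distance by default
h <- hclust(d, method = "average")    # average linkage, as in the answer

# Cut the dendrogram into k groups; k = 5 is a placeholder to tune.
segment <- cutree(h, k = 5)
table(segment)
```

The segment vector lines up with the sampled rows, so cust_data2[idx, ] can be summarised per segment to see what distinguishes each group.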

Sadly, your data does not contain any attributes that indicate what types of items or transactions did NOT result in a sale.

I am not sure if clustering is the way to go here.

Here are some ideas:

First, split your data into a training set (say 70%) and a test set.
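A minimal sketch of such a 70/30 split, using a synthetic custdata whose columns are assumptions for illustration:

```r
# Synthetic customer table (column names and values are assumptions).
set.seed(123)
custdata <- data.frame(
  averagebasketvalue = rlnorm(1000, meanlog = 4, sdlog = 1),
  totaltransactions  = rpois(1000, 3) + 1,
  dayssincelast      = sample(0:720, 1000, replace = TRUE)
)

# Pick 70% of the row indices at random for training; the rest form the test set.
train_idx <- sample(nrow(custdata), size = round(0.7 * nrow(custdata)))
trainset  <- custdata[train_idx, ]
testset   <- custdata[-train_idx, ]

c(train = nrow(trainset), test = nrow(testset))
```

Fixing the seed with set.seed() makes the split reproducible, which helps when comparing models later.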

Set up a linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.

fit <- lm(averagebasketvalue ~ ., data = custdata)

Run the model on the training set, determine the significant attributes (those with at least one star in the summary(fit) output, i.e. p < 0.05), then focus on those variables.
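Rather than reading the stars by eye, the p-values can be pulled out of the summary programmatically. The sketch below uses a synthetic trainset in which two predictors genuinely drive the response and one is pure noise; all names and coefficients are assumptions for illustration.

```r
# Synthetic training data (an assumption): two real effects plus a noise column.
set.seed(7)
n <- 500
trainset <- data.frame(
  totaltransactions = rpois(n, 3) + 1,
  avgdiscount       = runif(n, 0, 30),
  noise             = rnorm(n)
)
trainset$averagebasketvalue <- 50 + 4 * trainset$totaltransactions -
  0.8 * trainset$avgdiscount + rnorm(n, sd = 5)

fit <- lm(averagebasketvalue ~ ., data = trainset)

# The coefficient table has one row per term; "Pr(>|t|)" holds the p-values.
coefs <- summary(fit)$coefficients
pvals <- coefs[, "Pr(>|t|)"]

# One star in the printed summary corresponds to p < 0.05.
significant <- setdiff(names(pvals)[pvals < 0.05], "(Intercept)")
significant
```

With many predictors, some will cross the 0.05 threshold by chance, so the list is better treated as a starting point than as a final selection.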

Check your fitted model on the test set by calculating R-squared and the sum of squared errors (SSE) there. You can use the predict() function; the calls will look like

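fitpred <- predict(fit, newdata = testset)
sse <- sum((testset$averagebasketvalue - fitpred)^2)  # sum of squared errors on the test set
sst <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
r2 <- 1 - sse / sst  # test-set R-squared; summary(fitpred) would only summarise the predictions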

Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
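One possible way to build such a CityClass attribute is to bin cities by how many customers they have. That criterion, the thresholds and the toy data below are all assumptions; population or region might suit your cities better.

```r
# Toy customer table (values are made up for illustration).
custdata <- data.frame(
  emailid = paste0("user", 1:8, "@example.com"),
  gender  = c("F", "M", "F", "F", "M", "M", "F", "M"),
  city    = c("A", "A", "A", "A", "B", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Count customers per city, then look that count up for every row.
city_counts <- table(custdata$city)
n_in_city   <- as.integer(city_counts[custdata$city])

# Bin the counts; the break points (1 and 3) are arbitrary for this toy example.
custdata$CityClass <- cut(n_in_city,
                          breaks = c(0, 1, 3, Inf),
                          labels = c("SmallCity", "MediumCity", "BigCity"))

# Drop the identifier column and make gender a factor so lm() treats it as categorical.
custdata$emailid <- NULL
custdata$gender  <- factor(custdata$gender)

custdata
```

After this step the raw city column can be left out of the model formula in favour of CityClass.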

This can go on for a while... play with the model to try to get a better R-squared and SSE.

I think a tree-based model (rpart) might also work well here.
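A minimal sketch of the tree-based alternative, assuming the rpart package is installed (it ships with standard R distributions). The synthetic data, with a step change in basket value at a discount of 15, is an assumption chosen so the tree has something to find.

```r
library(rpart)

# Synthetic data (an assumption): basket value drops when the discount exceeds 15.
set.seed(11)
n <- 1000
custdata <- data.frame(
  totaltransactions = rpois(n, 3) + 1,
  avgdiscount       = runif(n, 0, 30)
)
custdata$averagebasketvalue <-
  ifelse(custdata$avgdiscount > 15, 40, 80) + rnorm(n, sd = 5)

# 70/30 train/test split.
train_idx <- sample(n, 700)

# method = "anova" fits a regression tree for a numeric response.
tree <- rpart(averagebasketvalue ~ ., data = custdata[train_idx, ],
              method = "anova")

# Evaluate on the held-out rows with the same SSE and R-squared measures.
pred   <- predict(tree, newdata = custdata[-train_idx, ])
actual <- custdata$averagebasketvalue[-train_idx]
sse    <- sum((actual - pred)^2)
r2     <- 1 - sse / sum((actual - mean(actual))^2)

print(tree)
```

Each leaf of the printed tree describes a group of customers by simple rules on the predictors, which can itself serve as a rough segmentation.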

Then you might move on to cluster analysis at a later time.

