Regression in R iteratively by levels in categorical variable

So I have a small data set which should be great for modeling (<1 million records), but one variable is giving me problems. It's a categorical variable with ~98 levels called [store] - this is the name of each store. I am trying to predict each store's sales [sales], which is a continuous numeric variable. The resulting vector is over 10GB, and R crashes with memory errors. Is it possible to make 98 different regression equations and run them one by one for every level of [store]?

My other idea would be to create 10 or 15 clusters of this [store] variable, then use the cluster names as my categorical variable when predicting the continuous [sales] variable.

Sure, this is a pretty common type of analysis. For instance, here is how you would split up the iris dataset by the Species variable and then build a separate model predicting Sepal.Width from Sepal.Length in each subset:

data(iris)
models <- lapply(split(iris, iris$Species), function(df) lm(Sepal.Width~Sepal.Length, data=df))

The result is a list of the species-specific regression models, with one element per level of Species, named after that level.
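To see what that list contains, a short sketch like the following can pull the coefficients and R-squared out of each fitted model. The names `coefs` and `r2` are just illustrative choices, not part of the original answer.

```r
data(iris)
models <- lapply(split(iris, iris$Species),
                 function(df) lm(Sepal.Width ~ Sepal.Length, data = df))

# One row per species, one column per coefficient
coefs <- t(sapply(models, coef))
print(coefs)

# R-squared of each species-specific fit
r2 <- sapply(models, function(m) summary(m)$r.squared)
print(r2)
```

Because the list is named by species, a single model can also be retrieved directly, e.g. `models[["setosa"]]`.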

To predict, I think it would be most efficient to first split your test set, then call the corresponding prediction function on each subset, and finally recombine:

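test.iris <- iris
test.spl <- split(test.iris, test.iris$Species)
# Look up each subset's model by name; as.character() avoids indexing
# the list by the factor's underlying integer code
predictions <- unlist(lapply(test.spl, function(df) {
  predict(models[[as.character(df$Species[1])]], newdata=df)
}))
test.ordered <- do.call(rbind, test.spl)  # Test obs. in same order as predictions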
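Applied to the setting in the question, the same split / fit / predict pattern might look roughly like the sketch below. Everything here is an assumption for illustration: the data are synthetic, `price` is a hypothetical predictor, and only five stores are simulated. To keep memory use down, it stores only each store's coefficients rather than the full lm objects, which also carry residuals, fitted values and the model frame.

```r
set.seed(1)

# Synthetic stand-in for the real data; `price` is a hypothetical predictor
n <- 2000
sales_df <- data.frame(
  store = factor(sample(paste0("store_", 1:5), n, replace = TRUE)),
  price = runif(n, 1, 10)
)
sales_df$sales <- 100 - 3 * sales_df$price + rnorm(n)

# Fit one small model per store and keep only its coefficients
store_coefs <- lapply(split(sales_df, sales_df$store), function(df) {
  coef(lm(sales ~ price, data = df))
})

# Stack into a matrix with one row per store (row names are the store names)
coef_mat <- do.call(rbind, store_coefs)

# Predict each row from the coefficients of its own store
idx <- as.character(sales_df$store)
sales_df$pred <- unname(coef_mat[idx, "(Intercept)"] +
                        coef_mat[idx, "price"] * sales_df$price)
```

Each individual fit only ever sees the rows of one store, so no design matrix with a dummy column per store is built.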

Of course, for your problem you'll need to decide how to subset the data. One reasonable approach would be clustering with something like kmeans and then passing the cluster of each point to the split function.
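A minimal sketch of that idea on iris, assuming (purely for illustration) that the clusters are formed from the two petal measurements and that three clusters are wanted:

```r
data(iris)
set.seed(42)  # kmeans starts from random centers

# Cluster on two columns not used in the regression (an illustrative choice)
km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3, nstart = 10)

# km$cluster holds one cluster label per row, so it can go straight to split()
cluster.models <- lapply(split(iris, km$cluster), function(df) {
  lm(Sepal.Width ~ Sepal.Length, data = df)
})

# Coefficients of the model fitted within each cluster
print(sapply(cluster.models, coef))
```

For the store data, one option would be to cluster store-level summaries (one row per store) and then map each record to its store's cluster before calling split; which features to cluster on is a modeling decision this sketch does not settle.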
