R - Analyse impact of categorical variables on continuous variable

I'm trying to analyze a data set in R that contains sales of items over time, and I want to understand the impact of categorical variables on the quantity sold.

library("data.table")

qty <- c(100,10000,100,200,150,9000)
flavour <- c("Mint","Herb","Mint","Mint","Herb","Fruit")
category <- c("Multiple","Multiple","White","Multiple","Other","White")

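# stringsAsFactors = TRUE is needed on R >= 4.0 to get the Factor columns shown in the str() output
sales_data <- data.frame(qty, flavour, category, stringsAsFactors = TRUE)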

str(sales_data)

'data.frame':   6 obs. of  3 variables:
 $ qty     : num  100 10000 100 200 150 9000
 $ flavour : Factor w/ 3 levels "Fruit","Herb",..: 3 2 3 3 2 1
 $ category: Factor w/ 3 levels "Multiple","Other",..: 1 1 3 1 2 3

I've been looking at multiple regression and simple linear regression, but I feel I might be on the wrong track. My understanding is that simple linear regression determines the relationship between two continuous variables. I can see there is a way to use multiple regression to relate categorical variables to a continuous one, but the examples I've found seem to stop at binary values, for example whether someone smokes or not. Given that each of my categorical variables has more than two values, is multiple regression the right way to go, or have I gone completely off track?

My actual data set has around 10 categorical variables, some relating to location and others to brands.

Any help would be greatly appreciated. Apologies if this is in the wrong place or I've missed something obvious. I'm learning stats and R at the same time, so I get confused quickly.

You can certainly have a continuous dependent variable (qty) and a mix of continuous and categorical predictors, and the categorical ones don't have to be binary. The categorical variables should be of class "factor". For the two categorical/factor variables shown in the question:

# "." on the right-hand side means "every other column in sales_data"
fm <- lm(qty ~ ., data = sales_data)
summary(fm)
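
To see why multi-level factors are not a problem, it may help to look at the design matrix that lm() builds from the formula. The sketch below rebuilds the question's toy data and inspects that matrix. It assumes R's default treatment contrasts, under which each factor with k levels becomes k - 1 indicator (0/1) columns and the alphabetically first level serves as the baseline. The names X and sales_mint are illustrative only and do not come from the original post.

```r
# Toy data from the question; stringsAsFactors = TRUE makes the
# character columns factors (the default changed to FALSE in R 4.0).
qty <- c(100, 10000, 100, 200, 150, 9000)
flavour <- c("Mint", "Herb", "Mint", "Mint", "Herb", "Fruit")
category <- c("Multiple", "Multiple", "White", "Multiple", "Other", "White")
sales_data <- data.frame(qty, flavour, category, stringsAsFactors = TRUE)

# The design matrix lm() uses internally: one intercept column plus
# k - 1 indicator columns per factor (default treatment contrasts).
X <- model.matrix(qty ~ ., data = sales_data)
colnames(X)
# "(Intercept)" "flavourHerb" "flavourMint" "categoryOther" "categoryWhite"

# Each coefficient is a difference from the baseline level
# ("Fruit" for flavour, "Multiple" for category).
fm <- lm(qty ~ ., data = sales_data)
names(coef(fm))

# Hypothetical variation: choose a different baseline with relevel().
sales_mint <- sales_data
sales_mint$flavour <- relevel(sales_mint$flavour, ref = "Mint")
fm_mint <- lm(qty ~ ., data = sales_mint)
names(coef(fm_mint))
```

With only six rows and five coefficients, this toy fit leaves a single residual degree of freedom, so the estimates illustrate the mechanics rather than support any conclusion. On the real data set, anova(fm) can additionally report one sequential F-test row per factor rather than one row per level.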
