Predicting categorical variables using continuous and categorical variables
I have a set of tree plot data that looks like this (a mix of categorical and continuous variables):
Climate Species Average_size Canopy_cover Structure
Hot Pine 12.3 10% open
Cold Spruce 15.6 65% closed
Cold Fir 19.2 43% closed
I have a second dataset for which I am trying to predict "Structure" (a categorical variable):
Climate Species Average_size Canopy_cover Structure
Hot Pine 20.4 90% ?
Cold Spruce 18.9 54% ?
Hot Fir 26.4 28% ?
Since I am predicting a categorical variable, I have tried using ANOVA and predict, with no luck. Am I on the right track?
aov1 <- aov(Structure ~ Canopy_cover + Average_size + Species + Climate, data = df)
predict(aov1, data.frame(Canopy_cover = 90, Average_size = 20.4, Species = "Pine", Climate = "Hot"))
A couple of things with this. First, your variable Canopy_cover will be read as a character variable (as it is presented above). You likely want this as a continuous, numeric variable instead (see below for how to modify). The larger problem here is trying to model a categorical response using ANOVA, which is essentially a wrapper around linear regression. Linear regression requires a continuous response. From what I can tell, your response variable takes two values, open or closed, so one approach is to use logistic regression. You will need to first convert Structure to either 1 or 0.
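As a minimal base-R sketch of those two conversions (the vectors are typed in from the question's table, and the names target and cover are only illustrative):

```r
# Values copied from the question's training table
structure    <- c("open", "closed", "closed")
canopy_cover <- c("10%", "65%", "43%")

# Code "open" as 1 and "closed" as 0
target <- as.integer(structure == "open")

# Strip the percent sign, then convert the remaining text to numeric
cover <- as.numeric(sub("%", "", canopy_cover, fixed = TRUE))

print(target)
print(cover)
```

Which level is coded as 1 is a free choice; it only determines whether the model's probabilities refer to "open" or to "closed".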
Load your data and modify it so that "open" is coded as 1 and "closed" is coded as 0, and convert cover to numeric:
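library(tibble)  # tribble()
library(dplyr)   # %>%, mutate(), case_when()

df1 <- tribble(
  ~climate, ~species, ~size, ~cover, ~structure,
  "hot",    "pine",   12.3,  "10%",  "open",
  "cold",   "spruce", 15.6,  "65%",  "closed",
  "cold",   "fir",    19.2,  "43%",  "closed"
) %>%
  mutate(
    # 1 = "open", 0 = "closed"
    target = case_when(
      structure == "open" ~ 1,
      TRUE ~ 0
    ),
    # strip the percent sign and convert to numeric
    cover = as.numeric(gsub("%", "", cover))
  )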
Do the same for your test data.
df2 <- tribble(
  ~climate, ~species, ~size, ~cover,
  "hot",    "pine",   20.4,  "90%",
  "cold",   "spruce", 18.9,  "54%",
  "hot",    "fir",    26.4,  "28%"
) %>%
  mutate(cover = as.numeric(gsub("%", "", cover)))
Fit a logistic regression model with df1:
fit <- glm(target ~ climate + species + size + cover, family = "binomial", data = df1)
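For intuition about what a binomial glm fits, here is a small sketch on simulated data. Everything in it is made up for illustration (the data frame sim, the coefficients 3 and -0.06, and the sample size); it is not the question's data. The model is linear on the log-odds scale, and the logistic function maps those log-odds to probabilities.

```r
set.seed(1)

# Simulated plots: made-up values, NOT the question's data
n <- 200
sim <- data.frame(
  size  = runif(n, 10, 30),
  cover = runif(n, 0, 100)
)

# Assumption for the simulation: denser canopy makes "open" (1) less likely
p_open <- plogis(3 - 0.06 * sim$cover)
sim$target <- rbinom(n, 1, p_open)

fit_sim <- glm(target ~ size + cover, family = binomial, data = sim)

eta  <- predict(fit_sim, type = "link")      # linear predictor (log-odds)
prob <- predict(fit_sim, type = "response")  # probabilities

# The response scale is the logistic transform of the link scale
same_scale <- isTRUE(all.equal(unname(plogis(eta)), unname(prob)))
print(same_scale)
```

This is why type = "response" is the natural choice when you want a probability of "open" rather than log-odds.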
Predict using df2:
predict(fit, df2, type = "response")
Which gives the predicted probabilities below. R also warns that the fit is rank-deficient: three rows cannot identify the six coefficients this formula asks for. I assume this won't be the case with real data.
1 2 3
1.000000e+00 5.826215e-11 1.000000e+00
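Since the goal is a predicted category rather than a probability, one way to finish is to threshold the probabilities. This is a sketch only: the 0.5 cutoff and the name predicted_structure are assumptions, and the probabilities are copied from the output above, where 1 was coded as "open".

```r
# Probabilities copied from the output above; target = 1 means "open"
prob <- c(1.000000e+00, 5.826215e-11, 1.000000e+00)

# The 0.5 cutoff is an assumption; adjust it if one kind of error is costlier
cutoff <- 0.5
predicted_structure <- ifelse(prob > cutoff, "open", "closed")

print(predicted_structure)
```

With these three probabilities the labels come out as open, closed, open, though with only three training rows they should not be trusted.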