
Predicting categorical variables using continuous and categorical variables

I have a set of tree plot data that looks like this (a mix of categorical and continuous variables):

Climate  Species    Average_size    Canopy_cover    Structure
Hot      Pine       12.3            10%             open
Cold     Spruce     15.6            65%             closed
Cold     Fir        19.2            43%             closed

I have a second dataset for which I am trying to predict "Structure" (a categorical variable):

Climate  Species    Average_size    Canopy_cover    Structure
Hot      Pine       20.4            90%             ?
Cold     Spruce     18.9            54%             ?
Hot      Fir        26.4            28%             ?

Since I am predicting a categorical variable, I have tried using ANOVA and predict, with no luck. Am I on the right track?

aov1 <- aov(Structure ~ Canopy_cover + Average_size + Species + Climate, data = df)

predict(aov1, data.frame(Canopy_cover = 90, Average_size = 20.4, Species = "Pine", Climate = "Hot"))

A couple of things here. First, Canopy_cover will be read as a character variable, because the values shown above include a % sign. You likely want it as a continuous, numeric variable instead (see below for how to convert it).

The larger problem is modelling a categorical response with ANOVA: aov() is essentially a wrapper around linear regression (lm()), and linear regression assumes a continuous response. From what I can tell, your response takes two values, open or closed, so one approach is logistic regression. For the model below, first code Structure as 1 or 0.

Load your data, code "open" as 1 and "closed" as 0, and convert cover to numeric:

library(dplyr)   # mutate(), case_when(), %>%
library(tibble)  # tribble()

df1 <- tribble(
  ~climate, ~species, ~size, ~cover, ~structure,
  "hot", "pine", 12.3, "10%", "open",
  "cold", "spruce", 15.6, "65%", "closed",
  "cold", "fir", 19.2, "43%", "closed"
) %>%
  mutate(
    # 1 = open, 0 = closed
    target = case_when(
      structure == "open" ~ 1,
      TRUE ~ 0),
    # strip the percent sign: "10%" -> 10
    cover = as.numeric(gsub("%", "", cover))
  )

Do the same for your test data.

df2 <- tribble(
  ~climate, ~species, ~size, ~cover,
  "hot", "pine", 20.4, "90%", 
  "cold", "spruce", 18.9, "54%", 
  "hot", "fir", 26.4, "28%"
) %>%
  mutate(cover = as.numeric(gsub("%", "", cover)))

Fit a logistic regression model on df1:

fit <- glm(target ~ climate + species + size + cover, family = "binomial", data = df1)
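
As a side note, the 0/1 recoding is one option rather than a requirement: glm() with a binomial family also accepts a factor response, treating the first factor level as "failure" and any other level as "success". The sketch below illustrates that with hypothetical data (the plots data frame, its values, and the name fit_factor are made up for illustration and are not from the question), using cover as the only predictor.

```r
# Hypothetical illustration data (not from the question); cover is in percent.
plots <- data.frame(
  cover = c(10, 20, 35, 40, 55, 60, 70, 90),
  structure = factor(
    c("open", "open", "closed", "open", "closed", "open", "closed", "closed"),
    # The first level is treated as "failure", so the model estimates P(open).
    levels = c("closed", "open")
  )
)

# Logistic regression with a factor response instead of a 0/1 column.
fit_factor <- glm(structure ~ cover, family = binomial, data = plots)

# Predicted probability of "open" for two new (hypothetical) plots.
new_plots <- data.frame(cover = c(10, 90))
p_open <- predict(fit_factor, newdata = new_plots, type = "response")
print(p_open)
```

Setting the level order explicitly makes it unambiguous which class the predicted probabilities refer to.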

Predict for df2:

predict(fit, df2, type = "response")

This gives the predicted probabilities below, i.e. the probability that each plot is "open" (target = 1). R also warns that the fit is rank-deficient, because three rows cannot identify all of the model's coefficients, but I assume this won't be the case with real data.

           1            2            3 
1.000000e+00 5.826215e-11 1.000000e+00 
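
Since the goal is a Structure label rather than a probability, the probabilities still need to be turned into classes. A minimal sketch, assuming a 0.5 cutoff (the cutoff and the name predicted_structure are my choices, not something the model dictates); the probabilities are copied from the output above so the snippet runs on its own:

```r
# Probabilities copied from the predict() output above.
probs <- c(`1` = 1.000000e+00, `2` = 5.826215e-11, `3` = 1.000000e+00)

# target was coded 1 = "open", 0 = "closed", so a probability above the
# (assumed) 0.5 cutoff is labelled "open".
predicted_structure <- ifelse(probs > 0.5, "open", "closed")
print(predicted_structure)
```

With real data you would pass predict(fit, df2, type = "response") in place of the hard-coded vector, and may want a different cutoff if the two classes are unbalanced.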
