
Plot decision tree based on strings with J48 algorithm for prediction

I'm trying to plot a J48 decision tree based on string-valued attributes and predict a categorical target variable. I have seen many examples that plot decision trees based on numerical values, but I haven't come across any based on strings.

Here is a sample data set for which the J48 decision tree works fine.

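library(RWeka)
library(party)

## read the data; since R 4.0.0 read.csv() keeps strings as character
## by default, so ask for factors explicitly (J48 needs a factor response)
MyData2 <- read.csv(file = "iris.csv", header = TRUE, sep = ",",
  stringsAsFactors = TRUE)

## fit and plot the tree
m3 <- J48(species ~ ., data = MyData2)
if(require("party", quietly = TRUE)) plot(m3)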


sepal_length  sepal_width  petal_length  petal_width  species
5.1           3.5          1.4           0.2          setosa
4.9           3            1.4           0.2          setosa
7             3.2          4.7           1.4          versicolor
6.4           3.2          4.5           1.5          versicolor
6.3           3.3          6             2.5          virginica
5.8           2.7          5.1           1.9          virginica

If I rename headers such as sepal_length and sepal_width to sepal_color and use values such as "white" and "black", with different combinations of colors for setosa, versicolor and virginica, how do I plot the decision tree and predict the target species value?

Suppose I have a data set like the one below:

 sepal_color    sepal_color petal_color petal_color species
    white         black       white        black    setosa
    white         yellow      white        yellow   versicolor
    green         brown       green        brown    virginica

If the string variables represent levels of a categorical variable, then they should be turned into factors with factor() in R. J48() can then deal with them appropriately (just like other model-fitting functions).
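The conversion can be sketched on a toy data set modelled on the table in the question. The column names sepal_color and petal_color, the repeated rows, and the new observation are assumptions made for illustration, and the J48 step only runs if RWeka (with a working Java) is installed.

```r
## Hypothetical toy data modelled on the table in the question;
## the column names and the repeated rows are made up for illustration.
train <- data.frame(
  sepal_color = rep(c("white", "white", "green"), each = 4),
  petal_color = rep(c("black", "yellow", "brown"), each = 4),
  species     = rep(c("setosa", "versicolor", "virginica"), each = 4),
  stringsAsFactors = FALSE
)

## Convert every character column (predictors and response) to a factor.
is_chr <- vapply(train, is.character, logical(1))
train[is_chr] <- lapply(train[is_chr], factor)
str(train)

## A new observation should use the same factor levels as the training data.
new_obs <- data.frame(
  sepal_color = factor("white",  levels = levels(train$sepal_color)),
  petal_color = factor("yellow", levels = levels(train$petal_color))
)

## Fit, plot and predict only if RWeka is available.
if (requireNamespace("RWeka", quietly = TRUE)) {
  fit <- RWeka::J48(species ~ ., data = train)
  print(fit)
  print(predict(fit, newdata = new_obs))
}
```

Building the new observation with levels taken from the training data keeps the factor coding consistent between fitting and prediction.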

However, if the strings contain free text, they are not supported directly. The text would first have to be preprocessed into numeric or factor features before calling J48().
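What such preprocessing could look like depends entirely on the text. As a rough sketch, assuming a made-up free-text column and hand-picked keyword rules (neither comes from the question), base R can derive factor and numeric features like this:

```r
## Hypothetical free-text values; the keyword rules are illustrative only.
notes <- c("Petals are white with black tips",
           "mostly yellow, some white",
           "Dark brown sepals")

features <- data.frame(
  ## TRUE/FALSE indicators stored as factors with both levels present
  mentions_white = factor(grepl("white", notes, ignore.case = TRUE),
                          levels = c(FALSE, TRUE)),
  mentions_black = factor(grepl("black", notes, ignore.case = TRUE),
                          levels = c(FALSE, TRUE)),
  ## a simple numeric feature: number of whitespace-separated words
  n_words = lengths(strsplit(notes, "[[:space:]]+"))
)
features
```

The resulting columns are ordinary factors and numbers, so they can be combined with the response and passed to J48() like any other predictors.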

As an example of classification based on categorical variables, let's turn the variables in the iris data into factors with three levels, low, medium and high (cutting each variable into three equally-sized groups at the corresponding quantiles):

## load data and convert to factors via cut()
## (include.lowest = TRUE keeps each variable's minimum in the "low" group
## instead of turning it into NA)
library(RWeka)
data("iris", package = "datasets")
for(i in 1:4) iris[[i]] <- cut(iris[[i]],
  quantile(iris[[i]], 0:3/3),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE
)
head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          low        high          low         low  setosa
## 2          low      medium          low         low  setosa
## 3          low      medium          low         low  setosa

## fit and plot J4.8 tree
j48 <- J48(Species ~ ., data = iris)
plot(j48)

[Figure: plot of the fitted J4.8 tree]
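One detail of the cut() step is worth checking: by default cut() uses intervals that are open on the left, so a value equal to the lowest break falls outside every interval unless include.lowest = TRUE is set. A small base R sketch on a made-up vector shows the difference:

```r
## Made-up values, cut at the 0, 1/3, 2/3 and 1 quantiles.
x <- c(1, 2, 3, 4, 5, 6)
breaks <- quantile(x, 0:3/3)

## Default: intervals are (a, b], so the minimum is not covered.
without_lowest <- cut(x, breaks, labels = c("low", "medium", "high"))

## include.lowest = TRUE closes the first interval on the left.
with_lowest <- cut(x, breaks, labels = c("low", "medium", "high"),
                   include.lowest = TRUE)

data.frame(x, without_lowest, with_lowest)
```

Rows with NA predictors are still accepted by many tree learners, but they are usually not what is intended when binning at quantiles.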

Does the algorithm allow string regressors? I tried it and it threw an error. With strings you could try encoding them as numbers, e.g. "White" = 1, "Black" = 2, etc. (strictly speaking this is an integer coding rather than a one-hot encoding), for example:

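library(RWeka)

## numeric colour code: 1 for setosa, 2 for all other species
MyData2 <- iris
MyData2$Colour <- 2
MyData2$Colour[MyData2$Species == "setosa"] <- 1

## fit and plot the tree
m3 <- J48(Species ~ ., data = MyData2)
plot(m3)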
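For comparison, a one-hot (dummy) encoding produces one 0/1 indicator column per level instead of a single integer code. A minimal sketch using base R's model.matrix(), with made-up Colour values:

```r
## Made-up colour values stored as a factor.
colours <- data.frame(Colour = factor(c("White", "Black", "White", "Green")))

## "- 1" drops the intercept so that every level gets its own 0/1 column.
one_hot <- model.matrix(~ Colour - 1, data = colours)
one_hot
```

Since J48() handles factors directly, this step is normally only needed for learners that require purely numeric input.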
