RWeka J48 classification issue in R with the MovieLens data set

I want to classify the demographic data in the MovieLens users table, but the result from J48 looks wrong. When I classify the same data with C5.0, everything is fine, but I have to use this algorithm (J48).

The structure of my data is shown below:

$ user_id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age       : Factor w/ 7 levels "1","18","25",..: 1 7 3 5 3 6 4 3 3 4 ...
 $ occupation: Factor w/ 21 levels "0","1","2","3",..: 11 17 16 8 21 10 2 13 18 2 ...
 $ gender    : Factor w/ 2 levels "F","M": 1 2 2 2 2 1 2 2 2 1 ...
 $ Class     : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 3 2 2 2 2 4 ...

The head of the data is:

head(data)
  user_id age occupation gender Class
1       1   1         10      F     2
2       2  56         16      M     2
3       3  25         15      M     2
4       4  45          7      M     2
5       5  25         20      M     3
6       6  50          9      F     2

All columns except user_id are nominal and should be factors in R.

Code for classification:

library(RWeka)
fit <- J48(data$Class~., data=data[,-c(1)], control = Weka_control(C=0.25))
currentUserClass = predict(fit,data[,-c(1)])
table(currentUserClass , data$Class)

The resulting (wrong) confusion table is:

currentUserClass    1    2    3    4
               1    0    0    0    0
               2  216 3630 1549  645
               3    0    0    0    0
               4    0    0    0    0

When I fit the model with C5.0, I get the result below, which is what I expect from both algorithms:

predictions    1    2    3    4
          1  216    0    0    0
          2    0 3630    0    0
          3    0    0 1549    0
          4    0    0    0  645

Further attempts:

  1. I changed the structure of my data by converting the factor columns into separate indicator columns, and nothing changed.
  2. I changed the C control value. The result is slightly better at C=0.75, but it is still wrong.

Even after normalizing and restructuring the data, nothing changed:

> head(data)
  user_id       age1      age18      age25      age35      age45      age50
1       1  5.1188737 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
2       2 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
3       3 -0.1953231 -0.4726289  1.3716296 -0.4960755 -0.3164894 -0.2990841
4       4 -0.1953231 -0.4726289 -0.7289391 -0.4960755  3.1591400 -0.2990841
5       5 -0.1953231 -0.4726289  1.3716296 -0.4960755 -0.3164894 -0.2990841
6       6 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894  3.3429880
       age56 occupation1 occupation2 occupation3 occupation4 occupation5
1 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
2  3.8590505  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
3 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
4 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
5 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
6 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
  occupation6 occupation7 occupation8 occupation9 occupation10 occupation11
1  -0.2016306  -0.3558574 -0.05312294  -0.1243576    5.4744311   -0.1477163
2  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
3  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
4  -0.2016306   2.8096490 -0.05312294  -0.1243576   -0.1826371   -0.1477163
5  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
6  -0.2016306  -0.3558574 -0.05312294   8.0399919   -0.1826371   -0.1477163
  occupation12 occupation13 occupation14 occupation15 occupation16 occupation17
1   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
2   -0.2619865   -0.1551514   -0.2293967   -0.1562667    4.9049217   -0.3010506
3   -0.2619865   -0.1551514   -0.2293967    6.3982549   -0.2038431   -0.3010506
4   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
5   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
6   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
  occupation18 occupation19 occupation20    genderM Class
1   -0.1082744   -0.1098287   -0.2208735 -1.5917949     2
2   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
3   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
4   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
5   -0.1082744   -0.1098287    4.5267283  0.6281176     3
6   -0.1082744   -0.1098287   -0.2208735 -1.5917949     2
> fit <- J48(data$Class~., data=data, control = Weka_control(C=0.25))
> currentUserClass = predict(fit,data)
> table(currentUserClass , data$Class)

currentUserClass    1    2    3    4
               1    7    1    2    2
               2  201 3601 1470  617
               3    8   28   75   14
               4    0    0    2   12

J48 implements the C4.5 decision tree algorithm, so the performance of C5.0 and C4.5 may differ. That said, the parameters of J48 within Weka can be modified (as you have shown in your code above). Perhaps that will help satisfy your needs.

To start, your tree is likely a single leaf predicting class 2. This can be checked by printing the decision tree. The code below does so with the "mtcars" dataset (a dataset built into R).

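library(RWeka)

# Treat the number of carburetors as the class to predict
dat <- mtcars
dat$carb <- factor(dat$carb)
model1 <- J48(carb ~ ., data = dat)
model1  # printing the model shows the tree structure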

However, if the tree is rebuilt with a smaller minimum number of objects per leaf and no pruning, the tree will be larger.

# M = 1 allows leaves with a single instance; U = TRUE builds an unpruned tree
model2 <- J48(carb ~ ., data = dat, control = Weka_control(M = 1, U = TRUE))
model2
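An unpruned tree with single-instance leaves will usually fit the training data more closely, so it is worth checking whether the extra size helps on unseen data. The following is a minimal sketch, assuming the same mtcars setup as above; the names `pruned`, `unpruned`, and `train_acc` are chosen here for illustration, and `evaluate_Weka_classifier` is RWeka's interface to Weka's evaluation (here with 10-fold cross-validation).

```r
library(RWeka)  # needs a working Java installation

# Same setup as the example above
dat <- mtcars
dat$carb <- factor(dat$carb)

pruned   <- J48(carb ~ ., data = dat)
unpruned <- J48(carb ~ ., data = dat,
                control = Weka_control(M = 1, U = TRUE))

# Accuracy on the training data (illustrative helper, not part of RWeka);
# this figure is optimistic, especially for the unpruned tree
train_acc <- function(model) {
  mean(as.character(predict(model, dat)) == as.character(dat$carb))
}
train_acc(pruned)
train_acc(unpruned)

# 10-fold cross-validation gives a fairer comparison of the two settings
evaluate_Weka_classifier(pruned,   numFolds = 10, seed = 1)
evaluate_Weka_classifier(unpruned, numFolds = 10, seed = 1)
```

If the cross-validated accuracy does not improve along with the training accuracy, the larger tree is probably just memorizing the training set.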

The following can be used to check the valid parameters of J48:

WOW(J48)

You should change the default parameters of J48 to fit your particular need. I recommend comparing the parameters used in your C5.0 model to the default parameters of J48 and making modifications where necessary.
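When trying different parameters, it may help to have a baseline to compare against. The sketch below uses only base R and the two J48 confusion tables posted in the question; the helper names `make_table` and `accuracy` are made up for this example. It computes the accuracy of each result next to the share of the most frequent class, which is what a single-leaf tree achieves.

```r
# Confusion tables copied from the question
# (rows = predicted class, columns = actual class)
classes <- c("1", "2", "3", "4")
make_table <- function(values) {
  matrix(values, nrow = 4, byrow = TRUE,
         dimnames = list(predicted = classes, actual = classes))
}

j48_default <- make_table(c(  0,    0,    0,   0,
                            216, 3630, 1549, 645,
                              0,    0,    0,   0,
                              0,    0,    0,   0))

j48_indicator <- make_table(c(  7,    1,    2,   2,
                              201, 3601, 1470, 617,
                                8,   28,   75,  14,
                                0,    0,    2,  12))

# Fraction of correctly classified users
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

# Share of the most frequent actual class
baseline <- max(colSums(j48_default)) / sum(j48_default)

round(c(baseline      = baseline,
        j48_default   = accuracy(j48_default),
        j48_indicator = accuracy(j48_indicator)), 3)
```

With the numbers from the question, the baseline and the first J48 result are both about 0.601 (3630 of 6040 users), and the run on the indicator columns reaches about 0.612, so that tree barely improves on always predicting class 2.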
