
Can we use as.factor to convert multi-level categorical variables for a decision tree, or do we need to use model.matrix?

I am trying to build a decision tree model in R with both categorical and numerical variables. Some categorical variables have 3 levels, so can I just apply as.factor and then use them in my model? I tried model.matrix, but my concern is that it converts the variable into numeric 0/1 columns, and the splits are then made on those numeric values. For example, if Color has 3 levels (blue, red, green), the splitting rule looks like color_green < 0.5, even though the column only ever takes the values 0 and 1.

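If you are asking whether you can use factors to build an rpart decision tree, then yes. See the example below from the documentation: factor predictors such as Country and Tires are split directly on subsets of their levels, with no dummy coding needed. Note that there are many other packages for decision trees.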

library(rpart)
rpart(Reliability ~ ., data=car90)
#> n=76 (35 observations deleted due to missingness)
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 76 53 average (0.2 0.12 0.3 0.11 0.28)  
#>    2) Country=Germany,Korea,Mexico,Sweden,USA 49 29 average (0.31 0.18 0.41 0.1 0)  
#>      4) Tires=145,155/80,165/80,185/80,195/60,195/65,195/70,205/60,215/65,225/75,275/40 17  9 Much worse (0.47 0.29 0 0.24 0) *
#>      5) Tires=175/70,185/65,185/70,185/75,195/75,205/70,205/75,215/70 32 12 average (0.22 0.12 0.62 0.031 0)  
#>       10) HP.revs< 4650 13  7 Much worse (0.46 0.23 0.31 0 0) *
#>       11) HP.revs>=4650 19  3 average (0.053 0.053 0.84 0.053 0) *
#>    3) Country=Japan,Japan/USA 27  6 Much better (0 0 0.11 0.11 0.78) *
str(car90)
#> 'data.frame':    111 obs. of  34 variables:
#>  $ Country     : Factor w/ 10 levels "Brazil","England",..: 5 5 4 4 4 4 10 10 10 NA ...
#>  $ Disp        : num  112 163 141 121 152 209 151 231 231 189 ...
#>  $ Disp2       : num  1.8 2.7 2.3 2 2.5 3.5 2.5 3.8 3.8 3.1 ...
#>  $ Eng.Rev     : num  2935 2505 2775 2835 2625 ...
#>  $ Front.Hd    : num  3.5 2 2.5 4 2 3 4 6 5 5.5 ...
#>  $ Frt.Leg.Room: num  41.5 41.5 41.5 42 42 42 42 42 41 41 ...
#>  $ Frt.Shld    : num  53 55.5 56.5 52.5 52 54.5 56.5 58.5 59 58 ...
#>  $ Gear.Ratio  : num  3.26 2.95 3.27 3.25 3.02 2.8 NA NA NA NA ...
#>  $ Gear2       : num  3.21 3.02 3.25 3.25 2.99 2.85 2.84 1.99 1.99 2.33 ...
#>  $ HP          : num  130 160 130 108 168 208 110 165 165 101 ...
#>  $ HP.revs     : num  6000 5900 5500 5300 5800 5700 5200 4800 4800 4400 ...
#>  $ Height      : num  47.5 50 51.5 50.5 49.5 51 49.5 50.5 51 50.5 ...
#>  $ Length      : num  177 191 193 176 175 186 189 197 197 192 ...
#>  $ Luggage     : num  16 14 17 10 12 12 16 16 16 15 ...
#>  $ Mileage     : num  NA 20 NA 27 NA NA 21 NA 23 NA ...
#>  $ Model2      : Factor w/ 21 levels "","      Turbo 4 (3)",..: 1 1 1 1 1 1 1 14 13 1 ...
#>  $ Price       : num  11950 24760 26900 18900 24650 ...
#>  $ Rear.Hd     : num  1.5 2 3 1 1 2.5 2.5 4.5 3.5 3.5 ...
#>  $ Rear.Seating: num  26.5 28.5 31 28 25.5 27 28 30.5 28.5 27.5 ...
#>  $ RearShld    : num  52 55.5 55 52 51.5 55.5 56 58.5 58.5 56.5 ...
#>  $ Reliability : Ord.factor w/ 5 levels "Much worse"<"worse"<..: 5 5 NA NA 4 NA 3 3 3 NA ...
#>  $ Rim         : Factor w/ 6 levels "R12","R13","R14",..: 3 4 4 3 3 4 3 3 3 3 ...
#>  $ Sratio.m    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ Sratio.p    : num  0.86 0.96 0.97 0.71 0.88 0.78 0.76 0.83 0.87 0.88 ...
#>  $ Steering    : Factor w/ 3 levels "manual","power",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Tank        : num  13.2 18 21.1 15.9 16.4 21.1 15.7 18 18 16.5 ...
#>  $ Tires       : Factor w/ 30 levels "145","145/80",..: 16 20 20 8 17 28 13 23 23 22 ...
#>  $ Trans1      : Factor w/ 4 levels "","man.4","man.5",..: 3 3 3 3 3 3 1 1 1 1 ...
#>  $ Trans2      : Factor w/ 4 levels "","auto.3","auto.4",..: 3 3 2 2 3 3 2 3 3 3 ...
#>  $ Turning     : num  37 42 39 35 35 39 41 43 42 41 ...
#>  $ Type        : Factor w/ 6 levels "Compact","Large",..: 4 3 3 1 1 3 3 2 2 NA ...
#>  $ Weight      : num  2700 3265 2935 2670 2895 ...
#>  $ Wheel.base  : num  102 109 106 100 101 109 105 111 111 108 ...
#>  $ Width       : num  67 69 71 67 65 69 69 72 72 71 ...
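
To connect this back to the model.matrix concern in the question, here is a minimal sketch comparing the two encodings. The toy data frame, its column names (color, size, y) and the object names are made up for illustration and are not from the original post. rpart treats every numeric predictor as a threshold split, so a rule like colorgreen < 0.5 on a 0/1 column simply separates the 0s from the 1s; it is the same partition as "green vs. not green".

```r
library(rpart)

# Hypothetical toy data: `color` is a three-level categorical predictor,
# `size` is numeric, and the outcome `y` depends only on color being green.
set.seed(1)
toy <- data.frame(
  color = rep(c("blue", "red", "green"), each = 20),
  size  = runif(60),
  stringsAsFactors = FALSE
)
toy$y <- factor(ifelse(toy$color == "green", "yes", "no"))

# Option 1: convert to a factor and pass it straight to rpart.
toy$color <- as.factor(toy$color)
fit_factor <- rpart(y ~ color + size, data = toy, method = "class")

# Option 2: expand the factor into 0/1 dummy columns with model.matrix.
# Dropping the intercept (- 1) keeps one column per level:
# colorblue, colorgreen, colorred.
dummies <- model.matrix(~ color - 1, data = toy)
toy_dummy <- data.frame(dummies, size = toy$size, y = toy$y)
fit_dummy <- rpart(y ~ ., data = toy_dummy, method = "class")

# Print both trees to compare how the first split is written.
print(fit_factor)
print(fit_dummy)

# Both encodings should classify the training rows identically.
pred_factor <- as.character(predict(fit_factor, toy, type = "class"))
pred_dummy  <- as.character(predict(fit_dummy, toy_dummy, type = "class"))
same_predictions <- all(pred_factor == pred_dummy)
```

On this toy data the factor version splits on a set of levels of color, while the dummy version splits on colorgreen with a 0.5 threshold, and both give the same predictions. One practical difference: a factor split can group several levels in a single step, whereas dummy columns can only isolate one level per split, so as.factor is usually the simpler choice with rpart.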


 