
R Problems with classification tree

I'm relatively new to machine learning, but I want to create a classification tree based on a World Bank dataset. The classification tree must split on 3 characteristics: continent = Europe, currency = euro, income group = high income.

The graphic should look like this:

[image]

I already tried this, but it does not give me the output that I want:

library(tidyverse)
library(rpart)

WDICountry <- read.csv("https://gigamove.rz.rwth-aachen.de/d/id/pUKMStHbu9orYo/dd/100")

tree1 <- WDICountry %>%
  mutate(europe = ifelse(`2-alpha code` %in% european_countries, TRUE, FALSE),
         euro = ifelse(`Currency Unit` == "Euro", TRUE, FALSE),
         income = as.factor(ifelse(`Income Group` == "  High income", "High income", "non-high")))%>%
  mutate(`Income Group` = as.factor(`Income Group`))%>%
  select(`Income Group`, europe, euro)%>%
  filter(complete.cases(.))%>%
  rpart(data = .,formula = `Income Group` ~ europe+ euro)

plot(tree1)
text(tree1)

[image]

Can someone help me?

You can download this data frame as a CSV file here: https://gigamove.rz.rwth-aachen.de/d/id/pUKMStHbu9orYo?10&id=pUKMStHbu9orYo

A decision tree is a classification model, so a split is only performed when there is enough evidence that it improves the classification. If there is not, the split is skipped, which is why your plot does not use every variable or every possible split.
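What counts as "enough evidence" is governed by rpart's stopping rules. As a minimal sketch (assuming the rpart package is installed), the defaults can be inspected with `rpart.control()`; the name `ctrl` is only illustrative:

```r
library(rpart)

# Stopping rules that rpart() applies when no control arguments are given
ctrl <- rpart.control()

ctrl$minsplit   # 20: a node needs at least this many observations before a split is attempted
ctrl$minbucket  # 7:  round(minsplit / 3), the minimum number of observations in any leaf
ctrl$cp         # 0.01: a split must improve the overall fit by at least this factor
```

With these defaults, a group of only a handful of countries can never end up in its own leaf, however distinctive it is.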

Below I will tweak some of the parameters to force all the splits to happen, but note that this is most likely not how you should build a decision tree model.

First, get the data. Note that you don't need ifelse to create a boolean, since the comparison already returns TRUE or FALSE. I also had a lot of problems with the variable names, so the code below uses the corrected column names as read from your csv file:

library(rpart)
library(rpart.plot)
library(dplyr)

# read.csv() rewrites headers such as "2-alpha code" to syntactic names like X2.alpha.code
WDICountry <- read.csv("WDICountry.csv", stringsAsFactors = FALSE)

# Codes of all countries whose Region mentions "Europe"
european_countries <- WDICountry[grep("Europe", WDICountry$Region), "X2.alpha.code"]

dat <- WDICountry %>%
  mutate(europe = X2.alpha.code %in% european_countries,  # comparisons already return TRUE/FALSE
         euro   = Currency.Unit == "Euro",
         income = as.factor(ifelse(Income.Group == "High income", "High income", "non-high"))) %>%
  select(income, europe, euro) %>%
  filter(complete.cases(.))

Before fitting the model, check how small the smallest group is by cross-tabulating europe (rows) against euro (columns):

table(dat$europe,dat$euro)
       
        FALSE TRUE
  FALSE   203    2
  TRUE     35   23

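The smallest cell holds only 2 observations, far fewer than the rpart defaults allow (minsplit = 20, with minbucket = round(minsplit/3)). So you need to set minsplit to its lowest useful value, 2, so that even this cell can be split off, and set the complexity parameter cp to a negative value so that no split is rejected for improving the fit too little: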

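# minsplit = 2 lets rpart split nodes with as few as 2 observations;
# cp = -1 accepts every split, however little it improves the fit
mdl = rpart(income ~ europe + euro, data = dat, method = "class", minsplit = 2, cp = -1)
rpart.plot(mdl)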

[image]
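To see the effect of these two settings without the csv file, here is a rough, self-contained sketch on toy data. The cell sizes are taken from the table above, but the number of high-income rows per cell (`n_high`) is invented purely for illustration and does not come from the World Bank file; `toy`, `cells` and `count_leaves` are hypothetical names.

```r
library(rpart)

# Cell sizes as in the europe/euro table; n_high is INVENTED for illustration only
cells <- data.frame(
  europe = c(FALSE, FALSE, TRUE, TRUE),
  euro   = c(FALSE, TRUE,  FALSE, TRUE),
  n      = c(203, 2, 35, 23),
  n_high = c(50, 1, 15, 22)
)

# Expand the cell counts into one row per (made-up) country
toy <- do.call(rbind, lapply(seq_len(nrow(cells)), function(i) {
  data.frame(
    europe = rep(cells$europe[i], cells$n[i]),
    euro   = rep(cells$euro[i], cells$n[i]),
    income = rep(c("High income", "non-high"),
                 times = c(cells$n_high[i], cells$n[i] - cells$n_high[i]))
  )
}))
toy$income <- factor(toy$income)

# Same formula twice: once with the default stopping rules, once forced
fit_default <- rpart(income ~ europe + euro, data = toy, method = "class")
fit_forced  <- rpart(income ~ europe + euro, data = toy, method = "class",
                     minsplit = 2, cp = -1)

# Terminal nodes are marked "<leaf>" in the fitted object's frame
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
n_default <- count_leaves(fit_default)
n_forced  <- count_leaves(fit_forced)

print(c(default = n_default, forced = n_forced))
```

The forced tree separates all four europe/euro combinations, while the default settings cannot isolate the 2-observation cell. Which of the two is the better model is a separate question; the forced tree simply shows every possible split.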
