
Dealing with categorical data in R

This is the code I was using for my data mining assignment in RStudio. I was preprocessing the data.

setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')

dataset = read.csv('Dataset.csv') 
dataset[dataset == '?'] <- NA 
View(dataset)
x <- na.omit(dataset) 
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2 
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class,
                       levels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov"),
                       labels = c(1,2,3,4,5,6))

Here is the result of the code:

str(x)

As you can see in the result, after the last statement all the data in the column Hours_Per_week suddenly changes to NA. I don't really know why this happens, since every other example I saw online replaced the data with the labels.

The link for the dataset: dataset

Unfortunately, I don't know the original data. Possibly you only need to swap the contents of levels and labels:

x$Work_Class <- factor(x$Work_Class, levels = c(1,2,3,4,5,6), labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov") )
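For context, here is a minimal sketch of how the two arguments relate. In factor(), levels lists the values as they actually occur in the data, and labels gives the names to display for them. The vectors below are made up for illustration and are not taken from Dataset.csv, so whether the statement above applies depends on what the column really contains.

```r
# Hypothetical vector standing in for a Work_Class column (not the real data).
work_class <- c("Private", "State-gov", "Private", "Local-gov")

# levels = the values as they appear in the data; labels = what to call them.
coded <- factor(work_class,
                levels = c("Local-gov", "Private", "State-gov"),
                labels = c("L", "P", "S"))
print(coded)

# If the column already held integer codes, levels would be the codes
# and labels the names, as in the statement above.
codes <- c(2, 3, 2, 1)
named <- factor(codes,
                levels = c(1, 2, 3),
                labels = c("Local-gov", "Private", "State-gov"))
print(named)

# Any value that is not listed in levels becomes NA.
mismatch <- factor(work_class, levels = c(1, 2, 3))
print(mismatch)
```

The last call shows the failure mode: when the column holds strings but levels lists numbers, nothing matches and every entry comes back as NA.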

The problem is the factor() statement. The Dataset.csv file does not have character strings surrounded by quotation marks, so you get a leading space on every character field. A value such as " State-gov" does not match the level "State-gov" exactly, so factor() returns NA for it.

str(dataset)
# 'data.frame':  100 obs. of  7 variables:
#  $ Age           : int  39 50 38 53 28 37 49 52 31 42 ...
#  $ Work_Class    : chr  " State-gov" " Self-emp-not-inc" " Private" NA ...
#  $ Education     : chr  " Bachelors" " Bachelors" " HS-grad" " 11th" ...
#  $ Marital_Status: chr  " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
#  $ Sex           : chr  " Male" " Male" " Male" " Male" ...
#  $ Hours_Per_week: int  40 13 40 40 40 40 16 45 50 40 ...
#  $ Income        : chr  " <=50K" " <=50K" " <=50K" " <=50K" ...

Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:

dataset = read.csv('Dataset.csv', strip.white=TRUE) 

Then change the last line by removing the labels= argument:

x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))

str(x)
# 'data.frame': 93 obs. of  7 variables:
#  $ Age           : num  2 1 2 3 2 2 1 2 2 3 ...
#  $ Work_Class    : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
#  $ Education     : chr  "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
#  $ Marital_Status: chr  "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
#  $ Sex           : chr  "Male" "Male" "Male" "Female" ...
#  $ Hours_Per_week: num  2 3 2 2 2 3 2 2 1 2 ...
#  $ Income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
#  - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
#   ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
# 
#      Federal-gov        Local-gov          Private     Self-emp-inc Self-emp-not-inc        State-gov 
#                6                6               67                3                7                4 
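The effect can be reproduced without the original file. The sketch below uses a tiny inline CSV invented for illustration (three rows, with a space after each comma in the data rows) and an abbreviated level vector named lv. Neither comes from Dataset.csv. It also shows trimws() as a possible alternative when the data frame is already loaded.

```r
# Inline CSV invented for illustration; the space after each comma in the
# data rows mimics the unquoted fields described above.
csv_text <- "Age,Work_Class
39, State-gov
50, Self-emp-not-inc
38, Private"

# Abbreviated set of levels, enough for the three rows above.
lv <- c("Private", "Self-emp-not-inc", "State-gov")

# Default read: the leading space is kept, so nothing matches the levels.
raw <- read.csv(text = csv_text)
print(factor(raw$Work_Class, levels = lv))

# strip.white = TRUE trims the fields while reading.
clean <- read.csv(text = csv_text, strip.white = TRUE)
print(factor(clean$Work_Class, levels = lv))

# For a data frame that is already loaded, trimws() does the same job.
fixed <- factor(trimws(raw$Work_Class), levels = lv)
print(fixed)
```

The first factor() call returns only NA values, while the other two recover the three work classes.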
