Dealing with categorical data in R
This is the code that I was using for my data mining assignment in RStudio. I was preprocessing the data.
setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')
dataset = read.csv('Dataset.csv')
dataset[dataset == '?'] <- NA
View(dataset)
x <- na.omit(dataset)
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class,
                       levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
                       labels = c(1, 2, 3, 4, 5, 6))
Here I attach the result of the code: (image: the result)
As you can see in the result, after the last line of code, all the data in the column Hours_Per_week suddenly changed to NA. I don't really know why this occurs, since every other example that I saw online changed the data to the labels.
The link for the dataset:
Unfortunately, I don't know the original data. Possibly you only need to swap the contents of levels and labels:
x$Work_Class <- factor(x$Work_Class, levels = c(1,2,3,4,5,6), labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov") )
The problem is the factor() statement. The Dataset.csv file does not have character strings surrounded by quotation marks, so you get a leading space on every character field.
str(dataset)
# 'data.frame': 100 obs. of 7 variables:
# $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
# $ Work_Class : chr " State-gov" " Self-emp-not-inc" " Private" NA ...
# $ Education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
# $ Marital_Status: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
# $ Sex : chr " Male" " Male" " Male" " Male" ...
# $ Hours_Per_week: int 40 13 40 40 40 40 16 45 50 40 ...
# $ Income : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:
dataset = read.csv('Dataset.csv', strip.white=TRUE)
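As a small illustration of what strip.white changes, here is a sketch that reads a made-up three-row sample instead of the real file. The sample text and the names csv_text, raw, and trimmed are assumptions for the demo; only read.csv() and its strip.white argument come from the answer.

```r
# Hypothetical sample mimicking the layout described above:
# unquoted fields, each preceded by a space after the comma.
csv_text <- "Age,Work_Class,Hours_Per_week
39, State-gov, 40
50, Self-emp-not-inc, 13
38, Private, 40"

# Default behaviour: the leading space stays part of each string.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# strip.white = TRUE trims leading and trailing white space from unquoted fields.
trimmed <- read.csv(text = csv_text, strip.white = TRUE, stringsAsFactors = FALSE)

print(raw$Work_Class)      # strings still start with a space
print(trimmed$Work_Class)  # strings are trimmed
```

Comparing the two printed vectors should make the leading space visible without opening the file in an editor.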
Then change the last line by removing the labels= argument:
x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))
str(x)
# 'data.frame': 93 obs. of 7 variables:
# $ Age : num 2 1 2 3 2 2 1 2 2 3 ...
# $ Work_Class : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
# $ Education : chr "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
# $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
# $ Sex : chr "Male" "Male" "Male" "Female" ...
# $ Hours_Per_week: num 2 3 2 2 2 3 2 2 1 2 ...
# $ Income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
# - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
# ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
#
#      Federal-gov        Local-gov          Private     Self-emp-inc Self-emp-not-inc        State-gov
#                6                6               67                3                7                4
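The NA values can be reproduced on a tiny vector, which may help show why the leading space matters. This is only a sketch: work_class, wanted, bad, good, and coded are made-up names, and trimws() is offered as an alternative for data that is already loaded, not as part of the fix above.

```r
# Hypothetical values carrying the leading space seen in str(dataset).
work_class <- c(" State-gov", " Private", " Private")
wanted     <- c("Private", "State-gov")

# factor() matches values against levels exactly; " Private" is not
# "Private", so every entry falls outside the levels and becomes NA.
bad <- factor(work_class, levels = wanted)
print(bad)

# trimws() removes surrounding white space from strings already in memory.
good <- factor(trimws(work_class), levels = wanted)
print(good)

# labels= only renames levels that matched; it cannot rescue unmatched values.
coded <- factor(trimws(work_class), levels = wanted, labels = c(1, 2))
print(as.integer(coded))
```

Because the matching happens on levels, not labels, adding or swapping labels alone would not be expected to remove the NA values.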