This is the code I was using to preprocess the data for my data mining assignment in RStudio.
setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')
dataset = read.csv('Dataset.csv')
dataset[dataset == '?'] <- NA
View(dataset)
x <- na.omit(dataset)
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov"), labels = c(1,2,3,4,5,6))
Below is the result of the code. (image: the result)
As you can see in the result, after the last line of code, all the data in the column Hours_Per_week suddenly changed to NA. I don't really know why this happens, since every other example I saw online replaced the data with the labels.
The link for the dataset:
Unfortunately, I don't know the original data. Perhaps you just need to swap the contents of levels and labels:
x$Work_Class <- factor(x$Work_Class, levels = c(1,2,3,4,5,6), labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov") )
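Whether this swap helps depends on what the column actually holds. The sketch below uses made-up values (not taken from Dataset.csv) to show that levels = 1:6 only matches a column that already contains the codes 1 to 6. If the column holds text, nothing matches and every value becomes NA.

```r
# Made-up values for illustration only; not taken from Dataset.csv.
work_labels <- c("Federal-gov", "Local-gov", "Private",
                 "Self-emp-inc", "Self-emp-not-inc", "State-gov")

# Case 1: the column holds numeric codes, so levels = 1:6 matches them
# and labels = renames the matched levels.
codes <- c(3, 1, 6)
f_codes <- factor(codes, levels = 1:6, labels = work_labels)
print(as.character(f_codes))

# Case 2: the column holds text, so none of the values match levels 1..6
# and factor() returns NA for every element.
texts <- c("Private", "Federal-gov", "State-gov")
f_texts <- factor(texts, levels = 1:6, labels = work_labels)
print(is.na(f_texts))
```

In other words, levels must describe the values currently in the column, while labels describes what you want them to be called afterwards.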
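The problem is the factor() statement. The Dataset.csv file does not have character strings surrounded by quotation marks, so you get a leading space on every character field. Because none of those values exactly matches a level you listed, factor() turns all of them into NA.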
str(dataset)
# 'data.frame': 100 obs. of 7 variables:
# $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
# $ Work_Class : chr " State-gov" " Self-emp-not-inc" " Private" NA ...
# $ Education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
# $ Marital_Status: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
# $ Sex : chr " Male" " Male" " Male" " Male" ...
# $ Hours_Per_week: int 40 13 40 40 40 40 16 45 50 40 ...
# $ Income : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:
dataset = read.csv('Dataset.csv', strip.white=TRUE)
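To see the effect without the original file, here is a minimal sketch. It assumes a made-up three-row extract written in the same unquoted style; csv_text and its rows are hypothetical. It reads the same text with and without strip.white and counts how many values factor() fails to match.

```r
# Hypothetical extract in the same style as Dataset.csv (space after each comma).
csv_text <- "Age,Work_Class\n39, State-gov\n50, Self-emp-not-inc\n38, Private"

work_levels <- c("Federal-gov", "Local-gov", "Private",
                 "Self-emp-inc", "Self-emp-not-inc", "State-gov")

# Default read: the leading space stays part of each string.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
# strip.white = TRUE drops leading and trailing whitespace from unquoted fields.
trimmed <- read.csv(text = csv_text, strip.white = TRUE, stringsAsFactors = FALSE)

print(raw$Work_Class)
print(trimmed$Work_Class)

# Count the values that do not match any level and therefore become NA.
na_raw     <- sum(is.na(factor(raw$Work_Class, levels = work_levels)))
na_trimmed <- sum(is.na(factor(trimmed$Work_Class, levels = work_levels)))
print(c(raw = na_raw, trimmed = na_trimmed))
```

stringsAsFactors = FALSE is set explicitly so the sketch behaves the same on R versions before 4.0, where read.csv converted strings to factors by default.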
Then change the last line by removing the labels= argument:
x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))
str(x)
# 'data.frame': 93 obs. of 7 variables:
# $ Age : num 2 1 2 3 2 2 1 2 2 3 ...
# $ Work_Class : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
# $ Education : chr "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
# $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
# $ Sex : chr "Male" "Male" "Male" "Female" ...
# $ Hours_Per_week: num 2 3 2 2 2 3 2 2 1 2 ...
# $ Income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
# - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
# ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
#
#      Federal-gov        Local-gov          Private     Self-emp-inc Self-emp-not-inc        State-gov
#                6                6               67                3                7                4
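If the data frame has already been read without strip.white, a similar cleanup can be sketched with trimws() on the character columns. The data frame below is made up for illustration, and Work_Class_Code is a hypothetical column name. It also shows one way to get the numeric codes 1 to 6 that the question was aiming for: as.integer() on a factor returns each value's position in levels.

```r
# Made-up data frame standing in for one that was read without strip.white = TRUE.
x <- data.frame(Age = c(39, 50, 38),
                Work_Class = c(" State-gov", " Self-emp-not-inc", " Private"),
                stringsAsFactors = FALSE)

# Trim leading and trailing whitespace in every character column.
is_chr <- vapply(x, is.character, logical(1))
x[is_chr] <- lapply(x[is_chr], trimws)

work_levels <- c("Federal-gov", "Local-gov", "Private",
                 "Self-emp-inc", "Self-emp-not-inc", "State-gov")
x$Work_Class <- factor(x$Work_Class, levels = work_levels)

# as.integer() on a factor gives the position of each value in the levels,
# which is the 1..6 coding (hypothetical column name Work_Class_Code).
x$Work_Class_Code <- as.integer(x$Work_Class)
print(x)
```

Keeping the text levels on the factor and deriving the integer code separately avoids the labels= argument altogether, so the readable names are never lost.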