Dealing with categorical data in R

Question

This is the code I am using to preprocess the data for my data mining assignment in RStudio.

setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')

dataset = read.csv('Dataset.csv') 
dataset[dataset == '?'] <- NA 
View(dataset)
x <- na.omit(dataset) 
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2 
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class,
    levels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov"),
    labels = c(1,2,3,4,5,6))

Here is the result of the code:

str(x)

As you can see in the result, after the last line of code every value in the Work_Class column suddenly becomes NA. I don't know why this happens, since every other example I saw online replaced the data with the labels.

The link for the dataset:

dataset

Answer 1

Unfortunately, I don't know the original data. Perhaps you only need to swap the contents of levels and labels:

x$Work_Class <- factor(x$Work_Class, levels = c(1,2,3,4,5,6), labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov") )
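To make the roles of the two arguments concrete: in factor(), levels lists the values that actually occur in the data, and labels gives the names those levels should carry afterwards. The swap above would therefore only apply if Work_Class already held the codes 1 to 6. A minimal sketch with made-up vectors (codes and names_vec are illustrative and not taken from Dataset.csv):

```r
# Hypothetical toy vectors, not taken from Dataset.csv
codes <- c(3, 1, 3, 2)
names_vec <- c("Private", "Federal-gov", " Private")  # last value has a leading space

# levels = values present in the data, labels = what to call them
f1 <- factor(codes, levels = c(1, 2, 3),
             labels = c("Federal-gov", "Local-gov", "Private"))
as.character(f1)  # "Private" "Federal-gov" "Private" "Local-gov"

# Any value that is not listed in levels becomes NA
f2 <- factor(names_vec, levels = c("Federal-gov", "Private"))
is.na(f2)  # FALSE FALSE TRUE
```

If the column holds strings rather than codes, the strings belong in levels, and any string that does not match exactly (including stray whitespace) ends up as NA.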

Answer 2

The problem is the factor() statement. The character strings in Dataset.csv are not surrounded by quotation marks, so every character field is read with a leading space. None of the values then match the levels given to factor(), and unmatched values become NA.

str(dataset)
# 'data.frame':  100 obs. of  7 variables:
#  $ Age           : int  39 50 38 53 28 37 49 52 31 42 ...
#  $ Work_Class    : chr  " State-gov" " Self-emp-not-inc" " Private" NA ...
#  $ Education     : chr  " Bachelors" " Bachelors" " HS-grad" " 11th" ...
#  $ Marital_Status: chr  " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
#  $ Sex           : chr  " Male" " Male" " Male" " Male" ...
#  $ Hours_Per_week: int  40 13 40 40 40 40 16 45 50 40 ...
#  $ Income        : chr  " <=50K" " <=50K" " <=50K" " <=50K" ...

Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:

dataset = read.csv('Dataset.csv', strip.white=TRUE)
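The effect of strip.white can be checked without the original file. The sketch below uses a hypothetical two-row CSV string (csv_text is made up, with a space after each comma to mimic the layout described above) and also shows trimws() as an alternative for a data frame that has already been loaded. stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Hypothetical two-row CSV with a space after each comma
csv_text <- "Age,Work_Class\n39, State-gov\n50, Private"

raw     <- read.csv(text = csv_text, stringsAsFactors = FALSE)
trimmed <- read.csv(text = csv_text, stringsAsFactors = FALSE, strip.white = TRUE)

raw$Work_Class      # " State-gov" " Private"
trimmed$Work_Class  # "State-gov" "Private"

# Alternative when the data is already loaded: trim character columns in place
is_chr <- sapply(raw, is.character)
raw[is_chr] <- lapply(raw[is_chr], trimws)
identical(raw$Work_Class, trimmed$Work_Class)  # TRUE
```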

Then change the last line by removing the labels= argument:

x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))

str(x)
# 'data.frame': 93 obs. of  7 variables:
#  $ Age           : num  2 1 2 3 2 2 1 2 2 3 ...
#  $ Work_Class    : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
#  $ Education     : chr  "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
#  $ Marital_Status: chr  "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
#  $ Sex           : chr  "Male" "Male" "Male" "Female" ...
#  $ Hours_Per_week: num  2 3 2 2 2 3 2 2 1 2 ...
#  $ Income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
#  - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
#   ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
# 
#      Federal-gov        Local-gov          Private     Self-emp-inc Self-emp-not-inc        State-gov 
#                6                6               67                3                7                4
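As a side note on the binning step in the question, the three sequential assignments for Age and Hours_Per_week could probably be expressed with cut() instead. This is only a sketch, assuming the same cut points as the question (30 or less, 31 to 49, 50 or more) and whole-number values; the names age, age_bin and age_code are illustrative.

```r
# Ages from the first rows of the str(dataset) output above
age <- c(39, 50, 38, 53, 28, 37, 49, 52, 31, 42)

# Intervals are (-Inf, 30], (30, 49], (49, Inf]; labels follow the question's coding
age_bin <- cut(age, breaks = c(-Inf, 30, 49, Inf), labels = c("3", "2", "1"))

# cut() returns a factor; convert through character to recover the numeric codes
age_code <- as.numeric(as.character(age_bin))
age_code  # 2 1 2 1 3 2 2 1 2 2
```

Unlike overwriting the column in several passes, this never compares already recoded values against the original thresholds.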
