
Titanic Kaggle dataset Naive Bayes classifier error R programming

I am trying to train a naive Bayes classifier on the Kaggle Titanic dataset (https://www.kaggle.com/c/titanic/data, files "train.csv" and "test.csv").

The code I have so far is:

library(e1071)

train_d <- read.csv("train.csv", stringsAsFactors = TRUE)

# columns chosen for training data-
# colnames(train_d)  OR names(train_d)
# "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"
train_data <- train_d[, c(2:3, 5:8, 12)]

# to find out which columns contain NA (missing values)-
colnames(train_data)[apply(is.na(train_data), 2, any)]

# mean(train_data$Age, na.rm = TRUE)    # mean of 'Age', ignoring 'NA'
# which(is.na(train_data$Age))          # row indices where 'Age' is 'NA'

# fill in missing value (NA) with mean of 'Age' column-
train_data$Age[which(is.na(train_data$Age))] <- mean(train_data$Age, na.rm = TRUE)

# check whether there are any existing NAs-
which(is.na(train_data$Age))
# OR-
colnames(train_data)[apply(is.na(train_data), 2, any)]


test_d <- read.csv("test.csv", stringsAsFactors = TRUE)

# columns chosen for test data-
# "Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"
test_data <- test_d[, c(2, 4:7, 11)]

# find out missing values (NA)-
colnames(test_data)[apply(is.na(test_data), 2, any)]

# fill in missing value (NA) with mean of 'Age' column-
test_data$Age[which(is.na(test_data$Age))] <- mean(test_data$Age, na.rm = TRUE)

# check whether there are any remaining NAs in the test data-
which(is.na(test_data$Age))
# OR-
colnames(test_data)[apply(is.na(test_data), 2, any)]




# training a naive-bayes classifier-
titanic_nb <- naiveBayes(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked, data = train_data)


# predict using trained naive-bayes classifier-
output <- predict(titanic_nb, test_data, type = "class")

However, 'output' is empty. Printing it gives:

> output
factor(0)
Levels: 

What's going wrong?

Thanks!

Answer (the original question was deleted, so this comes from a web-cache copy):

The reason is that the model doesn't really know how to deal with character columns. You can see how those columns get coerced by running data.matrix(test_data).
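A quick way to see which columns are affected is to inspect the class of each one. The following is only a sketch on a made-up data frame; `toy` and its values are assumptions for illustration, not rows from test.csv:

```r
# Toy stand-in for test_data; the values are invented for illustration.
toy <- data.frame(
  Pclass   = c(3L, 1L),
  Sex      = c("male", "female"),
  Embarked = c("S", "C"),
  stringsAsFactors = FALSE
)

# Class of every column; "character" columns are the ones to convert.
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that are still plain character vectors.
char_cols <- names(toy)[sapply(toy, is.character)]
print(char_cols)
```

Running the same two calls on the real train_data and test_data should show which columns, if any, still need converting.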

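The solution is to first convert your character columns into factors, making sure the factor levels are the same in train and test. (Reading each file with stringsAsFactors = TRUE does produce factors, but each read.csv call derives the levels from its own file, so the two sets of levels are not guaranteed to match.)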
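One possible way to do the conversion is sketched below on invented data. The names `train_toy` and `test_toy` are hypothetical stand-ins for train_data and test_data. Turning Survived into a factor is an extra step that the answer does not mention; it is my assumption that the classifier should see the response as a set of classes rather than as numbers.

```r
# Toy stand-ins for train_data and test_data; values are invented.
train_toy <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)
test_toy <- data.frame(
  Sex      = c("female", "male"),
  Embarked = c("Q", "S"),
  stringsAsFactors = FALSE
)

# Assumption: make the response a factor so it is treated as class labels.
train_toy$Survived <- factor(train_toy$Survived, levels = c(0, 1))

# Build each factor from the union of values seen in either set,
# so train and test share exactly the same levels.
for (col in c("Sex", "Embarked")) {
  all_levels <- sort(unique(c(train_toy[[col]], test_toy[[col]])))
  train_toy[[col]] <- factor(train_toy[[col]], levels = all_levels)
  test_toy[[col]]  <- factor(test_toy[[col]],  levels = all_levels)
}

# TRUE when the level sets match for this column.
levels_match <- identical(levels(train_toy$Embarked), levels(test_toy$Embarked))
print(levels_match)
```

If the columns are already factors, wrapping them in as.character() before taking the union keeps the loop working the same way.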

On a side note, I suggest starting with a random forest: it generally performs well without any parameter tuning and makes no assumption about the distribution of your variables. Naive Bayes, by contrast, assumes that numeric predictors follow a Gaussian distribution within each class.
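To make the Gaussian assumption concrete, here is a small base-R sketch of what a Gaussian naive Bayes does with one numeric predictor: it estimates a mean and standard deviation per class and scores new values with the normal density. The ages and labels are invented, and this illustrates the idea rather than reproducing e1071's internals.

```r
# Invented ages and labels, purely for illustration.
age      <- c(22, 38, 26, 35, 54, 2)
survived <- factor(c(0, 1, 1, 1, 0, 0))

# Per-class mean and standard deviation of the numeric predictor.
class_mean <- as.numeric(tapply(age, survived, mean))
class_sd   <- as.numeric(tapply(age, survived, sd))

# Gaussian density of a new age under each class.
new_age    <- 30
likelihood <- dnorm(new_age, mean = class_mean, sd = class_sd)
names(likelihood) <- levels(survived)
print(likelihood)
```

If a predictor is strongly skewed, these per-class normal densities can fit it poorly, which is the concern raised in the side note.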


 