
feature selection for Naive Bayes

I did a classification with Naive Bayes. The goal is to predict a factor with 4 levels from review texts. The data looks like this:

 'data.frame':  387 obs. of  2 variables:
 $ reviewText: chr  "I love this. I have a D800. I am mention my camera to make sure that you understand that this product is not ju"| __truncated__ "I hate buying larger gig memory cards - because there's always that greater risk of losing the photos, and/or r"| __truncated__ "These chromebooks are really a pretty nice idea -- Almost no maintaince (no maintaince?), no moving parts, smal"| __truncated__ "Purchased, as this drive allows a much speedier read/write and is just below a full SSD (they need to drop the "| __truncated__ ...
 $ pragmatic : Factor w/ 4 levels "-1","0","1","9": 4 4 4 3 3 4 3 3 3...

I did the classification with the caret package. The code for the classification looks like this:

library(tm)        # Corpus, tm_map, DocumentTermMatrix, findFreqTerms
library(caret)     # createDataPartition, train, trainControl, confusionMatrix
library(magrittr)  # pipe operator %>%

sms_corpus <- Corpus(VectorSource(sms_raw$text))
sms_corpus_clean <- sms_corpus %>%
    tm_map(content_transformer(tolower)) %>% 
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords(kind="en")) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

train_index <- createDataPartition(sms_raw$type, p=0.5, list=FALSE)
sms_raw_train <- sms_raw[train_index,]
sms_raw_test <- sms_raw[-train_index,]
sms_corpus_clean_train <- sms_corpus_clean[train_index]
sms_corpus_clean_test <- sms_corpus_clean[-train_index]
sms_dtm_train <- sms_dtm[train_index,]
sms_dtm_test <- sms_dtm[-train_index,]

sms_dict <- findFreqTerms(sms_dtm_train, lowfreq= 5) 
sms_train <- DocumentTermMatrix(sms_corpus_clean_train, list(dictionary=sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_clean_test, list(dictionary=sms_dict))

# convert raw term counts into a two-level factor: term absent vs. present
convert_counts <- function(x) {
    x <- ifelse(x > 0, 1, 0)
    factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}
sms_train <- sms_train %>% apply(MARGIN=2, FUN=convert_counts)
sms_test <- sms_test %>% apply(MARGIN=2, FUN=convert_counts)


ctrl <- trainControl(method="cv", 10)
set.seed(8)
sms_model1 <- train(sms_train, sms_raw_train$type, method="nb",
                trControl=ctrl)


sms_predict1 <- predict(sms_model1, sms_test)
cm1 <- confusionMatrix(sms_predict1, sms_raw_test$type)

When I use the model this way, meaning I do the prediction for all 4 values at the same time, I get a low accuracy of 0.5469. The confusion matrix looks like this:

          Reference
Prediction -1  0  1  9
        -1  0  0  1  0
        0   0  0  0  0
        1   9  5 33 25
        9  11  3 33 72

When I do the prediction for each of the 4 values separately, I get a much better result. The code for the classification is the same as above, but instead of df$sensorial <- factor(df$sensorial) I do df$sensorial <- as.factor(df$sensorial == 9). For the other values, I use 1 , -1 or 0 instead of the 9 . That way I get an accuracy of 0.772 for the 9 , 0.829 for the -1 , 0.9016 for the 0 and 0.7959 for the 1 . Overall, the result is far better.

So it must have something to do with feature selection. The reason for the different results might be that the features are often the same for the different values. A possible solution could therefore be to give more importance to those features that occur only in the presence of a certain value but not in the presence of the others. Is there a way to select the features such that the model gets better when I predict all 4 values simultaneously? Something like a weighted term-document matrix?
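One direction along those lines is class-aware term selection instead of the purely frequency-based findFreqTerms step. The following is only a sketch of that idea, not code from the post; the helper name select_discriminative_terms and the cutoff top_n = 300 are made up for illustration. It scores each term by how strongly its presence is associated with the class label (a chi-squared statistic) and keeps the highest-scoring terms as the dictionary:

select_discriminative_terms <- function(dtm, labels, top_n = 300) {
  presence <- as.matrix(dtm) > 0                         # binary term presence
  scores <- apply(presence, 2, function(term_col) {
    tab <- table(factor(term_col, levels = c(FALSE, TRUE)), labels)
    unname(suppressWarnings(chisq.test(tab)$statistic))  # association with the class
  })
  ranked <- sort(scores, decreasing = TRUE)              # NaN scores are dropped by sort
  names(ranked)[seq_len(min(top_n, length(ranked)))]
}

# used in place of findFreqTerms:
# sms_dict  <- select_discriminative_terms(sms_dtm_train, sms_raw_train$type)
# sms_train <- DocumentTermMatrix(sms_corpus_clean_train, list(dictionary = sms_dict))
# sms_test  <- DocumentTermMatrix(sms_corpus_clean_test,  list(dictionary = sms_dict))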

Edit:

I calculated the weights for the four values as Cihan Ceyhan suggested:

prop.table(table(sms_raw_train$type))
         -1           0           1           9 
0.025773196 0.005154639 0.180412371 0.788659794 

modelweights <- ifelse(sms_raw_train$type == -1,
                       (1 / table(sms_raw_train$type)[1]) * 0.25,
                ifelse(sms_raw_train$type == 0,
                       (1 / table(sms_raw_train$type)[2]) * 0.25,
                ifelse(sms_raw_train$type == 1,
                       (1 / table(sms_raw_train$type)[3]) * 0.25,
                ifelse(sms_raw_train$type == 9,
                       (1 / table(sms_raw_train$type)[4]) * 0.25, 9))))
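For what it's worth, these weights can also be written more compactly (my reformulation, not from the post): each observation gets 0.25 divided by its class count, so every class contributes the same total weight of 0.25.

class_counts <- table(sms_raw_train$type)
modelweights <- 0.25 / as.numeric(class_counts[as.character(sms_raw_train$type)])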

But the result is not better; the accuracy is 0.5677:

              Reference
    Prediction -1  0  1  9
            -1  1  0  1  1
            0   1  0  1  0
            1  11  3 32 20
            9   7  5 33 76

So maybe it's a better idea to calculate the results for every value separately and then combine the results, as in the second solution that was posted.
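A rough sketch of that one-vs-rest idea (my assumption of what combining the separate models could look like, not code from the post): fit one binary Naive Bayes model per value and, for each test document, pick the value whose binary model gives it the highest probability.

levels_to_fit <- c("-1", "0", "1", "9")

# one binary "this value vs. the rest" model per value
binary_models <- lapply(levels_to_fit, function(lvl) {
  y <- factor(ifelse(sms_raw_train$type == lvl, "target", "other"),
              levels = c("other", "target"))
  train(sms_train, y, method = "nb", trControl = ctrl)
})

# probability of "target" from each binary model, one column per value
probs <- sapply(binary_models,
                function(m) predict(m, sms_test, type = "prob")[, "target"])
colnames(probs) <- levels_to_fit

# combined prediction: the value with the highest one-vs-rest probability
combined_pred <- factor(levels_to_fit[max.col(probs)], levels = levels_to_fit)
confusionMatrix(combined_pred, sms_raw_test$type)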

Accuracy is a misleading metric to use here. In the multi-class confusion matrix you posted, you have ~89% accuracy if you only look at label -1 versus the others, because you predict -1 only once and misclassify -1's as others 20 times (9+11). For all other cases you classify the -1 vs. others problem correctly, so 171/192 ≈ 89% accuracy. But of course this doesn't mean the model is working as intended; it is simply predicting "others" for almost all cases. This mechanic is also the reason why you see higher accuracy numbers in the single-label classifications.
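One way to see this directly from the fitted model (a sketch, assuming cm1 is the confusionMatrix object from the question and a reasonably recent caret version) is to look at the per-class metrics instead of the overall accuracy:

# sensitivity is (near) zero for the minority classes "-1" and "0":
# the model almost never predicts them, even though accuracy looks OK
cm1$byClass[, c("Sensitivity", "Specificity", "Balanced Accuracy")]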

See here for a good overview on the class imbalance problem, and potential ways to mitigate it.
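For example (a sketch based on the caret workflow in the question, not something spelled out in the linked overview), caret can rebalance the classes inside each resampling iteration, e.g. by down-sampling the majority class:

# down-sample the majority class within each CV fold
ctrl_down <- trainControl(method = "cv", number = 10, sampling = "down")
set.seed(8)
sms_model_down <- train(sms_train, sms_raw_train$type, method = "nb",
                        trControl = ctrl_down)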

Also this thread is very relevant to your case, so I suggest you take a look.
