
Feature selection for Naive Bayes

I did a classification with Naive Bayes. The goal is to predict 4 factors from a text. The data looks like this:

 'data.frame':  387 obs. of  2 variables:
 $ reviewText: chr  "I love this. I have a D800. I am mention my camera to make sure that you understand that this product is not ju"| __truncated__ "I hate buying larger gig memory cards - because there's always that greater risk of losing the photos, and/or r"| __truncated__ "These chromebooks are really a pretty nice idea -- Almost no maintaince (no maintaince?), no moving parts, smal"| __truncated__ "Purchased, as this drive allows a much speedier read/write and is just below a full SSD (they need to drop the "| __truncated__ ...
 $ pragmatic : Factor w/ 4 levels "-1","0","1","9": 4 4 4 3 3 4 3 3 3...

I did the classification with the caret package. The code for the classification looks like this:

library(tm)        # Corpus, tm_map, DocumentTermMatrix
library(caret)     # createDataPartition, train, confusionMatrix
library(magrittr)  # the %>% pipe

# Note: sms_raw$text / sms_raw$type appear to correspond to the
# reviewText / pragmatic columns of the data frame shown above
sms_corpus <- Corpus(VectorSource(sms_raw$text))
sms_corpus_clean <- sms_corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords(kind = "en")) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

# Stratified 50/50 train/test split, applied consistently to the raw
# data, the cleaned corpus, and the document-term matrix
train_index <- createDataPartition(sms_raw$type, p = 0.5, list = FALSE)
sms_raw_train <- sms_raw[train_index, ]
sms_raw_test  <- sms_raw[-train_index, ]
sms_corpus_clean_train <- sms_corpus_clean[train_index]
sms_corpus_clean_test  <- sms_corpus_clean[-train_index]
sms_dtm_train <- sms_dtm[train_index, ]
sms_dtm_test  <- sms_dtm[-train_index, ]

# Keep only terms that occur at least 5 times in the training documents
sms_dict <- findFreqTerms(sms_dtm_train, lowfreq = 5)
sms_train <- DocumentTermMatrix(sms_corpus_clean_train, list(dictionary = sms_dict))
sms_test  <- DocumentTermMatrix(sms_corpus_clean_test, list(dictionary = sms_dict))

# Convert term counts to a binary Present/Absent factor
convert_counts <- function(x) {
    x <- ifelse(x > 0, 1, 0)
    factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}
# Apply column-wise, one column per dictionary term
sms_train <- sms_train %>% apply(MARGIN = 2, FUN = convert_counts)
sms_test  <- sms_test %>% apply(MARGIN = 2, FUN = convert_counts)


ctrl <- trainControl(method="cv", 10)
set.seed(8)
sms_model1 <- train(sms_train, sms_raw_train$type, method="nb",
                trControl=ctrl)


sms_predict1 <- predict(sms_model1, sms_test)
cm1 <- confusionMatrix(sms_predict1, sms_raw_test$type)

When I use this model in that way, meaning I do the prediction for all of the 4 variables at the same time, I get a low Accuracy: 0.5469. The confusion matrix looks like this:

          Reference
Prediction -1  0  1  9
        -1  0  0  1  0
        0   0  0  0  0
        1   9  5 33 25
        9  11  3 33 72

When I do the prediction for each of the 4 variables separately, I get a much better result. The code for the classification is the same as above, but instead of df$sensorial <- factor(df$sensorial) I do df$sensorial <- as.factor(df$sensorial == 9). For the other variables, I use 1, -1 or 0 instead of the 9. If I do it that way I get an Accuracy: 0.772 for the 9, an Accuracy: 0.829 for the -1, an Accuracy: 0.9016 for the 0 and an Accuracy: 0.7959 for the 1. In addition, the result is far better. So it must have something to do with feature selection. The reason for the different results might be that the features are often the same for the different values. So, a possible solution could be to give more importance to those features which occur only in the presence of a certain value but not in the presence of the others. Is there a way to select the features in such a way that the model will be better if I do the prediction for all the 4 variables simultaneously? Something like a weighted term-document matrix?
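For reference, the separate, one-vs-rest setup described above can be written as a short loop. This is only a sketch: it reuses the sms_train/sms_test matrices and ctrl from the code above, and the names levels_to_test and accuracies are illustrative:

# One binary "level vs rest" Naive Bayes model per level (sketch)
levels_to_test <- c("-1", "0", "1", "9")
accuracies <- sapply(levels_to_test, function(lvl) {
    y_train <- factor(sms_raw_train$type == lvl)  # TRUE = this level
    y_test  <- factor(sms_raw_test$type == lvl)
    model <- train(sms_train, y_train, method = "nb", trControl = ctrl)
    pred  <- predict(model, sms_test)
    confusionMatrix(pred, y_test)$overall["Accuracy"]
})
accuracies  # one accuracy per level, as reported above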

Edit:

I calculated the weights for the four values as Cihan Ceyhan suggested:

prop.table(table(sms_raw_train$type))
         -1           0           1           9 
0.025773196 0.005154639 0.180412371 0.788659794 

# Per-observation weights: each of the four classes gets a total
# weight of 0.25, regardless of its frequency in the training data
modelweights <- ifelse(sms_raw_train$type == -1,
             (1/table(sms_raw_train$type)[1]) * 0.25,
             ifelse(sms_raw_train$type == 0,
             (1/table(sms_raw_train$type)[2]) * 0.25,
             ifelse(sms_raw_train$type == 1,
             (1/table(sms_raw_train$type)[3]) * 0.25,
             ifelse(sms_raw_train$type == 9,
             (1/table(sms_raw_train$type)[4]) * 0.25, 9))))
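The weights then go into the model fit. A sketch of how I assume they are meant to be used, via train()'s weights argument (the name sms_model2 is illustrative, and note that not every caret model type supports case weights):

# Sketch: pass the case weights to the fit via train()'s `weights`
# argument (assumed usage; the exact call is not shown in the thread)
sms_model2 <- train(sms_train, sms_raw_train$type, method = "nb",
                    trControl = ctrl, weights = modelweights)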

But the result is not better: Accuracy: 0.5677

              Reference
    Prediction -1  0  1  9
            -1  1  0  1  1
            0   1  0  1  0
            1  11  3 32 20
            9   7  5 33 76

So maybe it's a better idea to calculate the results for every value separately and then sum the results up, as in the second solution that was posted.
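One way to combine the separate models is to take, for each document, the level whose one-vs-rest model assigns the highest probability. A sketch building on the loop above (the names models, probs and combined are illustrative):

# Combine the four binary models: pick the level with the highest
# predicted probability of "TRUE" for each test document
models <- lapply(levels_to_test, function(lvl) {
    y_train <- factor(sms_raw_train$type == lvl)
    train(sms_train, y_train, method = "nb", trControl = ctrl)
})
probs <- sapply(models, function(m) predict(m, sms_test, type = "prob")[, "TRUE"])
combined <- factor(levels_to_test[max.col(probs)], levels = levels_to_test)
confusionMatrix(combined, sms_raw_test$type)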

Accuracy is a misleading metric to use here. In the multilabel confusion matrix you have posted, you have ~89% accuracy if you only look at label -1 versus others. You predict -1 only once, and misclassify -1's as others 20 times (9+11). For all other cases, you classify the -1 vs others problem correctly, so 171/192 ≈ 89% accuracy. But of course this doesn't mean the model is working as intended; it simply predicts others in almost all cases. This mechanic is also the reason why you see higher accuracy numbers in the single-label classifications.
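To make the arithmetic explicit, here is the same calculation on the matrix you posted (the cm object below simply copies those numbers):

# The 4-class confusion matrix from the question
cm <- matrix(c( 0, 0,  1,  0,
                0, 0,  0,  0,
                9, 5, 33, 25,
               11, 3, 33, 72),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = c("-1", "0", "1", "9"),
                             Reference  = c("-1", "0", "1", "9")))

# Collapse to the binary "-1 vs others" view
tp <- cm["-1", "-1"]      # true -1 predicted as -1: 0
tn <- sum(cm[2:4, 2:4])   # others predicted as others: 171
(tp + tn) / sum(cm)       # ~0.89, despite almost never predicting -1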

See here for a good overview of the class imbalance problem, and potential ways to mitigate it.
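One common mitigation for class imbalance is resampling. For example, caret's upSample resamples the minority classes with replacement until all classes match the majority class size. A minimal sketch, assuming the training objects from the question (the names up and sms_model_up are illustrative):

# Up-sample minority classes before training (sketch)
up <- upSample(x = as.data.frame(sms_train),
               y = sms_raw_train$type, yname = "type")
table(up$type)  # all four classes now have the same count
sms_model_up <- train(up[, setdiff(names(up), "type")], up$type,
                      method = "nb", trControl = ctrl)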

Also this thread is very relevant to your case, so I suggest you take a look.
