
Problem when training Naive Bayes model in R

I am using the caret package (without much experience of using caret) to train my data with Naive Bayes, as outlined in the R code below. I am having a problem when executing "nb_model" because the data includes a column of sentences; it produces a series of error messages:

1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1
   Error in predict.NaiveBayes(modelFit, newdata) :
     Not all variable names used in object found in newdata

2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1
   Error in NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) :

Could you please suggest how to adapt the R code below to overcome this problem?

Dataset used in the R code below

A quick example of what the dataset looks like (10 variables):

  Over arrested at in | Negative | Negative | Neutral | Neutral | Neutral | Negative | Positive | Neutral | Negative
library(caret)

# Loading dataset
setwd("directory/path")
TrainSet = read.csv("textsent.csv", header = FALSE)

# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]

# Declaring the trainControl function
train_ctrl = trainControl(
  method  = "cv", # Specifying cross-validation
  number  = 3     # Specifying 3-fold
)

nb_model = train(
  V10 ~ .,        # Specifying the response variable and the feature variables
  method = "nb",  # Specifying the model to use
  data = train,
  trControl = train_ctrl
)

# Get the predictions of your model in the test set
predictions = predict(nb_model, newdata = test)

# See the confusion matrix of your model in the test set
confusionMatrix(predictions, test$V10)

The dataset is all character data. In that data there is a combination of easily encoded words ( V2 - V10 ) and sentences, on which you can do any amount of feature engineering and generate any number of features.

To learn about text mining, check out the tm package, its documentation, or blogs like hack-r.com for practical examples. Here is some GitHub code from the linked article.

OK, so first I set stringsAsFactors = F because your V1 has tons of unique sentences:

TrainSet <- read.csv(url("https://raw.githubusercontent.com/jcool12/dataset/master/textsentiment.csv?token=AA4LAP5VXI6I7FRKMT6HDPK6U5XBY"),
                     header = F,
                     stringsAsFactors = F)
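
As a quick check (my addition, not part of the original answer), you can confirm that the sentences now come in as character data rather than factors:

# Sanity check (illustrative): V1 should be character text,
# V2-V10 the label columns
str(TrainSet, list.len = 3)
class(TrainSet$V1)  # "character"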

library(caret)

Then I did the feature engineering:

## Feature Engineering
# V2 - V10
TrainSet[TrainSet=="Negative"] <- 0
TrainSet[TrainSet=="Positive"] <- 1

# V1 - not sure what you wanted to do with this
#     but here's a simple example of what 
#     you could do
TrainSet$V1 <- grepl("london", TrainSet$V1) # tests if london is in the string

Then it worked, although you will want to improve the engineering of V1 (or drop it) to get better results; one possible extra feature is sketched below.
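
One cheap improvement over the single grepl() flag is to add length-based features. The sketch below is my own illustration rather than part of the original answer, and it assumes you saved a copy of the raw sentences before the grepl() line overwrote V1:

# Illustrative only: assumes raw_text <- TrainSet$V1 was run *before*
# the grepl() line replaced V1 with a logical flag
TrainSet$word_count <- lengths(strsplit(raw_text, "\\s+"))  # words per sentence
TrainSet$char_count <- nchar(raw_text)                      # characters per sentence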

# In reality you could probably generate 20+ decent features from this text
#  word count, tons of stuff... see the tm package

# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]

# Declaring the trainControl function
train_ctrl = trainControl(
  method  = "cv", # Specifying cross-validation
  number  = 3     # Specifying 3-fold
)

nb_model = train(
  V10 ~ .,        # Specifying the response variable and the feature variables
  method = "nb",  # Specifying the model to use
  data = train,
  trControl = train_ctrl
)

# Resampling: Cross-Validated (3 fold)
# Summary of sample sizes: 799, 800, 801
# Resampling results across tuning parameters:
#
#   usekernel  Accuracy   Kappa
#   FALSE      0.6533444  0.4422346
#   TRUE       0.6633569  0.4185751

You will get some ignorable warnings in this basic example because so few sentences in V1 contain the word "london". I would suggest using that column for sentiment analysis, term frequency / inverse document frequency, etc.; a minimal starting point is sketched below.
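
As a concrete starting point for that suggestion, here is a minimal tf-idf sketch using the tm package. The preprocessing steps and the sparsity threshold are my assumptions, and raw_text is again the saved copy of the original sentences:

library(tm)

# Build a corpus from the raw sentences
corpus <- VCorpus(VectorSource(raw_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))

# Term frequency-inverse document frequency weighting
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, 0.99)  # drop very sparse terms

# These columns can then be cbind()-ed onto the V2-V9 label features
tfidf_features <- as.data.frame(as.matrix(dtm))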

