How to apply a Naive Bayes model to new data

Question

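I asked about this earlier this morning, but I deleted that question and am reposting it here with better wording.

I created my first machine learning model using training and test data. I got back a confusion matrix and saw some summary statistics.

I now want to apply the model to new data to make predictions, but I don't know how to go about it.

Context: predicting monthly "churn" cancellations. The target variable is churned, which has two possible labels, "churned" and "not churned".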

    head(tdata)
  months_subscription nvk_medium                                org_type     churned
1                  25       none                               Community not churned
2                   7       none                            Sports clubs not churned
3                  28       none                            Sports clubs not churned
4                  18    unknown Religious congregations and communities not churned
5                  15       none              Association - Professional not churned
6                   9       none              Association - Professional not churned

Here is my training and testing code:

 library("klaR")
 library("caret")

# import data
test_data_imp <- read.csv("tdata.csv")

# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]

#training
rn_train <- sample(nrow(tdata),
                   floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)

# testing
predictions <- predict(model, test)
confusionMatrix(test$churned, predictions$class)

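Everything works fine up to this point.

Now I have new data with the same structure and layout as tdata above. How can I apply my model to this new data to make predictions? Intuitively, I am looking for a new column, added with cbind, that holds the predicted class for each record.

I tried this: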

## prediction ##
# import data
data_imp <- read.csv("pdata.csv")
pdata <- data_imp[variables]

actual_predictions <- predict(model, pdata)

#append to data and output (as head by default)
predicted_data <- cbind(pdata, actual_predictions$class)

# output
head(predicted_data)

Which throws an error:

actual_predictions <- predict(model, pdata)
Error in object$tables[[v]][, nd] : subscript out of bounds
In addition: Warning messages:
1: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 1
2: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 2
3: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 3

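How do I apply my model to the new data? I would like a new data frame with a new column containing the predicted class.

** Following the comments, here are the head and str of the new data used for prediction **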

head(pdata)
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned
> str(pdata)
'data.frame':   6433 obs. of  4 variables:
 $ months_subscription: int  26 8 30 19 16 10 3 5 14 2 ...
 $ nvk_medium         : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
 $ org_type           : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
 $ churned            : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...

Answer 1

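This is most likely caused by a mismatch between the factor encoding in the training data (the variable tdata in your case) and the new data passed to the predict function (the variable pdata), typically because the new data contains factor levels that are not present in the training data. Keeping the feature encoding consistent is your responsibility, because the predict function does not check it. I therefore suggest that you double-check the levels of the features nvk_medium and org_type in both variables.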
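One way to carry out this check is to compare the factor levels column by column. The following is a minimal base R sketch, not code from the question: train_df and new_df are hypothetical stand-ins for the training data and pdata, and their values are made up for illustration.

```r
# Hypothetical stand-ins for the training data and the new data
train_df <- data.frame(
  org_type   = factor(c("Community", "Sports clubs", "Community")),
  nvk_medium = factor(c("none", "unknown", "none"))
)
new_df <- data.frame(
  org_type   = factor(c("Community", "Charity")),
  nvk_medium = factor(c("none", "none"))
)

# Names of the factor columns in the training data
factor_cols <- names(train_df)[sapply(train_df, is.factor)]

# For each factor column, the levels that occur in the new data
# but were never seen in the training data
unseen_levels <- lapply(
  setNames(factor_cols, factor_cols),
  function(col) setdiff(levels(new_df[[col]]), levels(train_df[[col]]))
)
print(unseen_levels)

# Stricter check: are the level vectors identical, including their order?
same_encoding <- sapply(
  factor_cols,
  function(col) identical(levels(new_df[[col]]), levels(train_df[[col]]))
)
print(same_encoding)
```

Any column with a non-empty entry in unseen_levels, or FALSE in same_encoding, is a candidate for the encoding mismatch described above.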

The error message:

 Error in object$tables[[v]][, nd] : subscript out of bounds

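is raised while evaluating a given feature (the v-th feature) of a data point, where nd is the numeric value of the factor corresponding to that feature. You also get warnings indicating that the posterior probability of every class is zero for data points ("observations") 1, 2 and 3, but it is not clear whether this is also related to the factor encoding.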
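To see why an unexpected factor code produces this message, here is a small base R sketch. It assumes, for illustration only, that the per-feature table has one row per class and one column per factor level seen in training; cond_table and its values are hypothetical and are not taken from the fitted model.

```r
# Hypothetical conditional probability table for one factor feature:
# rows are classes, columns are the four levels seen during training
cond_table <- matrix(
  c(0.2, 0.5, 0.2, 0.1,
    0.3, 0.3, 0.3, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("0", "1"), c("Miss", "Mr", "Mrs", "Nothing"))
)

# A code within the known levels selects one column of probabilities
nd_ok <- 4
print(cond_table[, nd_ok])

# A code beyond the known levels has no matching column, so indexing fails
nd_bad <- 5
result <- tryCatch(
  cond_table[, nd_bad],
  error = function(e) conditionMessage(e)
)
print(result)
```

The second lookup fails with the same "subscript out of bounds" condition that appears in the error above.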

To reproduce the error you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features somewhat similar to your data:

# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) )
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm=T)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age',    'Sex',   'Title', 'Survived')]

# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]

With:

> str(Data_train)

'data.frame':   656 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
 $ Age     : num  35 28 34 28 29 28 28 28 45 28 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...

> str(Data_test)

'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

Then everything works as expected:

model <- NaiveBayes(Survived ~ ., data = Data_train)

# This will work
pred_1 <- predict(model, Data_test)

> str(pred_1)
List of 2
 $ class    : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
  ..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
 $ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:657] "6" "7" "8" "9" ...
  .. ..$ : chr [1:2] "0" "1"

However, if the encoding is inconsistent, for example:

# Mess things up, by "displacing" the factor values (i.e., 'Nothing' 
# will now be encoded as number 5, which was not present in the 
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
    as.character(Data_test_2$Title), 
    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)

> str(Data_test_2)

'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

Then:

> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds
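A possible remedy, sketched here under assumptions rather than taken from klaR's documentation, is to re-encode every factor column of the new data with the levels of the training data before calling predict. The helper align_factor_levels is hypothetical (it is not part of klaR or caret), and the small data frames only mimic the Title column of the toy data.

```r
# Hypothetical helper: re-encode factor columns of new_df so that they use
# exactly the levels (and level order) of the matching columns in train_df.
# Values that never occurred in training become NA.
align_factor_levels <- function(new_df, train_df) {
  for (col in names(new_df)) {
    if (col %in% names(train_df) && is.factor(train_df[[col]])) {
      new_df[[col]] <- factor(
        as.character(new_df[[col]]),
        levels = levels(train_df[[col]])
      )
    }
  }
  new_df
}

# Toy data mimicking the displaced encoding shown above
train_df <- data.frame(Title = factor(c("Miss", "Mr", "Mrs", "Nothing")))
new_df <- data.frame(
  Title = factor(
    c("Mr", "Nothing", "Dr"),
    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
  )
)

aligned <- align_factor_levels(new_df, train_df)
print(levels(aligned$Title))      # "Miss" "Mr" "Mrs" "Nothing"
print(as.integer(aligned$Title))  # 2 4 NA ("Dr" was never seen in training)
```

With the levels aligned, a call such as pdata$predicted <- predict(model, align_factor_levels(pdata, train))$class should add the predicted class as a new column, which is the cbind-style result asked for in the question. Rows that end up with NA hold values the model has never seen; how predict treats them is worth checking before relying on those predictions.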

Solution 1
1 Accepted 2015-10-06 08:16:40
