
naiveBayes using a word matrix and 3+ classes for prediction

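I'm having trouble understanding A) the output of naiveBayes and B) the predict() function for naiveBayes.

This is not my data, but it is a fun example of what I am trying to do and of the errors I am getting: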

require(RTextTools)
require(useful)
require(e1071)  # provides naiveBayes(); loaded explicitly in case RTextTools does not attach it

script <- data.frame(lines=c("Rufus, Brint, and Meekus were like brothers to me. And when I say brother, I don't mean, like, an actual brother, but I mean it like the way black people use it. Which is more meaningful I think","If there is anything that this horrible tragedy can teach us, it's that a male model's life is a precious, precious commodity. Just because we have chiseled abs and stunning features, it doesn't mean that we too can't not die in a freak gasoline fight accident",
                         "Why do you hate models, Matilda","What is this? A center for ants? How can we be expected to teach children to learn how to read... if they can't even fit inside the building?","Look, I think I know what this is about and I'm complimented but not interested.",
                         "Hi Derek! My name's Little Cletus and I'm here to tell you a few things about child labor laws, ok? They're silly and outdated. Why back in the 30s, children as young as five could work as they pleased; from textile factories to iron smelts. Yippee! Hurray!","Todd, are you not aware that I get farty and bloated with a foamy latte?","Oh, I'm sorry, did my pin get in the way of your ass? Do me a favor and lose five pounds immediately or get out of my building like now!",
                         "It's that damn Hansel! He's so hot right now!","Obey my dog!",
                         "I hear words like beauty and handsomness and incredibly chiseled features and for me that's like a vanity of self absorption that I try to steer clear of.","Yeah, you're cool to hide here, but first me and him got to straighten some shit out.",
                         "I wasn't like every other kid, you know, who dreams about being an astronaut, I was always more interested in what bark was made out of on a tree. Richard Gere's a real hero of mine. Sting. Sting would be another person who's a hero. The music he's created over the years, I don't really listen to it, but the fact that he's making it, I respect that. I care desperately about what I do. Do I know what product I'm selling? No. Do I know what I'm doing today? No. But I'm here, and I'm gonna give it my best shot.","I totally agree with you. But how do you feel about male models?",
                         "So I'm rappelling down Mount Vesuvius when suddenly I slip, and I start to fall. Just falling, ahh ahh, I'll never forget the terror. When suddenly I realize Holy shit, Hansel, haven't you been smoking Peyote for six straight days, and couldn't some of this maybe be in your head?"))

people <- as.factor(c("Zoolander","Zoolander","Zoolander","Zoolander","Zoolander",
                         "Mugatu","Mugatu","Mugatu","Mugatu","Mugatu",
                         "Hansel","Hansel","Hansel","Hansel","Hansel"))

script.doc.matrix <- create_matrix(script$lines,language = "english",removeNumbers=TRUE, removeStopwords = TRUE, stemWords=FALSE)
script.matrix <- as.matrix(script.doc.matrix)

nb.script <- naiveBayes(script.matrix,people)

nb.predict <- predict(nb.script,script$lines)
nb.predict

My questions:

A) naiveBayes output:

When I run

nb.script$tables

I get tables like this:

$young
           young
people      [,1]   [,2]
  Hansel     0.0 0.0000000
  Mugatu     0.2 0.4472136
  Zoolander  0.0 0.0000000

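How do I interpret this? I thought these were supposed to be probabilities, but I don't understand what each column, [,1] and [,2], means. Also, shouldn't the probabilities in these tables add up to 1.0? Why don't they? Would it make sense if there were a third column?

Should I be using type=raw in naiveBayes()?

B) Predicting with naiveBayes():

The output gives me Hansel as the prediction for every entry. I believe this is happening because Hansel is the first class alphabetically. If Hansel were listed 4x, Mugatu 6x, and Zoolander 5x in my class vector, the predict() function would end up giving me Mugatu as the prediction for every entry, because it is listed the most times in the class vector.

EDIT: So, to my question... how do I get predict() to give me an ACTUAL prediction?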

The output of the prediction looks like this:

> nb.predict
 [1] Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel
[12] Hansel Hansel Hansel Hansel
Levels: Hansel Mugatu Zoolander

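Here is a link to a similar question: "R: Naives Bayes classifier bases decision only on a-priori probabilities". However, the answer there didn't really help me much.

Thanks in advance!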

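For the first part of your question: the columns of the matrix script.matrix are numeric, and naiveBayes interprets numeric inputs as continuous data from a Gaussian distribution. The tables you see in your output give the sample mean (column 1) and standard deviation (column 2) of each numeric variable within each class of the factor. They are not probabilities, which is why the rows do not sum to 1.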
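As a quick check of that reading, the Mugatu row of the $young table can be reproduced in base R. This is only a sketch: the vector below is my own reconstruction of the counts, assuming "young" appears once in one of Mugatu's five lines and nowhere else.

```r
# Assumed counts of the word "young" in Mugatu's five lines
# (it occurs once, in "children as young as five")
young_mugatu <- c(1, 0, 0, 0, 0)

m <- mean(young_mugatu)  # column [,1] of the table: 0.2
s <- sd(young_mugatu)    # column [,2] of the table: 0.4472136

print(c(mean = m, sd = s))
```

Both numbers match the Mugatu row in the question, which supports the mean/standard deviation interpretation.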

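What you probably want is for naiveBayes to recognize that your input variables are indicators. One simple way to do that is to convert the whole script.matrix to a character matrix: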

# Convert columns to characters    
script.matrix <- apply(as.matrix(script.doc.matrix),2,as.character)

With this change, and after refitting the model on the character matrix:

> nb.script <- naiveBayes(script.matrix, people)
> nb.script$tables$young
           young
people        0   1
  Hansel    1.0 0.0
  Mugatu    0.8 0.2
  Zoolander 1.0 0.0
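For a categorical input, the table is simply the relative frequency of each value within each class, so every row sums to 1. A minimal base-R sketch of the same calculation follows; the `young` vector is an assumed reconstruction of that column, not something taken from the fitted model.

```r
# Class labels in the same order as in the question
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))

# Assumed indicator for "young": present only in Mugatu's first line (position 6)
young <- as.character(c(rep(0, 5), 1, rep(0, 9)))

# Cross-tabulate class against indicator value, then normalize each row
cond <- prop.table(table(people, young), margin = 1)
print(cond)
```

Each row of `cond` is the estimated P(young = value | class), which is the kind of table you were expecting in part A.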

To see the predicted classes, pass the matrix (not the raw text in script$lines) as the new data:

> nb.predict <- predict(nb.script, script.matrix)
> nb.predict
 [1] Zoolander Zoolander Zoolander Zoolander Zoolander Mugatu    Mugatu   
 [8] Mugatu    Mugatu    Mugatu    Hansel    Hansel    Hansel    Hansel   
[15] Hansel   
Levels: Hansel Mugatu Zoolander
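As for why the original call returned Hansel for everything: my understanding is that when the new data has no columns matching the fitted variables (as with the raw text in script$lines), the score for each class reduces to its prior, and ties go to the first level. The base-R sketch below illustrates that behaviour under this assumption; it is not e1071's actual code, and `people2` is a hypothetical unbalanced label vector.

```r
# Balanced labels from the question: five lines per character
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))

# With no usable features, each class score is just its log prior
log_prior <- log(prop.table(table(people)))

# which.max() returns the first maximum, so a three-way tie goes to the
# alphabetically first level
balanced_pick <- names(which.max(log_prior))
print(balanced_pick)

# Hypothetical unbalanced labels: 4 Hansel, 6 Mugatu, 5 Zoolander
people2 <- factor(rep(c("Hansel", "Mugatu", "Zoolander"), times = c(4, 6, 5)))
unbalanced_pick <- names(which.max(log(prop.table(table(people2)))))
print(unbalanced_pick)
```

This matches both observations in the question: Hansel everywhere with balanced classes, and Mugatu everywhere once Mugatu is the most frequent label.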

To see the raw probabilities from the naiveBayes fit:

predict(nb.script, script.matrix, type='raw')
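Conceptually, each row returned by type='raw' holds one posterior probability per class, and the predicted class is the one with the largest value. A rough base-R sketch of that normalization step is below; the log scores are made-up numbers for a single hypothetical document, not output from the model above.

```r
# Hypothetical per-class scores for one document:
# log prior plus the summed log likelihoods of its features
log_score <- c(Hansel = -12.3, Mugatu = -9.8, Zoolander = -15.1)

# Subtract the maximum before exponentiating for numerical stability,
# then normalize so the posteriors sum to 1
post <- exp(log_score - max(log_score))
post <- post / sum(post)
print(post)

# The class prediction is the class with the largest posterior
predicted <- names(which.max(post))
print(predicted)
```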

