
Problems with Naive Bayes

I'm trying to run Naive Bayes in R for making predictions from textual data (by building a Document Term Matrix).

I read several posts warning about terms that appear in the training set but not in the test set (or vice versa), so I decided to build everything from a single data frame and split it afterwards. The code I'm using is this:

data <- read.csv(file="path",header=TRUE)

########## NAIVE BAYES
library(e1071)
library(SparseM)
library(tm)

# CREATE TRAINING, TEST AND COMPLETE DATA FRAMES
# CONTAINING 'Text' AND 'InfoType' (columns 8 and 27)
traindata <- as.data.frame(data[13000:13999,c(8,27)])
testdata <- as.data.frame(data[14000:14999,c(8,27)])
complete <- as.data.frame(data[13000:14999,c(8,27)])

# EXTRACT THE TEXT AS A VECTOR: THE Corpus() CONSTRUCTOR
# TAKES A Source(), AND THE DOCUMENT TERM MATRIX
# IS BUILT FROM THE CORPUS
completevector <- as.vector(complete$Text)

# CREATE SOURCE FOR VECTORS
completesource <- VectorSource(completevector)

# CREATE CORPUS FOR DATA
completecorpus <- Corpus(completesource)

# LOWERCASE, STEM WORDS, REMOVE STOPWORDS, PUNCTUATION AND NUMBERS, TRIM WHITESPACE
# (content_transformer() wraps tolower so the corpus keeps its document structure)
completecorpus <- tm_map(completecorpus, content_transformer(tolower))
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords, stopwords("english"))
completecorpus <- tm_map(completecorpus, removePunctuation)
completecorpus <- tm_map(completecorpus, removeNumbers)
completecorpus <- tm_map(completecorpus, stripWhitespace)

# CREATE DOCUMENT TERM MATRIX
completematrix <- DocumentTermMatrix(completecorpus)
trainmatrix <- completematrix[1:1000,]
testmatrix <- completematrix[1001:2000,]

# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$InfoType),laplace=1)

# PREDICTION
results <- predict(model,as.matrix(testmatrix))
conf.matrix <- table(results, testdata$InfoType, dnn=list('predicted','actual'))

conf.matrix

The problem is that I'm getting weird results like this:

               actual
predicted    1   2   3
         1  60 833 107
         2   0   0   0
         3   0   0   0

Any idea why this is happening?

The raw data looks like this:

head(complete)

      Text
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick.
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer.  easy to clean and the mixer fits in perfectly
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer.  I can do milkshakes really easy and fast. Recommended. No problems with the shipping.
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice.
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it!
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well 

      InfoType
13000        2
13001        2
13002        2
13003        3
13004        2
13005        2

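Apparently the problem was that the document-term matrix was far too sparse. So I added this line right after creating completematrix, before splitting it into trainmatrix and testmatrix: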

completematrix<-removeSparseTerms(completematrix, 0.95)
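For anyone wondering what the 0.95 means: as far as I understand it, removeSparseTerms() drops every term that is missing from at least 95% of the documents. Here is a rough base-R sketch of that filter on a toy count matrix. The matrix, the term names and the variable names are all made up for illustration, and this is not tm's actual code.

```r
# Toy document-term counts: 40 documents (rows), 3 terms (columns).
# All names and numbers here are invented for illustration.
m <- cbind(
  great      = rep(c(1, 0), times = c(30, 10)),  # appears in 30 of 40 documents
  cup        = rep(c(2, 0), times = c(3, 37)),   # appears in 3 of 40 documents
  durability = rep(c(1, 0), times = c(1, 39))    # appears in 1 of 40 documents
)

sparse <- 0.95

# Number of documents in which each term occurs at least once.
doc_freq <- colSums(m > 0)

# Keep a term only if it occurs in more than (1 - sparse) of the documents,
# i.e. in more than 5% of them (more than 2 documents out of 40 here).
keep <- doc_freq > nrow(m) * (1 - sparse)
m_reduced <- m[, keep, drop = FALSE]

colnames(m_reduced)
```

In this toy case "durability" is dropped because it occurs in only 1 of 40 documents, while "great" and "cup" survive.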

And it started working!!

             actual
predicted   1   2   3
        1  60 511   6
        2   0  86   2
        3   0 236  99
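One plausible explanation of why the sparse terms hurt so much (my reading, not something I traced through e1071's source): the e1071 documentation says naiveBayes assumes a Gaussian distribution for metric predictors, and as.matrix(trainmatrix) is a matrix of numeric counts. A rare term that never occurs in some class then has a standard deviation of exactly 0 in that class, and a Gaussian with a tiny spread gives an enormous density to a count of 0. Summed over thousands of rare terms, that can push every prediction towards one class. The numbers below are invented just to show the effect.

```r
# Toy counts of one rare term in 8 training documents (made-up numbers).
counts <- c(0, 0, 0, 0,  0, 1, 0, 2)
class  <- factor(c(1, 1, 1, 1,  2, 2, 2, 2))

# Per-class mean and standard deviation: the two parameters of a Gaussian fit.
class_mean <- tapply(counts, class, mean)
class_sd   <- tapply(counts, class, sd)

# The term never occurs in class 1, so its standard deviation there is 0.
class_sd[["1"]]

# Gaussian density at a count of 0 as the standard deviation shrinks:
# the narrower the spread, the larger the density at the mean.
dens <- dnorm(0, mean = 0, sd = c(1, 0.1, 0.001))
dens
```

If that reading is right, it would also explain why laplace=1 made no visible difference: Laplace smoothing applies to categorical predictors, not to numeric ones modelled as Gaussians.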

Thank you all for your ideas (thank you Chelsey Hill!!)
