Review star rating - prediction in R

I have a dataset of reviews with the following structure:

{
"reviewerID": "XXXX",
"asin": "12345XXX",
"reviewerName": "Paul",
"helpful": [2, 5],
"reviewText": "Nice product, works as it should.",
"overall": 5.0,
"summary": "Nice product",
"unixReviewTime": 1152700000,
"reviewTime": "08 14, 2010"
}

I have a bunch of reviews and would like to predict the rating from the text of the review ("reviewText") using some text mining techniques.

I would like to train a classifier and then get an accuracy measure of how well it works. The rating of each review is given ("overall").

So far I did the following:

Required packages (not all of them are needed)

library(plyr)
library(rjson)
library(magrittr)
library(lubridate)
library(stringi)
library(doSNOW)
library(tm)
library(NLP)
library(wordcloud)
library(SnowballC)
library(rpart)

The input data is available in JSON format:

[Image: sample input]

The reviewText column of this table is converted to a corpus.

Create a corpus and apply some pre-processing steps

corpus <- Corpus(VectorSource(tr.review.asin$reviewText))
# base functions such as tolower are wrapped in content_transformer()
# so that tm_map keeps the documents in the corpus intact
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)

Making a document term matrix

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)
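To make the two steps above more concrete, here is a minimal base R sketch of what a document-term matrix holds and what a sparsity threshold does. It does not use tm: the toy documents, the object names and the 0.5 threshold are made up for illustration, and the filter only approximates how removeSparseTerms is understood to behave (keep the terms that occur in more than a (1 - sparse) share of the documents).

```r
# Toy documents (hypothetical, already lower-cased and without punctuation)
docs <- c("nice product works well",
          "bad product broke fast",
          "nice price nice product")

# Split each document into tokens on whitespace
tokens <- strsplit(docs, "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells hold term counts
toy_dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(toy_dtm) <- paste0("doc", seq_along(docs))

# Share of documents in which each term occurs at least once
doc_share <- colMeans(toy_dtm > 0)

# Keep terms that occur in more than (1 - sparse) of the documents
# (assumed reading of the sparse argument, not taken from the question)
sparse   <- 0.5
toy_kept <- toy_dtm[, doc_share > (1 - sparse), drop = FALSE]
print(colnames(toy_kept))
```

Under that reading, a threshold of 0.999 on 7561 reviews would keep only terms found in at least 8 reviews.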

Creating a training and test set

dtmsparse <- as.data.frame(as.matrix(dtm))
train <- dtmsparse[1:6500,]
test <- dtmsparse[6501:7561,]
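The split above takes the first 6500 rows for training and the rest for testing. If the rows happen to be ordered (by product or by date, for instance), a random split may be more representative. A minimal base R sketch of that idea, using a made-up data frame as a stand-in for dtmsparse; the names toy, train_idx, toy_train and toy_test and the seed are hypothetical:

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Stand-in for the term counts plus the ratings (hypothetical toy data)
n <- 100
toy <- data.frame(term_a  = rpois(n, 1),
                  term_b  = rpois(n, 2),
                  overall = sample(1:5, n, replace = TRUE))

# Draw row indices for the training set (about 86%, close to 6500 of 7561)
train_idx <- sample(seq_len(n), size = round(0.86 * n))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two sets are disjoint and together cover all rows
print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

The same train_idx can be used to pick the matching ratings, so features and labels stay aligned.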

Creating a model

train$overall <- tr.review.asin[1:6500,]$overall
model <- rpart(overall ~., data = train, method= 'class')
mypred <- predict(model, newdata =test, type = 'class')

When I plot obs_test against mypred I get the following plot:

[Image: plot of obs_test and mypred]

Unfortunately I have no idea whether these last lines will lead me to a solution.

I would like a procedure for testing how well my model predicts (a comparison between the real overall rating and the predicted rating).

So it completely escaped my attention that you are actually dealing with a classification problem and not with regression, hence a complete edit.

To see how well a classification tree performs, you want to know how many instances (in the test data) were misclassified, i.e. the assigned class was not the same as the observed class. It is also informative to see how well the prediction model works on each individual class. Using the confusionMatrix function from the caret package you can do the following:

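 install.packages("caret")  # the package name is a quoted string
 library(caret)

 mypred <- predict(model, newdata = test, type = 'class')
 # confusionMatrix expects two factors with the same levels
 obs <- factor(tr.review.asin[6501:7561,]$overall, levels = levels(mypred))

 confusionMatrix(obs, mypred)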

You will get a confusion matrix and some statistics as output. The confusion matrix tells you for how many instances predictions and observations coincide in each class: these are the values on the diagonal. In general, the ij-th entry of the matrix tells you how many instances were classified as j whilst the real class was i. (caret documents the call as confusionMatrix(data, reference) with the predictions first, so with the call above, which passes the observed classes first, the rows labelled Prediction in the printout are actually the observed classes.)

In the Overall Statistics section of the confusionMatrix output you will see Accuracy: this is the proportion of instances in the test set that were classified correctly.
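As a cross-check on the Accuracy figure, the same number can be computed in base R from a plain cross-tabulation. A minimal sketch with made-up vectors; toy_obs and toy_pred are hypothetical stand-ins for obs and mypred:

```r
# Hypothetical observed and predicted star ratings for 10 test reviews
lv       <- as.character(1:5)
toy_obs  <- factor(c(5, 4, 5, 1, 3, 5, 2, 4, 5, 1), levels = lv)
toy_pred <- factor(c(5, 5, 5, 1, 4, 5, 1, 4, 4, 1), levels = lv)

# Rows = observed class, columns = predicted class
cm <- table(observed = toy_obs, predicted = toy_pred)
print(cm)

# Accuracy = correctly classified instances / all instances
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

# Equivalent shortcut without building the table
stopifnot(isTRUE(all.equal(accuracy, mean(toy_obs == toy_pred))))
```

Here 6 of the 10 toy reviews sit on the diagonal, so the accuracy is 0.6.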

Next, in the Statistics by Class section, the row named Pos Pred Value tells you what proportion of the observations in class x were classified correctly. (This reading relies on the observed classes having been passed as the first argument; with the documented order, confusionMatrix(mypred, obs), the same quantity would be reported as Sensitivity.) The function outputs a number of other useful statistics, which you can read up on online.
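The per-class figures can be derived by hand in the same spirit. The sketch below uses base R only and hypothetical toy vectors (toy_obs, toy_pred); it computes, for each class, the share of observed instances that were classified correctly and the share of predictions for that class that were right:

```r
# Hypothetical observed and predicted star ratings for 10 test reviews
lv       <- as.character(1:5)
toy_obs  <- factor(c(5, 4, 5, 1, 3, 5, 2, 4, 5, 1), levels = lv)
toy_pred <- factor(c(5, 5, 5, 1, 4, 5, 1, 4, 4, 1), levels = lv)

# Rows = observed class, columns = predicted class
cm <- table(observed = toy_obs, predicted = toy_pred)

# Share of each observed class that was classified correctly
per_class_hit <- diag(cm) / rowSums(cm)

# Share of the predictions for each class that were right
# (NaN for classes the model never predicted, here 2 and 3)
per_class_right <- diag(cm) / colSums(cm)

print(round(rbind(per_class_hit, per_class_right), 2))
```

A class the tree never predicts shows up as 0 in the first row and NaN in the second, which is a quick way to spot ratings the model ignores.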

I hope this helps.
