Sentiment analysis using pyspark
Since I am completely new to pyspark, can anyone help me with a pyspark implementation of sentiment analysis? I have already done the Python implementation. Can anyone tell me what changes need to be made?
import nltk
from nltk.classify import NaiveBayesClassifier


def format_sentence(sent):
    # Bag-of-words features: every token in the sentence maps to True
    return {word: True for word in nltk.word_tokenize(sent)}

# print(format_sentence("The cat is very cute"))

# Load the labeled tweets, one tweet per line. In Python 3, pass the encoding
# to open() instead of using the Python 2 reload(sys)/setdefaultencoding hack.
pos = []
with open("./pos_tweets.txt", encoding="utf-8") as f:
    for line in f:
        pos.append([format_sentence(line), 'pos'])

neg = []
with open("./neg_tweets.txt", encoding="utf-8") as fp:
    for line in fp:
        neg.append([format_sentence(line), 'neg'])

# next, split labeled data into the training and test data
training = pos[:int((.8)*len(pos))] + neg[:int((.8)*len(neg))]
test = pos[int((.8)*len(pos)):] + neg[int((.8)*len(neg)):]

classifier = NaiveBayesClassifier.train(training)

example1 = "no!"
print(classifier.classify(format_sentence(example1)))
The pattern would typically be:
Convert your data into a Spark DataFrame:
df = spark.read.text('./neg_tweets.txt')  # one tweet per line; read.csv would split tweets on commas
You can do the train/test split here:
df.randomSplit([0.8, 0.2])
Find a suitable model. If naive Bayes works for you, it will look something like this:
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
Otherwise, for sentiment analysis there may not be a model built into spark.ml/mllib that fits precisely. You may need to look for external projects.
Iterate on the model and the tuning parameters.
You can run an evaluator for the metrics you decide are important to your problem. Some examples for binary classification problems are here:
https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification
metrics = BinaryClassificationMetrics(predictionAndLabels)