I am new to PySpark. Can anyone help me with a PySpark implementation of sentiment analysis? I have the plain Python implementation below. What changes need to be made?
import nltk
from nltk.classify import NaiveBayesClassifier

def format_sentence(sent):
    # Bag-of-words features: each token maps to True
    return {word: True for word in nltk.word_tokenize(sent)}

# print(format_sentence("The cat is very cute"))

pos = []
with open("./pos_tweets.txt") as f:
    for i in f:
        pos.append([format_sentence(i), 'pos'])

neg = []
with open("./neg_tweets.txt") as fp:
    for i in fp:
        neg.append([format_sentence(i), 'neg'])

# next, split the labeled data into training and test sets (80/20)
training = pos[:int(.8 * len(pos))] + neg[:int(.8 * len(neg))]
test = pos[int(.8 * len(pos)):] + neg[int(.8 * len(neg)):]

classifier = NaiveBayesClassifier.train(training)

example1 = "no!"
print(classifier.classify(format_sentence(example1)))
The pattern would typically be:

Convert your data into a Spark DataFrame:

df = spark.read.csv('./neg_tweets.txt')

You can do the train/test split here:

train, test = df.randomSplit([0.8, 0.2])
Find a suitable model. If Naive Bayes works for you, the PySpark import will look something like this:

from pyspark.ml.classification import NaiveBayes
Otherwise, for sentiment analysis specifically, there may not be one built into spark.ml/mllib; you may need to look at external projects.
Iterate on the model and its tuning parameters.
You can run an evaluator
for the metrics you decide are important to your problem. Some examples for binary classification
problems are here:
https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification
metrics = BinaryClassificationMetrics(predictionAndLabels)