
Sentiment analysis using pyspark

I am new to PySpark. Can anyone help me with a PySpark implementation of sentiment analysis? I have already done the plain Python implementation below. What changes need to be made?

import nltk
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from nltk.classify import NaiveBayesClassifier

def format_sentence(sent):
  return({word: True for word in nltk.word_tokenize(sent)})

#print(format_sentence("The cat is very cute"))

# one tweet per line; build (features, label) pairs for each class
pos = []
with open("./pos_tweets.txt") as f:
    for i in f:
        pos.append([format_sentence(i), 'pos'])

neg = []
with open("./neg_tweets.txt") as fp:
    for i in fp:
        neg.append([format_sentence(i), 'neg'])

# next, split labeled data into the training and test data
training = pos[:int((.8)*len(pos))] + neg[:int((.8)*len(neg))]
test = pos[int((.8)*len(pos)):] + neg[int((.8)*len(neg)):]

classifier = NaiveBayesClassifier.train(training)

example1 = "no!"

print(classifier.classify(format_sentence(example1)))

The pattern would typically be:

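  • Convert your data into a Spark DataFrame. For a file with one tweet per line, spark.read.text keeps each line whole, whereas spark.read.csv would split lines on commas:

    df = spark.read.text('./neg_tweets.txt')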

  • You can do the train/test split here:

    train, test = df.randomSplit([0.8, 0.2])

  • Find a suitable model. If naive Bayes works for you, it will look something like this:

    from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel

    Otherwise, spark.ml/mllib may not have a model built specifically for sentiment analysis, and you may need to look at external projects.

  • Iterate on the model and its tuning parameters.

  • Run an evaluator for the metrics you decide are important to your problem. Some examples for binary classification problems are here:

    https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification

    from pyspark.mllib.evaluation import BinaryClassificationMetrics
    # predictionAndLabels is an RDD of (score, label) pairs
    metrics = BinaryClassificationMetrics(predictionAndLabels)

