I am new to PySpark. Can anyone help me with a PySpark implementation of sentiment analysis? I have the plain Python implementation below. What changes need to be made?
import nltk
from nltk.classify import NaiveBayesClassifier

def format_sentence(sent):
    # Bag-of-words features: each token maps to True
    return {word: True for word in nltk.word_tokenize(sent)}

# print(format_sentence("The cat is very cute"))

pos = []
with open("./pos_tweets.txt") as f:
    for i in f:
        pos.append([format_sentence(i), 'pos'])

neg = []
with open("./neg_tweets.txt") as fp:
    for i in fp:
        neg.append([format_sentence(i), 'neg'])

# next, split the labeled data into training and test sets (80/20)
training = pos[:int(.8 * len(pos))] + neg[:int(.8 * len(neg))]
test = pos[int(.8 * len(pos)):] + neg[int(.8 * len(neg)):]

classifier = NaiveBayesClassifier.train(training)

example1 = "no!"
print(classifier.classify(format_sentence(example1)))
The pattern would typically be:

Convert your data into a Spark DataFrame:

df = spark.read.csv('./neg_tweets.txt')

You can do the train/test split here:

train, test = df.randomSplit([0.8, 0.2])
Find a suitable model. If Naive Bayes works for you, the PySpark import will look something like this:

from pyspark.ml.classification import NaiveBayes
Otherwise, for sentiment analysis specifically, there may not be one built into spark.ml/mllib; you may need to look at external projects.
Iterate on the model and its tuning parameters.
You can run an evaluator
for the metrics you decide are important to your problem. Some examples for binary classification
problems are here:
https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification
metrics = BinaryClassificationMetrics(predictionAndLabels)