Python for sentiment analysis

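I have the sample code below, which uses training and test data from the nltk corpus and prints the sentiment of the sentences. What I would like to do is replace the test dataset with arbitrary text.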

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

n_instances = 100

# Each document is represented by a tuple (sentence, label).
# The sentence is tokenized, so it is represented by a list of strings:
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]

# split subjective and objective instances to keep a balanced uniform class distribution
# in both train and test sets
train_subj_docs = subj_docs[:80]
test_subj_docs = subj_docs[80:100]
train_obj_docs = obj_docs[:80]
test_obj_docs = obj_docs[80:100]
training_docs = train_subj_docs+train_obj_docs
testing_docs = test_subj_docs+test_obj_docs


sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])

# simple unigram word features, handling negation
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

# apply features to obtain a feature-value representation of our datasets
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)

# train the Naive Bayes classifier on the training set
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)

# output evaluation results
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

So when I try to replace testing_docs with a variable that stores text, such as paragraph = "Hello World, this is a test dataset", I get this error message: ValueError: too many values to unpack (expected 2)

Does anyone know how to fix this error? Thanks.

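This is because testing_docs is not a string but a list of tuples. Print the value of testing_docs in the example, and if you want to replace it with paragraph, make sure paragraph uses the same format.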
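As a rough sketch of what "the same format" could look like: the comments in the question's code say each document is a (sentence, label) tuple whose sentence is a list of token strings. The snippet below wraps the paragraph that way. The whitespace tokenization via str.split() and the 'subj' label are placeholders I am assuming for illustration, not something the question specifies, and my_docs is just a hypothetical name.

```python
paragraph = "Hello World, this is a test dataset"

# Naive whitespace tokenization (an assumption; a proper tokenizer would
# also split off punctuation such as the comma after "World").
tokens = paragraph.split()

# Same shape as testing_docs: a list of (list_of_tokens, label) tuples.
# 'subj' is only a placeholder label for this illustration.
my_docs = [(tokens, 'subj')]

# Each item now unpacks cleanly into exactly two names.
for words, label in my_docs:
    print(len(words), label)
```

If the text has no known label and you only want a prediction, I believe SentimentAnalyzer also has a classify method that takes a single tokenized instance, but check the nltk documentation before relying on that.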

To understand the error, first read up on tuple unpacking.

This simple example reproduces it:

>>> a = 'abc'
>>> b,c=a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 2)

This is because the string above contains three values, so to unpack it you must assign it to three variables (i.e. b,c,d=a works).

But testing_docs actually looks more like

a = [
    ('a','subj'),
    ('b','subj'),
    ('c','obj')
]

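(Although I highly doubt that the first element of each tuple is a single character; according to the comments in the question's code it is a tokenized sentence, i.e. a list of strings.)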

My guess is that somewhere in the code there is a loop that tries to unpack each value of testing_docs into two variables, something like

for val, category in testing_docs:
    ...
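To make that guess concrete, here is a small self-contained sketch of such a loop. The function labels_of and the two sample lists are hypothetical names made up for this illustration; this is not the actual nltk code path. It unpacks each item into two names, which works for (tokens, label) tuples and raises ValueError when an item is a longer string.

```python
# Well-formed input: each item is a (tokens, label) tuple.
good_docs = [(['a', 'b'], 'subj'), (['c'], 'obj')]

# Malformed input: the item is a bare string, not a two-element tuple.
bad_docs = ["Hello World, this is a test dataset"]


def labels_of(docs):
    """Collect the labels by unpacking every item into two names."""
    return [category for val, category in docs]


print(labels_of(good_docs))

try:
    labels_of(bad_docs)
except ValueError as err:
    # The string has far more than two characters, so unpacking fails.
    message = str(err)
    print("ValueError:", message)
```

The fix is on the data side rather than in the loop: give it items that each have exactly two parts.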

