
Classifying negative and positive words in large files?

Original post: 2018-11-01 13:45:10 · Tags: nlp, nltk, sentiment-analysis, wordnet, senti-wordnet

I am trying to count the positive and negative words in a very large file. I only need a primitive approach (one that does not take ages). I have tried sentiwordnet but keep getting an IndexError: list index out of range, which I think is due to the words not being listed in the wordnet dictionary. The text contains a lot of typos and 'non-words'.

If someone could give any suggestions, I would be very grateful!

1 answer

It all depends on what your data is like and what the final objective of your task is. You need to give us a more detailed description of your project but, in general, here are your options:

- Make your own sentiment analysis dictionary: I really doubt this is what you want to do, since it takes a lot of time and effort, but if your data is simple enough it's doable.
- Clean your data: if your tokens aren't in senti-wordnet because there's too much noise and too many misspelled words, then try to correct them before passing them through wordnet. It will at least limit the number of errors you'll get.
- Use a senti-wordnet alternative: granted, there aren't that many good ones, but you can always try sentiment_classifier or nltk's sentiment if you're using Python (which, by the looks of your error, you are).
- Classify only what you can: this is what I would recommend. If the word is not in senti-wordnet, then move on to the next one. Just catch the error ( try: ... except IndexError: pass ) and try to infer the general sentiment of the data by counting the sentiment words you actually catch.
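The last option above can be sketched roughly as follows. This is only an illustration under stated assumptions: `count_sentiment`, `toy_lookup` and `TOY_SCORES` are made-up names, and the toy table is a stand-in for whatever lexicon you actually query. The sketch assumes the lookup returns a list of (positive, negative) score pairs and an empty list for unknown words, so that taking element [0] of an unknown word raises IndexError.

```python
import re
from collections import Counter

# Toy stand-in for a real sentiment lexicon (hypothetical data, illustration only).
TOY_SCORES = {
    "good": [(0.75, 0.0)],
    "great": [(0.8, 0.0)],
    "bad": [(0.0, 0.625)],
    "awful": [(0.0, 0.875)],
}


def toy_lookup(word):
    """Return a list of (positive, negative) scores; empty if the word is unknown."""
    return TOY_SCORES.get(word, [])


def count_sentiment(lines, lookup=toy_lookup):
    """Count positive and negative words, skipping anything the lexicon lacks."""
    counts = Counter()
    for line in lines:  # an open file object works here, so lines are read lazily
        for word in re.findall(r"[a-z']+", line.lower()):
            try:
                pos, neg = lookup(word)[0]  # first entry only; empty list raises IndexError
            except IndexError:
                counts["unknown"] += 1  # typo or non-word: note it and move on
                continue
            if pos > neg:
                counts["positive"] += 1
            elif neg > pos:
                counts["negative"] += 1
    return counts
```

Because the function iterates line by line, you can pass it an open file handle (for example inside a `with open(...) as f:` block) without loading the whole file into memory. Keeping an "unknown" tally also tells you how much of the text the lexicon failed to cover.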

PS: We would need to see your code to be sure, but I think there's another reason why you're getting an IndexError. If the word were not in senti-wordnet you would be getting a KeyError, but it also depends on how you coded your function.
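To make the distinction concrete, here is a small plain-Python sketch (no NLTK involved; `describe_failure`, `scores_by_word` and `senses` are hypothetical names). A dictionary lookup on a missing key raises KeyError, while indexing into an empty list raises IndexError. So if your lookup returns an empty collection for an unknown word and your code then takes element [0], an IndexError is what you would see.

```python
def describe_failure(action):
    """Run action() and return the name of the exception it raises, or None."""
    try:
        action()
    except (KeyError, IndexError) as exc:
        return type(exc).__name__
    return None


scores_by_word = {"good": 0.75}  # hypothetical dict-style lexicon
senses = []                      # hypothetical empty result for an unknown word

print(describe_failure(lambda: scores_by_word["endng"]))  # KeyError
print(describe_failure(lambda: senses[0]))                # IndexError
```

Checking which of the two patterns your own function follows should tell you which exception to catch.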

Good luck and I hope this was helpful.