Question on sentiment analysis

I have a question about sentiment analysis that I need help with.

Right now, I have a bunch of tweets that I gathered through the Twitter search API. Because I chose the search terms, I know which subjects or entities (person names) I want to look at. I want to know how others feel about these people.

For starters, I downloaded a list of English words with known valence/sentiment scores and calculated the sentiment (+/-) of each tweet based on which of these words appear in it. The problem is that a sentiment calculated this way reflects the tone of the tweet rather than the opinion ABOUT the person.

For instance, I have this tweet:

 "lol... Person A is a joke. lmao!" 

The message is obviously in a positive tone, but Person A should get a negative score.
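
To make the mismatch concrete, here is a minimal sketch of the word-list scoring described above. The lexicon entries and their scores are hypothetical placeholders, not values from any published word list, and the function names are mine.

```python
import re

# Hypothetical valence lexicon; a real word list would have thousands of entries.
VALENCE = {"lol": 2.0, "lmao": 2.0, "joke": 1.0, "awful": -2.0}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tweet_valence(text):
    """Sum the valence of every lexicon word that appears in the tweet."""
    return sum(VALENCE.get(tok, 0.0) for tok in tokenize(text))

tweet = "lol... Person A is a joke. lmao!"
print(tweet_valence(tweet))  # positive, although the opinion about Person A is negative
```

The score only measures which words are present, so it cannot tell who or what those words are aimed at.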

To improve my sentiment analysis, I can probably take into account negation and modifiers from my word list. But how exactly can I get my sentiment analysis to look at the subject of the message (and possibly sarcasm) instead?
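
For the negation and modifier part, a rough sketch could look like the following. The word sets, the weights and the two-token lookback are all assumptions chosen for illustration. It does nothing about the subject of the message or about sarcasm.

```python
import re

# All entries below are hypothetical examples, not a real lexicon.
VALENCE = {"good": 2.0, "funny": 2.0, "bad": -2.0}
NEGATORS = {"not", "no", "never"}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "slightly": 0.5}

def score_with_modifiers(text, lookback=2):
    """Sum word valences, flipping or scaling each by the modifiers just before it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in VALENCE:
            continue
        value = VALENCE[tok]
        # Inspect up to `lookback` tokens before the opinion word.
        for prev in tokens[max(0, i - lookback):i]:
            if prev in NEGATORS:
                value = -value
            elif prev in INTENSIFIERS:
                value *= INTENSIFIERS[prev]
        total += value
    return total

print(score_with_modifiers("very good"))
print(score_with_modifiers("not very good"))
```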

It would be great if someone could direct me towards some resources.

While you wait for answers from researchers in the AI field, I will give you some clues on what you can do quickly.

Even though this topic requires knowledge from natural language processing, machine learning and even psychology, you don't have to start from scratch unless you're desperate or have no trust in the quality of the research going on in the field.

One possible approach to sentiment analysis is to treat it as a supervised learning problem, where you have a small training corpus that includes human-made annotations (more about that later) and a testing corpus on which you test how well your approach/system performs. For training you will need a classifier, such as an SVM, an HMM or some other, but keep it simple. I would start with binary classification: good, bad. You could do the same for a continuous spectrum of opinion, from positive to negative, that is, to get a ranking, like Google, where the most valuable results come on top.
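
As an illustration of the supervised setup, here is a minimal sketch of a binary {good, bad} classifier trained on a few hand-labelled sentences. To keep it self-contained it uses a small naive Bayes model with add-one smoothing rather than an SVM or HMM. The training sentences, labels and function names are invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts, label_counts, vocab = {}, Counter(), set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(doc_count / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical human-annotated training corpus.
TRAIN = [
    ("what a great speech, loved it", "good"),
    ("brilliant and honest answer", "good"),
    ("he is a joke", "bad"),
    ("terrible, awful interview", "bad"),
]
model = train(TRAIN)
print(classify(model, "such a great answer"))
print(classify(model, "he is a joke"))
```

With a real corpus you would hold out part of the annotated data as the testing corpus and measure accuracy on it, instead of checking two sentences by eye.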

For a start, check the libsvm classifier; it is capable of doing both classification {good, bad} and regression (ranking). The quality of the annotations will have a massive influence on the results you get, but where do you get them from?

I found one project about sentiment analysis that deals with restaurants. There is both data and code, so you can see how they extracted features from natural language and which features scored high in the classification or regression. The corpus consists of opinions from customers about restaurants they recently visited, with feedback about the food, service or atmosphere. The connection between their opinions and the numerical world is expressed in the number of stars they gave to the restaurant. You have natural language on one side and the restaurant's rating on the other.

Looking at this example, you can devise your own approach to the problem stated. Take a look at nltk as well. With nltk you can do part-of-speech tagging and, with some luck, get names as well. Having done that, you can add a feature to your classifier that assigns a score to a name if, within n words (skip n-gram), there are words expressing opinions (look at the restaurant corpus). You could also use the weights you already have, but it's best to rely on a classifier to learn the weights; that's its job.
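
The "opinion words within n words of a name" feature could be sketched roughly as below. It assumes the name is already known (for instance from the search term) rather than found by a tagger, and it uses fixed, hypothetical weights where a classifier would normally learn them. The function names and the window size are mine.

```python
import re

# Hypothetical opinion weights; ideally a classifier would learn these.
OPINION = {"joke": -2.0, "great": 2.0, "lol": 1.0, "lmao": 1.0}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def find_target(tokens, target_tokens):
    """Return the start index of every occurrence of the target name."""
    k = len(target_tokens)
    return [i for i in range(len(tokens) - k + 1) if tokens[i:i + k] == target_tokens]

def target_feature(text, target, window=3):
    """Sum opinion weights of words at most `window` tokens away from the target."""
    tokens = tokenize(text)
    target_tokens = tokenize(target)
    k = len(target_tokens)
    score = 0.0
    for start in find_target(tokens, target_tokens):
        lo = max(0, start - window)
        hi = min(len(tokens), start + k + window)
        for j in range(lo, hi):
            if start <= j < start + k:
                continue  # skip the name itself
            score += OPINION.get(tokens[j], 0.0)
    return score

tweet = "lol... Person A is a joke. lmao!"
print(target_feature(tweet, "Person A", window=3))
```

With these made-up weights the tweet comes out negative for Person A, because "joke" falls inside the window and "lmao" does not. In practice this number would be one input feature among several, not the final verdict.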

In the current state of technology this is impossible.

English (and any other language) is VERY complicated and cannot be "parsed" yet by programs. Why? Because EVERYTHING has to be special-cased. Saying that someone is a joke is a special case of a joke, which is another exception in your program. Et cetera, etc., etc.

A good example (posted by ScienceFriction somewhere here on SO):

Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the brake system of the Toyota.

If you are willing to spend +/-40 years of your life on this subject, go ahead, it will be much appreciated :)

I don't entirely agree with what nightcracker said. I agree that it is a hard problem, but we are making good progress towards a solution.

For example, part-of-speech tagging might help you to figure out the subject, verb and object in a sentence. And n-grams might help you figure out the context in the Toyota vs. thriller example. Look at TagHelperTools. It is built on top of weka and provides part-of-speech and n-gram tagging.
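
To show what n-grams buy you in the Toyota vs. thriller case, here is a small plain-Python sketch (it does not use TagHelperTools or weka). The two example phrases are invented. The point is only that the bigram keeps "unpredictable" together with the word it describes.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

thriller = tokenize("an unpredictable plot")
toyota = tokenize("an unpredictable brake system")

print(ngrams(thriller, 2))
print(ngrams(toyota, 2))
```

A classifier that sees ("unpredictable", "plot") and ("unpredictable", "brake") as separate features can learn a different weight for each, which a single-word feature cannot.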

Still, it is difficult to get the results that the OP wants, but it won't take 40 years.

