
How to train the Stanford NLP Sentiment Analysis tool

Hello everyone! I'm using the Stanford CoreNLP package, and my goal is to perform sentiment analysis on a live stream of tweets.

Using the sentiment analysis tool as-is returns a very poor analysis of the text's "attitude": many positive texts are labeled neutral, and many negative ones are rated positive. I've gone ahead and acquired well over a million tweets in a text file, but I haven't a clue how to actually train the tool and create my own model.

Link to the Stanford Sentiment Analysis page

"Models can be retrained using the following command using the PTB format dataset:"

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

Sample from dev.txt (the leading 4 represents polarity out of 5, i.e. 4/5 positive):

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

Sample from test.txt:

(3 (3 (2 If) (3 (2 you) (3 (2 sometimes) (2 (2 like) (3 (2 to) (3 (3 (2 go) (2 (2 to) (2 (2 the) (2 movies)))) (3 (2 to) (3 (2 have) (4 fun))))))))) (2 (2 ,) (2 (2 Wasabi) (3 (3 (2 is) (2 (2 a) (2 (3 good) (2 (2 place) (2 (2 to) (2 start)))))) (2 .)))))

Sample from train.txt:

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

I have two questions going forward.

What is the significance of, and difference between, each file (train.txt / dev.txt / test.txt)?

How would I train my own model with a raw, unparsed text file full of tweets?

I'm very new to NLP, so if I am missing any required information, or anything at all, please critique! Thank you!

What is the significance of, and difference between, each file (train.txt / dev.txt / test.txt)?

This is standard machine learning terminology. The train set is used to (surprise, surprise) train a model. The development set is used to tune any parameters the model might have. What you would normally do is pick a parameter value, train a model on the training set, and then check how well the trained model does on the development set. You then pick another parameter value and repeat. This procedure helps you find reasonable parameter values for your model.

Once this is done, you proceed to test how well the model does on the test set. This data is unseen: your model has never encountered any of it before. It is important that the test set is separate from the training and development sets; otherwise you are effectively evaluating the model on data it has already seen, which will not give you an idea of how well the model really does.
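To make the three-way split concrete, here is a minimal sketch in plain Java (no Stanford dependencies; the 80/10/10 ratio and the class/method names are illustrative assumptions, not anything the training command requires):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DatasetSplit {
    // Shuffle the corpus, then cut it into ~80% train, ~10% dev, ~10% test.
    public static List<List<String>> split(List<String> examples, long seed) {
        List<String> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));
        int trainEnd = (int) (shuffled.size() * 0.8);
        int devEnd = (int) (shuffled.size() * 0.9);
        List<List<String>> parts = new ArrayList<>();
        parts.add(shuffled.subList(0, trainEnd));             // -> train.txt
        parts.add(shuffled.subList(trainEnd, devEnd));        // -> dev.txt
        parts.add(shuffled.subList(devEnd, shuffled.size())); // -> test.txt
        return parts;
    }

    public static void main(String[] args) {
        List<String> corpus = new ArrayList<>();
        for (int i = 0; i < 100; i++) corpus.add("tweet " + i);
        List<List<String>> parts = split(corpus, 42L);
        System.out.println(parts.get(0).size() + " " + parts.get(1).size()
                + " " + parts.get(2).size());
        // prints "80 10 10"
    }
}
```

Shuffling before splitting matters: if the corpus is ordered (say, by date), an unshuffled split would put systematically different data in the test set.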

How would I train my own model with a raw, unparsed text file full of tweets?

You can't, and you shouldn't, train using an unparsed set of documents. The entire point of the recursive deep model (and the reason it performs so well) is that it can learn from the sentiment annotations at every level of the parse tree. The sentence you have given above can be formatted like this:

(4 
    (4 
        (2 A) 
        (4 
            (3 (3 warm) (2 ,)) (3 funny)
        )
    ) 
    (3 
        (2 ,) 
        (3 
            (4 (4 engaging) (2 film)) (2 .)
        )
    )
)

Usually, a sentiment analyser is trained with document-level annotations: you only have one score, and this score applies to the document as a whole, ignoring the fact that phrases in the document may express different sentiments. The Stanford team put a lot of effort into annotating every phrase in the document for sentiment. For example, the word film on its own is neutral in sentiment: (2 film). However, the phrase engaging film is very positive: (4 (4 engaging) (2 film)).
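To see how those phrase-level labels are laid out in the bracketed format, here is a small sketch in plain Java (the class and method names are hypothetical) that reads the root sentiment and counts how many nodes carry each score in a PTB-style tree string:

```java
public class PtbLabels {
    // The root label is the digit immediately after the first '('.
    public static int rootLabel(String tree) {
        return Character.getNumericValue(tree.charAt(tree.indexOf('(') + 1));
    }

    // Count how many tree nodes carry each sentiment label 0-4.
    // Every node is written as "(<label> ...)", so each '(' is
    // immediately followed by that node's label digit.
    public static int[] countLabels(String tree) {
        int[] counts = new int[5];
        for (int i = 0; i < tree.length() - 1; i++) {
            if (tree.charAt(i) == '(' && Character.isDigit(tree.charAt(i + 1))) {
                counts[Character.getNumericValue(tree.charAt(i + 1))]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The dev.txt sample from the question.
        String tree = "(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) "
                    + "(3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))";
        System.out.println("root sentiment: " + rootLabel(tree));
        // prints "root sentiment: 4"
    }
}
```

Running countLabels on that sample shows 15 labelled nodes in a single sentence, which is exactly the extra supervision the recursive model exploits and a flat document-level label cannot provide.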

If you have labelled tweets, you can use any other document-level sentiment classifier. The relevant tag on Stack Overflow already has some very good answers, so I'm not going to repeat them here.

PS: Did you label the tweets you have? All 1 million of them? If you did, I'd like to pay you a lot of money for that file :)

If it helps, I got the C# code from Arachnode working very easily: a tweak or two to get the right paths for models and so on, but it then works great. What was missing was something about the right format for the input files. It's in the Javadoc, but for reference, for BuildBinarizedDataset it's something like:

2 line of text here

0 another line of text 

1 yet another line of text

etc

Building that file is pretty trivial, depending on what you're starting with (a database, an Excel file, whatever).
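A minimal sketch of producing that format in plain Java (the class and method names, and the 0/1/2 label scheme inferred from the sample above, are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BinarizedInput {
    // Format each labelled tweet as "<label><space><text>", one per line,
    // matching the sample input shape shown above for BuildBinarizedDataset.
    public static String toDatasetLines(Map<String, Integer> labelledTweets) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : labelledTweets.entrySet()) {
            sb.append(e.getValue()).append(' ').append(e.getKey()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order for a predictable file.
        Map<String, Integer> tweets = new LinkedHashMap<>();
        tweets.put("line of text here", 2);
        tweets.put("another line of text", 0);
        System.out.print(toDatasetLines(tweets));
        // prints:
        // 2 line of text here
        // 0 another line of text
    }
}
```

From there it is just a matter of writing the string to a file and passing that file to BuildBinarizedDataset.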
