
OpenNLP vs Stanford CoreNLP

I've been doing a little comparison of these two packages and am not sure which direction to go in. What I am looking for, briefly, is:

  1. Named Entity Recognition (people, places, organizations and such).
  2. Gender identification.
  3. A decent training API.

From what I can tell, OpenNLP and Stanford CoreNLP expose pretty similar capabilities. However, Stanford CoreNLP looks like it has a lot more activity, whereas OpenNLP has only had a few commits in the last six months.

Based on what I saw, OpenNLP appears to be easier to train new models with and might be more attractive for that reason alone. However, my question is: what would others start with as the basis for adding NLP features to a Java app? I'm mostly worried about whether OpenNLP is "just mature" versus semi-abandoned.

In full disclosure, I'm a contributor to CoreNLP, so this is a biased answer. But here is my view on your three criteria:

  1. Named Entity Recognition: I think CoreNLP clearly wins here, both on accuracy and ease of use. For one, OpenNLP has a model per NER tag, whereas CoreNLP detects all tags with a single Annotator. Furthermore, temporal resolution with SUTime is a nice perk in CoreNLP. Accuracy-wise, my anecdotal experience is that CoreNLP does better on general-purpose text.

  2. Gender identification: I think both tools are kind of poorly documented on this front. OpenNLP seems to have a GenderModel class; CoreNLP has a gender Annotator.

  3. Training API: I suspect the OpenNLP training API is easier to use for non-off-the-shelf training. But if all you want to do is, e.g., train a model from a CoNLL file, both should be straightforward. Training speed tends to be faster with CoreNLP than with other tools I've tried, but I haven't benchmarked it formally, so take that with a grain of salt.
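To make points 1 and 2 concrete, here is a minimal sketch of a CoreNLP pipeline that does NER and gender tagging in a single pass. This assumes a recent CoreNLP release (3.9+, which introduced the `CoreDocument` wrapper API) with the default English models jar on the classpath; the exact behavior of the `gender` annotator varies somewhat by version (in current releases it labels tokens inside PERSON mentions), and the example sentence is of course arbitrary.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreEntityMention;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class CoreNlpDemo {
    public static void main(String[] args) {
        // A single "ner" annotator detects all entity types at once;
        // "gender" piggybacks on it to label tokens in PERSON mentions.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,gender");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument(
                "Mary Johnson joined IBM in New York on Monday.");
        pipeline.annotate(doc);

        // PERSON, ORGANIZATION, LOCATION, DATE, ... all from one annotator.
        for (CoreEntityMention mention : doc.entityMentions()) {
            System.out.println(mention.text() + "\t" + mention.entityType());
        }

        // Gender labels, where the annotator assigned one.
        for (CoreLabel token : doc.tokens()) {
            String gender = token.get(CoreAnnotations.GenderAnnotation.class);
            if (gender != null) {
                System.out.println(token.word() + "\t" + gender);
            }
        }
    }
}
```

For comparison, getting the same NER coverage from OpenNLP would mean loading one `TokenNameFinderModel` per entity type (`en-ner-person.bin`, `en-ner-location.bin`, and so on) and running a separate `NameFinderME` over the tokens for each.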

A bit late here, but I was recently looking at OpenNLP based just on the fact that Stanford is GPL licensed. If that's OK for your project, then Stanford is often referred to as the benchmark/state of the art for NLP.

That said, the performance of the pre-trained models will depend on your target text, as they are very domain specific. If your target text is similar to the data the models were trained against, then you should get decent results; if not, you will have to train the models yourself, and the outcome will depend on your training data.

A strength of OpenNLP is that it is very extensible, is written for easy use with other libraries, and has a good API for integration. Training is very simple with OpenNLP (once you have your training data); I wrote about it here, and even with a pretty lousy generated data set I was able to get OK results identifying foods. It is also very configurable: you can set all the parameters around training very easily, and there is a range of algorithms you can use (perceptron, max entropy, and in the snapshot version they have added Naive Bayes).
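As a sketch of how simple that training loop is, here is a hedged example using the OpenNLP 1.8-style API. `train.txt` is a hypothetical file in OpenNLP's name-sample format (one sentence per line, e.g. `<START:person> John Smith <END> works here .`), and the `Algorithm` key is the switch between the perceptron, max-entropy, and (where available) Naive Bayes trainers mentioned above.

```java
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class TrainNer {
    public static void main(String[] args) throws Exception {
        // Stream the training sentences out of a name-sample file.
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("train.txt"));
        try (ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8))) {

            // All the knobs live in TrainingParameters.
            TrainingParameters params = TrainingParameters.defaultParams();
            params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
            params.put(TrainingParameters.ITERATIONS_PARAM, "300");
            params.put(TrainingParameters.CUTOFF_PARAM, "3");

            TokenNameFinderModel model = NameFinderME.train(
                    "en", "person", samples, params, new TokenNameFinderFactory());

            // Save the model for later use with new NameFinderME(model).
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("en-ner-person-custom.bin"))) {
                model.serialize(out);
            }
        }
    }
}
```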

If you find that you do need to train the models yourself, I would consider trying out OpenNLP and seeing how it performs, just for comparison; with fine-tuning you can get pretty decent results.

That depends on your purpose and needs. What I know about these two is that OpenNLP is open source and CoreNLP, of course, is not.

But if you look at accuracy, Stanford CoreNLP has more accurate detection than OpenNLP. Recently I did a comparison of Part-of-Speech (POS) tagging for both, which is the most important part of any NLP task, and in my analysis the winner was CoreNLP.
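As a sketch of what such a POS comparison looks like in code, here are both taggers side by side. The model files are the standard downloadable ones: `en-pos-maxent.bin` for OpenNLP, and for CoreNLP the left3words tagger bundled in the models jar (the resource path shown is the CoreNLP 3.x one; newer releases moved it), so the paths may differ in your setup.

```java
// OpenNLP side
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
// CoreNLP side
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

import java.io.FileInputStream;
import java.io.InputStream;

public class PosCompare {
    public static void main(String[] args) throws Exception {
        String[] tokens = {"The", "quick", "brown", "fox", "jumps"};

        // OpenNLP: load the pre-trained model file explicitly.
        try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {
            POSTaggerME openNlpTagger = new POSTaggerME(new POSModel(in));
            String[] tags = openNlpTagger.tag(tokens);
            System.out.println(String.join(" ", tags));
        }

        // CoreNLP: the tagger model is loaded from the models jar.
        MaxentTagger coreNlpTagger = new MaxentTagger(
                "edu/stanford/nlp/models/pos-tagger/english-left3words/"
                + "english-left3words-distsim.tagger");
        System.out.println(coreNlpTagger.tagString("The quick brown fox jumps"));
    }
}
```

To score the comparison properly, you would run both over the same held-out tagged corpus and count agreement with the gold tags, rather than eyeballing single sentences.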

For NER as well, CoreNLP has more accurate results compared to OpenNLP.

So if you are just starting out, you can take up OpenNLP; later, if needed, you can migrate to Stanford CoreNLP.
