
Is there a way to get the “original” text data for OpenNLP?

Original post 2015-09-19 13:03:37 6 2 java/ nlp/ opennlp

I know this question has been asked before, but the answer was not satisfying (it was just a link).

So my question is: is there any way to extend the existing OpenNLP models? I already know about the technique using DBpedia/Wikipedia. But what if I just want to append a few lines of text to improve the models? Is there really no way? (If so, that would be really stupid...)

2 answers

Unfortunately, you can't. See this question, which has a detailed answer to the same problem.

I think that is a tough problem, because when you deal with texts you often run into licensing issues. For example, you cannot build a corpus from Twitter data and publish it to the community (see this paper for more information).

Therefore, companies often build domain-specific corpora and use them internally. We did this in our research project, for example, and built a tool (Quick Pad Tagger) to create annotated corpora efficiently (see here).

OK, I think this needs a separate answer. I found the Yago database: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago//

At first glance, this database seems fantastic. You can download all the tagged data and put it in a database (they already provide the tools for that).

The next step is to "refactor" the tagged entities so that OpenNLP can use them (OpenNLP uses a format like this: <START:person> Pierre Vinken <END>).
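To make the conversion step concrete, here is a minimal sketch in plain Java (no OpenNLP dependency) that wraps token-level entity spans in <START:type> ... <END> markers. The class name NameSampleFormatter, the EntitySpan helper, and the assumption that the tagged data is already available as whitespace-separable tokens plus token index ranges are all hypothetical; how Yago's data would actually map onto such spans is not covered here.

```java
import java.util.Arrays;
import java.util.List;

public class NameSampleFormatter {

    /** Hypothetical helper: an entity covering tokens [start, end) with a type such as "person". */
    static final class EntitySpan {
        final int start;
        final int end;
        final String type;

        EntitySpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Builds one training line: tokens separated by spaces, entities wrapped in markers. */
    static String annotate(String[] tokens, List<EntitySpan> spans) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            // Open a marker before the first token of each entity.
            for (EntitySpan s : spans) {
                if (s.start == i) {
                    sb.append("<START:").append(s.type).append("> ");
                }
            }
            sb.append(tokens[i]);
            // Close the marker after the last token of each entity.
            for (EntitySpan s : spans) {
                if (s.end == i + 1) {
                    sb.append(" <END>");
                }
            }
            if (i < tokens.length - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", ",", "61", "years", "old", ",",
                           "will", "join", "the", "board", "."};
        List<EntitySpan> spans = Arrays.asList(new EntitySpan(0, 2, "person"));
        // prints: <START:person> Pierre Vinken <END> , 61 years old , will join the board .
        System.out.println(annotate(tokens, spans));
    }
}
```

The sketch assumes spans do not overlap or nest; overlapping entities would need to be resolved before formatting.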

Then you create some text files and train a model on them with the training tool that ships with OpenNLP.
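Creating the text files could be sketched as follows. As far as I know, the OpenNLP name finder expects one tokenized sentence per line, with an empty line between documents. The class name TrainingFileWriter, the temporary file name, and the sample sentences are illustrative assumptions, not part of OpenNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TrainingFileWriter {

    /** Flattens documents into lines: one sentence per line, empty line between documents. */
    static List<String> toLines(List<List<String>> documents) {
        List<String> lines = new ArrayList<>();
        for (int d = 0; d < documents.size(); d++) {
            if (d > 0) {
                lines.add(""); // document separator
            }
            lines.addAll(documents.get(d));
        }
        return lines;
    }

    /** Writes the training data as UTF-8 text. */
    static void write(Path target, List<List<String>> documents) throws IOException {
        Files.write(target, toLines(documents), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Sample sentences, already annotated and tokenized (made up for illustration).
        List<List<String>> documents = Arrays.asList(
            Arrays.asList(
                "<START:person> Pierre Vinken <END> , 61 years old , will join the board .",
                "Mr . <START:person> Vinken <END> is chairman ."),
            Arrays.asList(
                "A second document starts here ."));

        Path out = Files.createTempFile("en-ner-person", ".train");
        write(out, documents);
        System.out.println("Wrote training data to " + out);
    }
}
```

The resulting file can then be passed to the command-line trainer, which in the OpenNLP manual looks like opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data en-ner-person.train -encoding UTF-8. The exact flags may differ between versions, so check the manual for the release you use. Note that this trains a new model from your own data; it does not extend the pre-built one.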

I'm not 100% sure this works, but I will come back and tell you.