German Stemming for Sentiment Analysis in Python NLTK

I've recently begun working on a sentiment analysis project on German texts and I'm planning on using a stemmer to improve the results.

NLTK comes with a German Snowball Stemmer and I've already tried to use it, but I'm unsure about the results. Maybe it should be this way, but as a computer scientist and not a linguist, I have a problem with inflected forms of the same verb being stemmed to different stems.

Take the word "suchen" (to search), whose 1st person singular form ("suche") is stemmed to "such" but whose 3rd person singular form ("sucht") is stemmed to "sucht".
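A minimal sketch reproducing this with NLTK's German Snowball stemmer (exact outputs may vary slightly across NLTK versions, but this is the behaviour described above):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("german")
# Inflected forms of "suchen" end up with different stems:
print(stemmer.stem("suche"))   # "such"  (1st person singular)
print(stemmer.stem("sucht"))   # "sucht" (3rd person singular)
print(stemmer.stem("suchen"))  # "such"  (infinitive)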

I know there is also lemmatization, but no working German lemmatizer is integrated into NLTK as far as I know. There is GermaNet, but their NLTK integration seems to have been aborted.

Getting to the point: I would like inflected verb forms to be stemmed to the same stem, at the very least for regular verbs within the same tense. If this is not a useful requirement for my goal, please tell me why. If it is, do you know of any additional resources to use which can help me achieve this goal?

Edit: I forgot to mention, any software should be free to use for educational and research purposes.

As a computer scientist, you are definitely looking in the right direction to tackle this linguistic issue ;). Stemming is usually quite a bit more simplistic, and used for Information Retrieval tasks in an attempt to decrease the lexicon size, but it is usually not sufficient for more sophisticated linguistic analysis. Lemmatisation partly overlaps with the use case for stemming, but includes, for example, rewriting all verb inflections to the same root form (the lemma), as well as differentiating "work" as a noun from "work" as a verb (although this depends a bit on the implementation and quality of the lemmatiser). For this, it usually needs a bit more information (like POS tags or syntax trees), hence it takes considerably longer, which renders it less suitable for IR tasks, which typically deal with larger amounts of data.
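To illustrate the POS dependence mentioned above, here is a small sketch using NLTK's English WordNet lemmatizer (English, because, as the question notes, NLTK ships no German lemmatizer; it requires the wordnet corpus via nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The lemma depends on the part of speech supplied:
print(lemmatizer.lemmatize("working", pos="v"))  # "work"    (verb reading)
print(lemmatizer.lemmatize("working", pos="n"))  # "working" (noun reading)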

In addition to GermaNet (I didn't know it was aborted, but I never really tried it, because although it is free, you have to sign an agreement to get access to it), there is SpaCy, which you could have a look at: https://spacy.io/docs/usage/

Very easy to install and use. See the install instructions on the website, then download the German stuff using:

python -m spacy download de

then:

>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp('Wir suchen ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
... 
Wir 521 wir
suchen 1162 suchen
ein 486 ein
Beispiel 809 Beispiel
>>> doc = nlp('Er sucht ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
... 
Er 513 er
sucht 1901 sucht
ein 486 ein
Beispiel 809 Beispiel
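(A note for current installs, as an aside: since spaCy v3 the 'de' shortcut no longer exists; you would download a named German model and load it by that name instead, e.g.:)

python -m spacy download de_core_news_sm

>>> nlp = spacy.load('de_core_news_sm')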

As you can see, unfortunately it doesn't do a very good job on your specific example (suchen), and I'm not sure what the number represents (i.e. it must be the lemma id, but I'm not sure what other information can be obtained from it), but maybe you can give it a go and see if it helps you.

A good and easy solution is to use the TreeTagger. First you have to install the TreeTagger manually (which basically means unzipping the right zip file somewhere on your computer). You will find the binary distribution here: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Then you need to install a wrapper to call it from Python (the treetaggerwrapper package, e.g. via pip install treetaggerwrapper).

The following code lemmatizes a tokenized sentence with the wrapper:

import pprint
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='de')

# tokenized_sent is a list of tokens produced by your own tokenizer
# (e.g. NLTK's word_tokenize); tagonly=True makes the TreeTagger
# skip its own tokenization.
tokenized_sent = ['Er', 'sucht', 'ein', 'Beispiel']
tags = tagger.tag_text(tokenized_sent, tagonly=True)  # don't use the TreeTagger's tokenization!

pprint.pprint(tags)

You can also use a method from the treetaggerwrapper to make nice objects out of the TreeTagger's output:

tags2 = treetaggerwrapper.make_tags(tags)
pprint.pprint(tags2)
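For reference, tag_text returns raw tab-separated word/POS/lemma strings, and make_tags converts each into a small named tuple; a hedged sketch for extracting just the lemmas (the lemma field name is assumed from the treetaggerwrapper documentation):

# Sketch: collect the lemma of each successfully tagged token
# (assumes make_tags() yields Tag namedtuples with a 'lemma' field).
lemmas = [t.lemma for t in tags2 if hasattr(t, 'lemma')]
print(lemmas)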

That is all.
