How to translate words in NLTK swadesh corpus regardless of case - python

I'm new to Python and natural language processing, and I'm trying to learn using the NLTK book. I'm doing the exercises at the end of Chapter 2, and there is a question I'm stuck on: "In the discussion of comparative wordlists, we created an object called translate which you could look up using words in both German and Italian in order to get corresponding words in English. What problem might arise with this approach? Can you suggest a way to avoid this problem?"

The book had me use the swadesh corpus to create a 'translator', as follows:

from nltk.corpus import swadesh
fr2en = swadesh.entries(['fr', 'en'])
de2en = swadesh.entries(['de', 'en'])
es2en = swadesh.entries(['es', 'en'])
translate = dict(fr2en)
translate.update(dict(de2en))
translate.update(dict(es2en))

One problem I saw is that when you translate the German word for dog (Hund) to English, only the capitalized form works: translate['Hund'] returns 'dog', while translate['hund'] raises KeyError: 'hund'.

Is there a way to make the translator translate words regardless of case? I've been playing around with it, trying things like translate.update(dict(de2en.lower)), but to no avail. I feel like I'm missing something obvious. Could anyone help me?

Thanks!

Ah, the infamous capitalization of nouns in German (see http://german.about.com/library/weekly/aa020919a.htm).

You could try a list comprehension that lowercases each token from the swadesh corpus:

>>> from nltk.corpus import swadesh
>>> de2en = [(i.lower(),j.lower()) for i,j in swadesh.entries(['de','en'])]
>>> translate = dict(de2en)
>>> translate['hund']
u'dog'
>>> translate['Hund']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Hund'

But now you have lost the capitalized key. To resolve that, you can update the translate dictionary again with the original swadesh entries, so that both the lowercased and the original keys are present:

>>> from nltk.corpus import swadesh
>>> de2en = [(i.lower(),j.lower()) for i,j in swadesh.entries(['de','en'])]
>>> translate = dict(de2en)
>>> translate.update(swadesh.entries(['de','en']))
>>> translate['hund']
u'dog'
>>> translate['Hund']
u'dog'
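
Note that with the two-key approach only the exact original spelling and the all-lowercase spelling are recognized, so translate['HUND'] would still fail, and lowercasing the English side also changes values such as 'I'. A possible alternative, sketched below, is to lowercase only the keys and normalize the query at lookup time. The names build_lookup and translate_word are hypothetical helpers, and sample_de2en is a small hand-written stand-in for the list of tuples that swadesh.entries(['de', 'en']) returns, so the snippet runs without the corpus installed.

```python
# Hypothetical sample standing in for swadesh.entries(['de', 'en']),
# which returns a list of (source_word, english_word) tuples.
sample_de2en = [("Hund", "dog"), ("ich", "I"), ("Wasser", "water")]


def build_lookup(pairs):
    """Map each source word, lowercased, to its unchanged English word."""
    return {source.lower(): english for source, english in pairs}


def translate_word(lookup, word):
    """Look up `word` ignoring case; raises KeyError for unknown words."""
    return lookup[word.lower()]


lookup = build_lookup(sample_de2en)
print(translate_word(lookup, "Hund"))
print(translate_word(lookup, "hund"))
print(translate_word(lookup, "HUND"))
```

With the real corpus you would pass swadesh.entries(['de', 'en']) to build_lookup instead of the sample list. The trade-off is that lookups go through the helper rather than plain translate[...] indexing.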
