How can I make this Python2.6 function work with Unicode?

I've got this function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite reading the chapter on Unicode, I feel just as lost as before.

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

When I tried it the other day on Also Sprach Zarathustra, it clobbered words with an umlaut over the o's and u's. I'm sure some of you will know why that happened. I'm also sure that it's quite easy to fix. I know that it just has to do with calling a function that re-encodes the tokens into unicode strings. If so, it seems to me it might not happen inside that function definition at all, but here, where I prepare to write to file:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
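To see why the umlauts get clobbered: a plain byte string doesn't know that an "Ö" is a letter, so lower() leaves it alone (and a byte-oriented tokenizer can even split its two UTF-8 bytes apart). A minimal sketch of the difference — the byte literals work the same way on Python 2.6+ and 3:

```python
# "Öl" as raw UTF-8 bytes: 'Ö' is the two-byte sequence C3 96.
raw = b'\xc3\x96l'
uni = raw.decode('utf-8')   # a real unicode string

# Byte-wise lower() only maps ASCII A-Z, so the umlaut is untouched...
print(raw.lower())          # still b'\xc3\x96l'
# ...while a decoded string knows the full Unicode case rules.
print(uni.lower())          # u'\xf6l', i.e. "öl"
```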

I heard that what I had to do was encode the string into unicode after reading it from the file. I tried amending the function like so:

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

But that brought this error when I used it on Hungarian. When I used it on German, I had no errors.

>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 9, in openbookreturnvocab
    nltktext = nltk.Text(tokens)
  File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
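The traceback above boils down to one operation: nltk.Text's `__init__` calls `str()` on the tokens, which under Python 2 encodes with the default ascii codec. The failure can be reproduced in isolation — a sketch where the explicit `.encode('ascii')` call triggers the same error on any Python version:

```python
# u'\xe1' is 'á', the character named in the traceback. The ascii codec
# only covers code points 0-127, so encoding it fails exactly the way
# the implicit str() call did inside nltk.Text.
try:
    u'\xe1'.encode('ascii')
except UnicodeEncodeError as err:
    print('cannot encode:', err.reason)

# Encoding to UTF-8 instead succeeds and yields the two-byte sequence.
print(u'\xe1'.encode('utf-8'))   # b'\xc3\xa1'
```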

I fixed the function that files the data like so:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0

However, that brought this error when I tried to file the German:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 23, in jotindex
    filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>> 

...which is what you get when you try to write the u'\n'.join'ed data.

>>> jottedf = u'/n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)

For each string that you read from your file, you can convert them to unicode by calling rawness.decode('utf-8'), if you have the text in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's a unicode object and use u'\n'.join(jotted) instead.
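For instance, decoding turns raw bytes into a unicode object whose length is counted in characters rather than bytes — a small sketch (the title fragment is just an illustration):

```python
raw = b'Also sprach \xc3\x9cbermensch'  # UTF-8 bytes; the 'Ü' takes two bytes
uni = raw.decode('utf-8')               # unicode object u'Also sprach \xdcbermensch'

# One character fewer than bytes, because the two-byte 'Ü' became one char.
print(len(raw))   # 23
print(len(uni))   # 22
```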

Update:

It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:

tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])

and this:

jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))

but if jotted is really a list of UTF-8-encoded str, then you don't need this and this should be enough:

jottedf = '\n'.join(jotted)
filemydata.write(jottedf)

By the way, it looks as though NLTK isn't very cautious with respect to unicode and encoding (at least in the demos). Better be careful and check that it has processed your tokens correctly. Also, check your encodings; this may be why you get errors with the Hungarian text and not the German text.
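One quick way to perform that check: try the candidate encodings in order and see which one actually decodes the raw bytes. A sketch — ISO-8859-2 is only a guess at what a Hungarian file might really be in, and the helper name is made up for illustration:

```python
def sniff_decode(raw, encodings=('utf-8', 'iso-8859-2')):
    # Try each candidate in turn. UTF-8 is strict enough that input in
    # another 8-bit encoding usually fails fast, so it is a good first guess.
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')

# b'sz\xe1m' ("szám") is valid ISO-8859-2 but invalid as UTF-8,
# so the sniffer falls through to the second candidate.
print(sniff_decode(b'sz\xe1m'))
```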
