
decode/encode problems

I currently have serious problems with decoding/encoding under Linux (Ubuntu). I never needed to deal with this before, so I have no idea why it doesn't work!

I'm parsing *.desktop files from /usr/share/applications/ and extracting information which is shown in the web browser via an HTTPServer. I'm using jinja2 for templating.

First, I received a UnicodeDecodeError at the call to jinja2.Template.render() which said that

utf-8 cannot decode character XXX at position YY [...]

So I made all values that come from my appfind module (which parses the *.desktop files) return unicode strings only.

The problem at this point was solved, but at some later point I am writing a string returned by a function to the BaseHTTPServer.BaseHTTPRequestHandler.wfile slot, and I can't get this error fixed, no matter what encoding I use.

At this point, the string that is written to wfile comes from jinja2.Template.render(), which, afaik, returns a unicode object.

The bizarre part is that it works on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, that might not be the reason: he has a lot more applications, and maybe some of them use encodings in their *.desktop files that raise the error.

However, I properly checked for the encoding in the *.desktop files:

data = dict(parser.items('Desktop Entry'))

try:
    encoding = data.get('encoding', 'utf-8')
    result = {
        'name':       data['name'].decode(encoding),
        'exec':       DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
        'type':       data['type'].decode(encoding),
        'version':    float(data.get('version', 1.0)),
        'encoding':   encoding,
        'comment':    data.get('comment', '').decode(encoding) or None,
        'categories': _filter_bool(data.get('categories', '').
                                        decode(encoding).split(';')),
        'mimetypes':  _filter_bool(data.get('mimetype', '').
                                        decode(encoding).split(';')),
    }

# ...

Can someone please enlighten me about how I can fix this error? I have read in an answer on SO that I should always use unicode(), but that would be a lot of pain to implement, and I don't think it would fix the problem when writing to wfile anyway?

Thanks,
Niklas

This is probably obvious, but anyway: wfile is an ordinary byte stream, so everything written to it must be encoded first (unicode.encode()).
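A minimal sketch of that encode-before-write step (io.BytesIO stands in for the real handler's wfile here, and the page content is made up):

```python
import io

wfile = io.BytesIO()                  # stand-in for the handler's wfile

page = u'<h1>T\xe9l\xe9phone</h1>'    # unicode, e.g. from Template.render()
wfile.write(page.encode('utf-8'))     # encode explicitly; never write unicode

print(repr(wfile.getvalue()))
```

Writing the unicode object directly would be the point where the implicit ascii codec kicks in and raises.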

Reading the OP, it is not clear to me what exactly is afoot. However, there are some tricks I have found helpful for debugging encoding problems that may help you. I apologize in advance if this is stuff you have long since transcended.

  • cat -v on a file will output all non-ascii characters as '^X', which is the only fool-proof way I have found to decide what encoding a file really has. UTF-8 non-ascii characters are multi-byte, which means they will show up in cat -v as sequences of more than one '^' entry.

  • The shell environment (LC_ALL, et al.) is, in my experience, the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available. Always set LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.

  • In bash and zsh, you can run a command with specific environment changes for just that command, like so:

     $ LC_ALL=sv_SE.utf8 python ./foo.py 

    This makes testing a lot easier than having to export different locales, and you won't pollute the shell.

  • Don't assume that you have unicode strings internally. Write assert statements that verify that your strings are unicode.

     assert isinstance(foo, unicode) 
  • Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is a latin-1 a-diaeresis ('ä'), and 'Ã¤' is the two UTF-8 bytes that make up that diaeresis, mistakenly interpreted as latin-1. I have found that knowing this sort of gorp cuts down on debugging encoding issues considerably.
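That last confusion can be reproduced in a couple of lines (a sketch; the sample character is arbitrary):

```python
s = u'\xe4'                           # 'ä', LATIN SMALL LETTER A WITH DIAERESIS
utf8_bytes = s.encode('utf-8')        # two bytes: 0xC3 0xA4
mangled = utf8_bytes.decode('latin-1')
print(repr(mangled))                  # the two bytes read back as u'\xc3\xa4'
```

Seeing two odd characters where one accented character should be is the signature of UTF-8 data run through a latin-1 decode.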

You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?

By default, when Python hits an encoding issue with unicode, it throws an error. However, this behavior can be modified, e.g. when the error is expected or not important.

Say you are converting between two encodings (code pages) that are supersets of ascii. Both have mostly the same characters, but there is no one-to-one correspondence. In that case, you may want to ignore errors.

To do so, use the errors argument of the encode function.

# utf-8 can encode any unicode string, so to actually see the error
# handlers in action, encode to a codec that cannot, e.g. ascii:
mystring = u'This is a test: \xe4'
print mystring.encode('ascii', 'ignore')
print mystring.encode('ascii', 'replace')
print mystring.encode('ascii', 'xmlcharrefreplace')
print mystring.encode('ascii', 'backslashreplace')

There are lots of issues with unicode if the wrong encodings are used when reading/writing. Make sure that after you get the unicode string, you convert it to the form of unicode desired by jinja2.
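The errors argument works on the decode side too, which guards against a *.desktop file whose declared encoding doesn't match its actual bytes (a sketch with made-up bytes):

```python
raw = b'caf\xe9'                      # latin-1 bytes, but suppose the file claims utf-8
text = raw.decode('utf-8', 'replace')
print(repr(text))                     # 0xe9 is invalid UTF-8 -> u'caf\ufffd'
```

You get a U+FFFD replacement character instead of a UnicodeDecodeError, so the rest of the page still renders.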

If this doesn't help, could you please add the second error you see, with perhaps a code snippet to clarify what's going on?

Try using .encode(encoding) instead of .decode(encoding) everywhere it occurs in your snippet.
