decode/encode problems

I currently have serious problems with decoding/encoding under Linux (Ubuntu). I have never needed to deal with this before, so I have no idea why it doesn't work!

I'm parsing *.desktop files from /usr/share/applications/ and extracting information that is shown in the web browser via an HTTPServer. I'm using jinja2 for templating.

First, I received a UnicodeDecodeError at the call to jinja2.Template.render() which said that

utf-8 cannot decode character XXX at position YY [...]

So I made all values that come from my appfind module (which parses the *.desktop files) return only unicode strings.

That solved the problem at this spot, but at another point I write a string returned by a function to the BaseHTTPServer.BaseHTTPRequestHandler.wfile attribute, and I can't get that error fixed, no matter what encoding I use.

At this point, the string that is written to wfile comes from jinja2.Template.render() which, afaik, returns a unicode object.
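
Roughly, the failing part looks like this (a simplified sketch; the names are illustrative, not my actual code):

html = template.render(apps=app_list)   # unicode object from jinja2
self.wfile.write(html)                  # this is where the error occurs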

The bizarre part is that it works on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, the version difference might not be the cause: he has a lot more applications installed, and maybe some of them use encodings in their *.desktop files that trigger the error.

However, I do properly check for the encoding declared in the *.desktop files:

# parser is a ConfigParser that has already read the *.desktop file;
# DKENTRY_EXECREPL (a compiled regex) and _filter_bool are defined
# elsewhere in the module.
data = dict(parser.items('Desktop Entry'))

try:
    # fall back to UTF-8 when the entry declares no Encoding key
    encoding = data.get('encoding', 'utf-8')
    result = {
        'name':       data['name'].decode(encoding),
        'exec':       DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
        'type':       data['type'].decode(encoding),
        'version':    float(data.get('version', 1.0)),
        'encoding':   encoding,
        'comment':    data.get('comment', '').decode(encoding) or None,
        'categories': _filter_bool(data.get('categories', '')
                                       .decode(encoding).split(';')),
        'mimetypes':  _filter_bool(data.get('mimetype', '')
                                       .decode(encoding).split(';')),
    }
except KeyError:
    # entry lacks a mandatory key; handling elided here
    result = None

Can someone please enlighten me on how to fix this error? I have read in an answer on SO that I should always use unicode(), but that would be a lot of pain to implement, and I don't think it would fix the problem when writing to wfile.

Thanks,
Niklas

This is probably obvious, but anyway: wfile is an ordinary byte stream, so every unicode string must be .encode()d before being written to it.
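
For example, a minimal sketch of the pattern (the template literal and port are stand-ins, not the question's actual code):

import BaseHTTPServer
import jinja2

# stand-in for the question's real template
template = jinja2.Template(u'<h1>{{ title }}</h1>')

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        html = template.render(title=u'caf\xe9')    # render() returns unicode
        payload = html.encode('utf-8')              # encode once, at the boundary
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)                   # write bytes, never unicode

if __name__ == '__main__':
    BaseHTTPServer.HTTPServer(('', 8000), Handler).serve_forever()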

Reading the OP, it is not clear to me what, exactly, is afoot. However, there are some tricks that I have found helpful when debugging encoding problems. I apologize in advance if this is stuff you have long since transcended.

  • cat -v on a file will display every non-ASCII byte in '^X'/'M-X' notation, which is the only fool-proof way I have found to determine what encoding a file really has. UTF-8 non-ASCII characters are multi-byte, which means they show up as sequences of more than one such entry in the output of cat -v.

  • The shell environment (LC_ALL, et al.) is in my experience the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available. Always set your LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.

  • In bash and zsh, you can run a command with specific environment changes for that command, like so:

     $ LC_ALL=sv_SE.utf8 python ./foo.py 

    This makes it a lot easier to test than having to export different locales, and you won't pollute the shell.

  • Don't assume that you have unicode strings internally. Write assert statements that verify that strings are unicode.

     assert isinstance(foo, unicode) 
  • Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is an a-diaeresis ('ä') in latin-1, and 'Ã¤' is the two UTF-8 bytes that make up an a-diaeresis, mistakenly displayed as latin-1. I have found that knowing this sort of gorp cuts debugging of encoding issues considerably.
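
To illustrate that last point, a quick interactive demonstration (Python 2, on a terminal that can display the characters):

>>> u'\xe4'.encode('utf-8')                   # the two UTF-8 bytes of an a-diaeresis
'\xc3\xa4'
>>> print u'\xe4'.encode('utf-8').decode('latin-1')   # misread as latin-1
Ã¤
>>> print '\xc3\xa4'.decode('utf-8')          # read back correctly as UTF-8
ä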

You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?

By default, when Python hits an encoding issue with unicode, it raises an error. However, this behavior can be modified, for instance when the error is expected or not important.

Say you are converting between two code pages that are supersets of ASCII. They both have mostly the same characters, but there is no one-to-one correspondence, so some characters cannot be converted. In that case, you may want to ignore the errors.

To do so, use the errors argument of the encode method:

# a string containing a character that does not exist in ASCII; with a
# pure-ASCII string all four handlers would behave identically
mystring = u'This is a test: \xe4'
print mystring.encode('ascii', 'ignore')             # This is a test: 
print mystring.encode('ascii', 'replace')            # This is a test: ?
print mystring.encode('ascii', 'xmlcharrefreplace')  # This is a test: &#228;
print mystring.encode('ascii', 'backslashreplace')   # This is a test: \xe4

There are lots of issues with unicode if the wrong encodings are used when reading/writing. Make sure that everything you pass to jinja2 is a unicode string, and encode the rendered result back to bytes with an explicit encoding before writing it out.

If this doesn't help, could you please add the second error you see, perhaps with a code snippet to clarify what's going on?

Try using .encode(encoding) instead of .decode(encoding) everywhere it occurs in your code snippet.
