
Why does Python not behave the same when printing Unicode strings to the console and to pipes?

After a couple of tests, I managed to reduce the problem to the minimal test.py script below:

# -*- coding: iso-8859-1 -*-
print u"Vérifier l'affichage de cette chaîne"

Note: test.py is encoded in ISO-8859-1 (i.e. Latin-1), so "é" is "\xe9" and "î" is "\xee".

D:\test>python --version
Python 2.7.3
D:\test>python test.py
Vérifier l'affichage de cette chaîne
D:\test>python test.py > test.log
Traceback (most recent call last):
  File "test.py", line 2, in <module>
    print u"VÚrifier l'affichage de cette cha¯ne"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

Here is the question:

Why does Python behave differently when printing Unicode strings, depending on whether its standard output goes to the console or is redirected or piped to something else?

First, ISO-8859-1 isn't a valid coding declaration. You want iso-8859-1. If you look at the docs, you can call this latin_1, iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, or L1, but not ISO-8859-1.

It looks like codecs.lookup bends over backward to accept bad input, including doing case-insensitive lookups. If you trace codecs.lookup through _codecs.lookup to _PyCodec_Lookup, you can see this comment:

/* Convert the encoding to a normalized Python string: all
   characters are converted to lower case, spaces and hyphens are
   replaced with underscores. */
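
A quick way to see that runtime normalization in action is the sketch below. It only exercises codecs.lookup at runtime, not the source-file coding declaration; the SPELLINGS list and the canonical_names helper are names made up for this illustration.

```python
import codecs

# Several spellings of the same codec name. codecs.lookup normalizes the
# name at runtime, so case and hyphen/underscore differences do not matter.
SPELLINGS = ["ISO-8859-1", "iso-8859-1", "latin_1", "Latin-1", "L1"]


def canonical_names(spellings):
    """Return the set of canonical codec names the spellings resolve to."""
    return {codecs.lookup(name).name for name in spellings}


# All spellings should collapse to a single canonical name.
print(canonical_names(SPELLINGS))
```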

But source file decoding doesn't go through the same codec lookup process. Because it happens at compile time rather than runtime, there's no reason for it to do so. (At any rate, saying "It seems to work, even though the docs say it's wrong… so why doesn't it quite work right?" is kind of silly in the first place.)

To demonstrate, if I create two Latin-1 files:

badcode.py:

# -*- coding: ISO-8859-1 -*-
print u"Vérifier l'affichage de cette chaîne"

goodcode.py:

# -*- coding: iso-8859-1 -*-
print u"Vérifier l'affichage de cette chaîne"

The first one fails, the second succeeds.

Now, why does it "work" when it's going to console but raise an exception when piped?

Well, when you print to a Windows console, or a Unix TTY, Python has some code to try to guess the right encoding to use. (I'm not sure what happens under the covers on Windows; it might even be using UTF-16 output, for all I know.) When you're not printing to a console/TTY, it can't do this, so you have to specify the encoding explicitly.

You can see some of what's going on by looking at sys.stdout.isatty(), sys.stdout.encoding, and sys.getdefaultencoding(). Here's what I see on a Mac in different cases (the three values in that order, followed by what print produces):

  • Python 2, no redirect: True, UTF-8, ascii, Vérifier
  • Python 3, no redirect: True, UTF-8, utf-8, Vérifier
  • Python 2, redirect: False, None, ascii, UnicodeEncodeError
  • Python 3, redirect: False, UTF-8, utf-8, Vérifier
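
If you want to collect the same values on your own machine, something like the following should do. The describe_stdout helper is a hypothetical name for this sketch, and the values you get depend on your platform, Python version, and locale.

```python
import sys


def describe_stdout(stream=None):
    """Collect the three values discussed above for a given stream."""
    stream = sys.stdout if stream is None else stream
    return {
        "isatty": stream.isatty(),
        # Not every file-like object has an encoding attribute.
        "encoding": getattr(stream, "encoding", None),
        "defaultencoding": sys.getdefaultencoding(),
    }


info = describe_stdout()
print(info)
```

Run it once directly in the terminal and once with stdout redirected to a file, then compare the two reports.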

If isatty() is true, encoding will be an appropriate encoding for the TTY; otherwise, encoding will be the default value, which is None (meaning ascii) in 2.x, and (I think; I'd have to check the code) something based on getdefaultencoding() in 3.x. This means that if you try to print Unicode while stdout is not a TTY in 2.x, it will try to encode it as ascii, strict, which will fail if you've got non-ASCII characters.
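
The failure itself can be reproduced without any redirection by doing the encoding step by hand. This is only a sketch of the behaviour described above, not Python's actual print implementation; encode_for_stream is a made-up helper that falls back to ascii when no encoding is given.

```python
TEXT = u"V\xe9rifier l'affichage de cette cha\xeene"


def encode_for_stream(text, encoding=None, errors="strict"):
    """Encode text for output, falling back to ascii when the stream
    reports no encoding (the 2.x redirect case described above)."""
    return text.encode(encoding or "ascii", errors)


try:
    # encoding=None stands in for a redirected stdout in 2.x.
    encode_for_stream(TEXT)
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("strict ascii failed at position %d" % exc.start)
```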

If you somehow know what codec you want to use, you can deal with this manually by checking isatty() and encoding to that codec (or even to ascii, ignore instead of strict, if you prefer) whenever you print, instead of trying to print Unicode. (If you know what codec you want, you may want to do this even in 3.x: defaulting to UTF-8 isn't too helpful if you're trying to generate, say, Windows-1252 files…)
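
A minimal sketch of that manual approach might look like this. The to_bytes_for helper and its fallback parameter are assumptions made for illustration; choosing UTF-8 as the fallback is just one possible choice, not something the interpreter does for you in 2.x.

```python
import sys


def to_bytes_for(stream, text, fallback="utf-8", errors="strict"):
    """Encode text for a stream: trust the stream's own encoding when it
    is a TTY that reports one, otherwise use the caller's fallback codec."""
    encoding = getattr(stream, "encoding", None)
    if not (stream.isatty() and encoding):
        encoding = fallback
    return text.encode(encoding, errors)


# errors="replace" keeps this demo from raising on a limited terminal codec.
print(repr(to_bytes_for(sys.stdout, u"V\xe9rifier", errors="replace")))
```

In 2.x you would print the returned byte string directly; in 3.x you would write it to sys.stdout.buffer instead.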

The difference there actually has nothing to do with Latin-1. Try this out:

nocode.py:

print u"V\xe9rifier l'affichage de cette cha\xeene"
print u"V\u00e9rifier l'affichage de cette cha\u00eene"

I get the Unicode strings encoded to UTF-8 for my Mac terminal, and (apparently) Windows-1252 to my Windows cmd window, but an exception redirecting to a file.
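
To make the difference between those terminal encodings concrete, here is a small sketch that encodes the same Unicode string with each codec. The codec list is picked for illustration only; which one your terminal actually uses depends on your setup.

```python
TEXT = u"V\xe9rifier"

# The same Unicode string becomes different bytes under different codecs.
encoded = {codec: TEXT.encode(codec) for codec in ("utf-8", "cp1252", "latin-1")}

for codec in sorted(encoded):
    print("%-8s %r" % (codec, encoded[codec]))
```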

Since I came here looking for the "don't be smart" switch for Python's print(), and the answer above only points to read-only variables, here's the "make Python believe stdout can handle UTF-8" snippet:

import sys, codecs

# Put this in the function that needs it, or at the top of main().
# Note: /dev/stdout only exists on Unix-like systems.
sys.stdout = codecs.open('/dev/stdout', encoding='utf-8', mode='w', errors='strict')

There, now Python doesn't care whether it's a TTY, tee(1), file redirection, or just cat(1) for the heck of it.
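
Since /dev/stdout is not available everywhere, a variant that wraps the existing stream with codecs.getwriter may be more portable. This is a sketch under assumptions, not a drop-in replacement: utf8_writer and install_utf8_stdout are hypothetical helper names, and the demo writes to an in-memory buffer so it does not touch the real stdout.

```python
import codecs
import io
import sys


def utf8_writer(binary_stream, errors="strict"):
    """Wrap a byte stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream, errors=errors)


def install_utf8_stdout():
    """Replace sys.stdout with a UTF-8 writer (not called in this demo).

    In 3.x the byte layer is sys.stdout.buffer; in 2.x sys.stdout itself
    accepts bytes, hence the getattr fallback.
    """
    sys.stdout = utf8_writer(getattr(sys.stdout, "buffer", sys.stdout))


# Demonstration on an in-memory byte stream.
buf = io.BytesIO()
out = utf8_writer(buf)
out.write(u"V\xe9rifier l'affichage de cette cha\xeene\n")
```

After the write, buf holds the UTF-8 bytes regardless of whether the process is attached to a terminal.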
