
Python 'ascii' encode problems in print statement

System: Python 3.4.2 on Linux.

I'm working on a Django application (irrelevant here), and I ran into a problem where it throws

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

when print is called (!). After quite a bit of digging, I discovered I should check

>>> sys.getdefaultencoding()
'utf-8'

but it was UTF-8, as expected. I also noticed that os.path.exists throws the same exception when given a string containing non-ASCII characters. So I checked

>>> sys.getfilesystemencoding()
'ascii'

When I set LANG=en_US.UTF-8 the issue disappeared. I understand now why os.path.exists had problems. But I have absolutely no clue why print is affected by the filesystem setting. Is there a third setting I'm missing? Or does Python just assume the LANG environment variable is to be trusted for everything?

Also, I don't get the reasoning here. LANG does not say which encoding the filenames use. It has nothing to do with that: it is set separately for the current environment, not for the filesystem. Why is Python using this setting for filenames? It makes applications very fragile, because all file operations break when run in an environment where LANG is unset or set to C (not uncommon, especially when a web app is run as root or as a new user created specifically for the daemon).

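Test code (it uses escaped bytes rather than literal Unicode input, to avoid terminal encoding pitfalls):

# UTF-8 bytes for U+010C and U+017D
x = b'\xc4\x8c\xc5\xbd'
y = x.decode('utf-8')
# Raises UnicodeEncodeError when stdout is configured with the ascii codec
print(y)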

Questions:

  • Is there a good and accepted way of making the application robust to the LANG setting?
  • Is there any real-world reason to guess the filesystem capabilities from the environment instead of from the filesystem driver?
  • Why is print affected?

LANG is used to determine your locale; if you don't set specific LC_ variables, the LANG variable is used as the default.

The filesystem encoding is determined by the LC_CTYPE variable, but if you haven't set that variable specifically, the LANG environment variable is used instead.
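The fallback described above can be sketched in a few lines. This is only an illustration, not how CPython itself resolves the locale (that happens in the C library): the helper names effective_ctype_setting and report_encodings are made up for this example, and the sketch also assumes the usual POSIX rule that LC_ALL, when set, overrides both LC_CTYPE and LANG, with "C" as the fallback when nothing is set.

```python
import locale
import os
import sys


def effective_ctype_setting(env):
    """Return the first non-empty value among LC_ALL, LC_CTYPE and LANG.

    Hypothetical helper: it only mimics the usual POSIX precedence and
    falls back to 'C' when none of the variables is set.
    """
    for name in ('LC_ALL', 'LC_CTYPE', 'LANG'):
        value = env.get(name)
        if value:  # an empty string is treated like an unset variable
            return value
    return 'C'


def report_encodings():
    """Collect the encoding-related settings discussed in this thread."""
    return {
        'ctype_setting': effective_ctype_setting(os.environ),
        'default': sys.getdefaultencoding(),
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


if __name__ == '__main__':
    for key, value in sorted(report_encodings().items()):
        print('{}: {}'.format(key, value))
```

Running this once with LANG=C and once with LANG=en_US.UTF-8 should show which of the reported values follow the environment. The exact results depend on the Python version and platform, so none are listed here.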

Printing uses sys.stdout, a text file configured with the codec your terminal uses. Your terminal settings are also locale-specific; your LANG variable should really reflect what locale your terminal is set to. If that is UTF-8, you need to make sure your LANG variable reflects that. sys.stdout uses locale.getpreferredencoding(False) (like all text streams opened without an explicit encoding) and on POSIX systems that will use LC_CTYPE too.
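To see why print depends on the stream's codec rather than on the string itself, the behaviour can be reproduced without touching the real terminal. The sketch below is an illustration under assumptions: write_with_encoding is a hypothetical helper, and an in-memory io.TextIOWrapper stands in for sys.stdout.

```python
import io

# The two characters from the question's test code
text = b'\xc4\x8c\xc5\xbd'.decode('utf-8')


def write_with_encoding(s, encoding):
    """Print s to an in-memory text stream using the given codec.

    Hypothetical helper: returns the bytes that reached the underlying
    binary buffer. newline='\n' disables platform newline translation.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline='\n')
    print(s, file=stream)
    stream.flush()
    return raw.getvalue()


# A stream configured like stdout under LANG=C rejects the text
try:
    write_with_encoding(text, 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same text goes through once the stream uses UTF-8
utf8_bytes = write_with_encoding(text, 'utf-8')

print('ascii stream failed:', ascii_failed)
print('utf-8 bytes:', utf8_bytes)
```

One possible way to apply the same idea to a real program is to wrap sys.stdout.buffer in an io.TextIOWrapper with encoding='utf-8', or to set the PYTHONIOENCODING=utf-8 environment variable before starting the interpreter. Both affect only the standard streams; neither changes what sys.getfilesystemencoding() reports.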


 