
Python 3 bytes decode with non-ASCII characters in CGI script

I have a very short code sample:

print("Content-Type: text/plain; charset=utf-8")
print("Access-Control-Allow-Origin: *")
print()

x = 'Chloë'.encode()
print(x)
print(x.decode())

Notice the non-ASCII ë, which is the source of all the problems.

Calling the script in bash with python3 ./test.py produces the following (correct) output:

Content-Type: text/plain; charset=utf-8
Access-Control-Allow-Origin: *

b'Chlo\xc3\xab'
Chloë

However, when calling it from the browser, the last line is missing (the headers aren't visible, of course, but they are present). So the only visible part is:

b'Chlo\xc3\xab'

Do you know where the problem could be?

You are printing Unicode to the sys.stdout handle (which is the default file object print() writes to). That object then has to encode your data again, but it has to do so based on the environment that it is connected to.

When you run python3 ./test.py, you are connected to your terminal or console, which is usually configured to tell scripts what codec is appropriate. On POSIX systems (Linux, Mac) you can run the locale command to see what that configuration is. In your console locale there is no problem displaying a non-ASCII codepoint like ë.
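To see which codec Python has actually picked, the same check can be made from inside the script. This is a minimal sketch; the helper name describe_stdout_encoding is made up for illustration and is not part of the original script.

```python
import locale
import sys


def describe_stdout_encoding():
    """Return the codec print() will use and the codec the locale suggests."""
    return {
        # Codec the sys.stdout text wrapper uses when encoding on write.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Codec derived from the locale settings in the environment.
        "locale_preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_stdout_encoding().items():
        print(f"{name}: {value}")
```

Running this once in the terminal and once under the web server should show whether the two environments disagree about the codec.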

But when running as a CGI script connected to a webserver, there is no such language configuration present, and Python almost certainly has fallen back to the lowest common denominator instead: ASCII. When this is the case, trying to print non-ASCII text will result in an exception:

$ LC_ALL="en_US.UTF-8" python3 -c "print(b'Chlo\xc3\xab'.decode())"
Chloë
$ LC_ALL="C" python3 -c "print(b'Chlo\xc3\xab'.decode())"  # C => "no locale set"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xeb' in position 4: ordinal not in range(128)

Because the exception takes place only after the headers and all the other output have been produced, you don't see an HTTP error code. The exception should have been logged in your server error logs, however.
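The partial output can be reproduced without a web server by writing to an in-memory stream that encodes to ASCII, the way sys.stdout would under an ASCII fallback. This is only a sketch: emulate_ascii_stdout is a hypothetical helper, and the BytesIO object merely stands in for the connection to the server. It may be useful on Python 3.7 and later, where the interpreter may pick UTF-8 on its own under the C locale, so the shell transcript above may not fail there.

```python
import io


def emulate_ascii_stdout(lines):
    """Print each item to an ASCII-only text stream.

    Returns the bytes that reached the stream and the encoding error, if any.
    """
    raw = io.BytesIO()  # stands in for the pipe to the web server
    out = io.TextIOWrapper(raw, encoding="ascii", newline="\n")
    error = None
    try:
        for line in lines:
            print(line, file=out)
    except UnicodeEncodeError as exc:
        # Everything printed before this point has already been accepted.
        error = exc
    out.flush()
    return raw.getvalue(), error


x = "Chloë".encode()
written, error = emulate_ascii_stdout([x, x.decode()])
print(written)
print(type(error).__name__)
```

The bytes repr gets through because it is pure ASCII; the decoded string is what triggers the error, which matches what the browser shows.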

If your script is to output UTF-8 to the browser, as configured in the Content-Type header you emit, replace sys.stdout to force that codec:

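import sys
from io import TextIOWrapper

# Detach the underlying binary stream and wrap it in a new text layer
# that always encodes to UTF-8, regardless of the locale.
sys.stdout = TextIOWrapper(sys.stdout.buffer.detach(), encoding='utf8')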

In Python 3, text files like the one used for the sys.stdout stream contain a buffer object, which in turn contains a binary file object that takes care of the actual binary data writing. The outer text file object is really only responsible for encoding on write. The above replaces that outer object with a different one that always encodes to UTF-8.
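The layering can be sketched with in-memory objects. Here a BytesIO stands in for the binary file object, and rewrap_as_utf8 is a hypothetical helper name. Note that this sketch detaches the text layer only and keeps the buffer, whereas the snippet above detaches the buffer as well.

```python
import io


def rewrap_as_utf8(text_stream):
    """Swap the outer text layer for one that always encodes to UTF-8."""
    text_stream.flush()
    binary = text_stream.detach()  # the buffered binary layer underneath
    return io.TextIOWrapper(binary, encoding="utf-8", newline="\n")


raw = io.BytesIO()                 # stands in for the binary file object
buffered = io.BufferedWriter(raw)  # plays the role of sys.stdout.buffer
ascii_text = io.TextIOWrapper(buffered, encoding="ascii", newline="\n")

utf8_text = rewrap_as_utf8(ascii_text)
print("Chloë", file=utf8_text)     # would raise with the ASCII wrapper
utf8_text.flush()
payload = raw.getvalue()           # the UTF-8 bytes that reached the bottom layer
```

On Python 3.7 and later, calling sys.stdout.reconfigure(encoding="utf-8") is likely a simpler way to get the same effect on the real stream without replacing the object.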
