Python: HTML-escaping UTF-8 input from the command line

Question

I wrote a simple support script that converts strings from stdin into their HTML-encoded form:

#!/usr/bin/env python
import cgi
import fileinput


for line in fileinput.input():
    print cgi.escape(line).encode('ascii', 'xmlcharrefreplace')

This does exactly what I need:

$ echo "AA<>BB"|htmlescape
AA&lt;&gt;BB

However, the tool crashes as soon as the input contains even a simple non-ASCII character:

$ echo "AA<>BBeëCC"|htmlescape
Traceback (most recent call last):
  File "/home/remco/bin/htmlescape", line 7, in <module>
    print cgi.escape(line).encode('ascii', 'xmlcharrefreplace')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)

How can I make the script accept non-ASCII characters?

Answer 1

You are trying to encode a byte string:

>>> 'AA<>BBeëCC'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

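Python is trying to be helpful here. Only Unicode strings can be encoded to bytes, so Python 2 first implicitly decodes the byte string using the default codec, ASCII. That implicit decode is what fails, which is why an encode call raises a UnicodeDecodeError.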
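
The hidden step can be reproduced by hand. The following is a minimal sketch, written for Python 3 as an assumption (the script in the question is Python 2), where bytes never decode implicitly and the failing step has to be spelled out:

```python
# Python 3 sketch; the variable names are illustrative only.
raw = b"AA<>BBe\xc3\xabCC"  # UTF-8 bytes, as they would arrive on a pipe

# What Python 2 attempts behind the scenes before it can encode:
try:
    raw.decode("ascii")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first byte ASCII cannot decode

# Naming the right codec makes the decode succeed.
text = raw.decode("utf-8")

print(failed_at)
print(text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
```

The reported index is that of the 0xc3 byte, the first half of the UTF-8 encoding of "ë".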

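You have to decode explicitly first, using a suitable codec. Because the input arrives through a pipe, Python cannot detect the input codec. You have to specify it yourself: hard-code it in the script, pass it as a command-line option, or set it through an environment variable.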
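
As an illustration of the command-line route, here is a rough sketch in Python 3. The --encoding flag and the build_parser helper are hypothetical names introduced for this example; they are not part of the original script.

```python
import argparse
import locale


def build_parser():
    """Build a parser with a hypothetical --encoding option."""
    parser = argparse.ArgumentParser(description="HTML-escape lines read from stdin")
    # --encoding is an assumed flag; it falls back to the locale's codec.
    parser.add_argument(
        "--encoding",
        default=locale.getpreferredencoding(False),
        help="codec used to decode the piped input",
    )
    return parser


# Simulate calling the tool as: htmlescape --encoding latin-1
args = build_parser().parse_args(["--encoding", "latin-1"])
print(args.encoding)
```

The value of args.encoding would then be passed to decode() in place of a hard-coded codec name.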

Assuming your console is configured for UTF-8, you can use:

print cgi.escape(line.decode('utf8')).encode('ascii', 'xmlcharrefreplace')

Demo:

>>> 'AA<>BBeëCC'.decode('utf8').encode('ascii', 'xmlcharrefreplace')
'AA<>BBe&#235;CC'

You can use the locale.getpreferredencoding() function to introspect the codec your terminal is configured to use:

#!/usr/bin/env python
import cgi
import fileinput
import locale


codec = locale.getpreferredencoding()


for line in fileinput.input():
    line = line.decode(codec)
    print cgi.escape(line).encode('ascii', 'xmlcharrefreplace')

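This way the script always matches whatever codec your terminal uses for input, and you can also override the codec with an environment variable: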

echo "latin text" | LC_CTYPE='en_US.ISO-8859-1' htmlescape

This tells Python to decode the input with the Latin-1 codec.
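
For readers on Python 3, where cgi.escape is no longer available, the same decode, escape, encode pipeline might look like the sketch below. It assumes html.escape with quote=False as a stand-in for cgi.escape; escape_stream is a hypothetical helper, and io.BytesIO stands in for the binary stdin stream.

```python
import html
import io


def escape_stream(byte_lines, codec):
    """Yield each input line HTML-escaped and reduced to ASCII-only text."""
    for raw in byte_lines:
        text = raw.decode(codec)                  # bytes -> text, codec named explicitly
        escaped = html.escape(text, quote=False)  # replaces &, < and > only
        # Any remaining non-ASCII character becomes a numeric character reference.
        yield escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


line = "AA<>BBe\u00ebCC\n"  # \u00eb is the letter e with diaeresis

# The same text encoded two ways, decoded with the matching codec each time.
utf8_result = list(escape_stream(io.BytesIO(line.encode("utf-8")), "utf-8"))
latin1_result = list(escape_stream(io.BytesIO(line.encode("latin-1")), "latin-1"))

print(utf8_result[0], end="")
```

In an actual script, byte_lines would be sys.stdin.buffer and codec would come from locale.getpreferredencoding(), as in the answer above. Both inputs produce the same escaped output because each is decoded with the codec it was encoded in.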

