
Python: UnicodeEncodeError when reading from stdin

When running a Python program that reads from stdin, I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 320: ordinal not in range(128)

How can I fix it?

Note: The error occurs inside antlr, and the offending line looks like this:

        self.strdata = unicode(data)

Since I don't want to modify the antlr source code, I'd like to pass in something it can accept.

The input code looks like this:

#!/usr/bin/python
import sys
import codecs
import antlr3
import antlr3.tree
from LatexLexer import LatexLexer
from LatexParser import LatexParser


char_stream = antlr3.ANTLRInputStream(codecs.getreader("utf8")(sys.stdin))
lexer = LatexLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = LatexParser(tokens)
r = parser.document()

The problem is that the bytes read from stdin get decoded implicitly (here by the unicode(data) call), and Python uses the system default encoding for that:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

The input is very likely UTF-8 or Windows-1252 (CP1252), so the program chokes on non-ASCII characters.
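To make the failure concrete, here is a minimal sketch that reproduces the same kind of error outside antlr. The sample bytes are my own assumption (the UTF-8 encoding of "caf" followed by e-acute), not the actual input from the question:

```python
# Assumed sample input: UTF-8 bytes for "caf" + e-acute (0xc3 0xa9).
raw = b"caf\xc3\xa9"

try:
    # Decoding with the ASCII codec is what the implicit conversion amounts to.
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    # The codec stops at the first byte outside range(128), here 0xc3.
    error = exc

# Decoding with the encoding the data was really written in succeeds.
text = raw.decode("utf-8")
```

The byte 0xc3 in the original traceback is a typical lead byte of a two-byte UTF-8 sequence, which is a hint (not proof) that the input was UTF-8.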

To wrap sys.stdin in a stream with the proper decoder, I used:

import codecs
import sys
char_stream = codecs.getreader("utf-8")(sys.stdin)

That fixed the problem.
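For anyone who wants to try the wrapper without piping data in, here is a small sketch. It uses io.BytesIO as a stand-in for the byte stream; that stand-in and the sample text are assumptions for illustration. In Python 2 the byte stream would be sys.stdin itself, while in Python 3 it would be sys.stdin.buffer, since sys.stdin is already a text stream there.

```python
import codecs
import io

# Stand-in for a byte-oriented stdin (assumed sample data, UTF-8 encoded).
byte_stream = io.BytesIO(b"Stra\xc3\x9fe\ncaf\xc3\xa9\n")

# codecs.getreader returns a StreamReader class for the codec;
# calling it wraps the byte stream so that read() yields decoded text.
char_stream = codecs.getreader("utf-8")(byte_stream)

text = char_stream.read()
lines = text.splitlines()
```

Everything read through char_stream is already unicode text, so a later unicode(data) call has nothing left to decode.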

BTW, this is how ANTLR's FileStream opens a file when it is given a filename (instead of a stream):

    fp = codecs.open(fileName, 'rb', encoding)
    try:
        data = fp.read()
    finally:
        fp.close()
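If you want to do the same read in your own code, a rough equivalent is sketched below. Note that I swapped in io.open with an explicit encoding instead of codecs.open; the helper name read_text, the temporary file, and its content are all made up for the example and are not part of ANTLR.

```python
import io
import os
import tempfile


def read_text(file_name, encoding="utf-8"):
    """Hypothetical helper: read a whole file and return decoded text."""
    # io.open decodes while reading, so the caller never sees raw bytes.
    with io.open(file_name, "r", encoding=encoding) as fp:
        return fp.read()


# Write assumed sample content as raw UTF-8 bytes, then read it back as text.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as raw:
        raw.write(b"\\section{Caf\xc3\xa9}\n")
    data = read_text(path)
finally:
    os.remove(path)
```

The with statement does the same job as the try/finally in the snippet above: the file is closed even if read() raises.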

BTW #2: For strings, I found

a_string.encode(encoding)

useful.
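A short sketch of what encode does, assuming a_string is a unicode string (the sample value is my own). The same text becomes different byte sequences depending on the encoding:

```python
# Assumed sample text: "caf" + e-acute, written with an escape to stay ASCII-safe.
a_string = u"caf\u00e9"

utf8_bytes = a_string.encode("utf-8")      # e-acute becomes two bytes: c3 a9
cp1252_bytes = a_string.encode("cp1252")   # e-acute becomes one byte: e9

# decode is the inverse, as long as the same encoding is used.
roundtrip = utf8_bytes.decode("utf-8")
```

Decoding utf8_bytes as cp1252 would not raise, but it would give mojibake, so the encoding has to be known rather than guessed.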

You're not getting this error on input; you're getting it when trying to output the data you read. You should decode the data as you read it and pass unicode objects around, instead of dealing with byte strings the whole time.
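As a rough illustration of that advice (decode at the input boundary, work on text, encode at the output boundary), here is a sketch. The function name shout, the io.BytesIO streams, and the sample data are assumptions for the example:

```python
import io


def shout(byte_in, byte_out, encoding="utf-8"):
    """Hypothetical filter: upper-case a byte stream's text content."""
    text = byte_in.read().decode(encoding)    # decode once, on the way in
    result = text.upper()                      # all processing happens on text
    byte_out.write(result.encode(encoding))   # encode once, on the way out


# In-memory stand-ins for byte-oriented stdin and stdout.
source = io.BytesIO(b"caf\xc3\xa9\n")
sink = io.BytesIO()
shout(source, sink)
output = sink.getvalue()
```

Because upper() runs on decoded text, the accented letter is upper-cased as one character instead of being treated as two unrelated bytes.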

Here is an excellent write-up about how Python handles encodings:

How to use UTF-8 with Python

