How to prevent "UnicodeDecodeError" when reading piped input from sys.stdin?

I am reading mostly hex input into a Python 3 script. However, the system is set to use UTF-8, and when piping from a Bash shell into the script, I keep getting the following UnicodeDecodeError:

UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

Following other SO answers, I'm using sys.stdin.read() in Python 3 to read the piped input, like this:

import sys
...
isPipe = 0
if not sys.stdin.isatty() :
    isPipe = 1
    try:
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)
...

It works when piping like this:

# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>

However, piping the raw bytes does not:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"

    ▒▒▒
   ▒▒

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

I also tried other promising SO answers:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

From what I have learned so far, when your terminal encounters the start of a multi-byte UTF-8 sequence, it expects it to be followed by 1-3 other bytes, like this:

UTF-8 is a variable-width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes. So a leading byte (the first byte of a multi-byte UTF-8 character, in the range 0xC2 - 0xF4) must be followed by 1-3 continuation bytes in the range 0x80 - 0xBF.
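To make the rule above concrete: in the failing input, 0xed is a leading byte, but the byte that follows it, 0xff, lies outside the continuation range 0x80 - 0xBF, which is what the error message reports. The following is only a minimal sketch for inspecting such input; the helper names `is_continuation` and `explain_utf8_failure` are made up for illustration and are not part of the question.

```python
def is_continuation(byte: int) -> bool:
    """True if byte is in the UTF-8 continuation range 0x80-0xBF."""
    return 0x80 <= byte <= 0xBF


def explain_utf8_failure(data: bytes):
    """Return None if data is valid UTF-8, else a short description of the failure."""
    try:
        data.decode('utf-8')
        return None
    except UnicodeDecodeError as e:
        # e.start is the index of the first offending byte; e.reason is the codec's message
        return f"byte 0x{data[e.start]:02x} at position {e.start}: {e.reason}"


raw = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
print(is_continuation(raw[1]))    # is the byte after 0xed a continuation byte?
print(explain_utf8_failure(raw))  # describes where decoding stops
```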

However, I cannot always be sure where my input stream comes from, and it may very well be raw data rather than the ASCII hex-encoded version above. So I need to deal with this raw input somehow.

I've looked at a few alternatives, but I don't know if or how they could read a piped input stream the way I want.

How can I make my script also handle a raw byte stream?

PS. Yes, I have read loads of similar SO questions, but none of them adequately deals with this UTF-8 input error. The best one is this one.

This is not a duplicate.

I finally managed to work around this issue by not using sys.stdin!

Instead I used with open(0, 'rb'), where:

  • 0 is the file descriptor for stdin.
  • 'rb' opens it for reading in binary mode.

This seems to circumvent the problem of the piped bytes being decoded using your locale's character encoding. I got the idea after seeing that the following worked and returned the correct (non-printable) characters:

echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"

▒▒▒
   ▒▒

So to correctly read any pipe data, I used:

if not sys.stdin.isatty() :
    try:
        with open(0, 'rb') as f: 
            inpipe = f.read()

    except Exception as e:
        err_unknown(e)        
    # This can't happen in binary mode:
    #except UnicodeDecodeError as e:
    #    err_unicode(e)
...

That will read your pipe data into a Python bytes object.

The next problem was to determine whether the pipe data was coming from a character string (like echo "BADDATA0") or from a binary stream. The latter can be emulated by echo -ne "\\xBA\\xDD\\xAT\\xA0", as shown in the question. In my case I just used a regex to look for any bytes that are not hex digits or spaces.

import re

if inpipe:
    # Match any run of bytes that is not a hex digit or a space.
    rx = re.compile(b'[^0-9a-fA-F ]+')
    r = rx.findall(inpipe.strip())
    if r == []:
        print("is probably a HEX ASCII string")
    else:
        print("is something else, possibly binary")

Surely this could be done better and smarter. (Feel free to comment!)
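One possible variation on the check above, offered only as a sketch: wrap it in a function that returns a label instead of printing, and match the whole input against the allowed set rather than searching for offending bytes. The function name `classify_pipe_data` and the label strings are made up for illustration, and this version also accepts any ASCII whitespace between hex digits, which is an assumption beyond the original pattern.

```python
import re

# Whole input must consist of hex digits and ASCII whitespace only.
HEX_ONLY = re.compile(rb'[0-9a-fA-F\s]+')


def classify_pipe_data(data: bytes) -> str:
    """Label piped bytes as 'empty', 'hex-ascii' or 'binary' (hypothetical labels)."""
    stripped = data.strip()
    if not stripped:
        return "empty"
    if HEX_ONLY.fullmatch(stripped):
        return "hex-ascii"
    return "binary"


print(classify_pipe_data(b"BADDA7A0\n"))
print(classify_pipe_data(b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"))
```

Like the original, this is a heuristic: a binary stream that happens to contain only hex-digit bytes would still be labelled as hex ASCII.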


Addendum (quoting the Python documentation for open()):

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r', which means open for reading in text mode. In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, synonym of 'rt'). For binary read-write access, the mode 'w+b' opens and truncates the file to 0 bytes. 'r+b' opens the file without truncation.

... Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default), otherwise an error will be raised.
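The closefd behaviour quoted above can be checked without touching the real stdin by using a pipe as a stand-in. This is a rough sketch under that assumption; `read_fd_keep_open` and `demo` are hypothetical names chosen for the example.

```python
import os


def read_fd_keep_open(fd: int) -> bytes:
    """Read everything from fd in binary mode without closing fd afterwards."""
    with open(fd, "rb", closefd=False) as f:
        return f.read()


def demo() -> bytes:
    r, w = os.pipe()                   # a pipe standing in for piped stdin
    os.write(w, b"\xed\xff\xff\x0b")   # bytes that are not valid UTF-8
    os.close(w)                        # signal end of input to the reader
    data = read_fd_keep_open(r)
    os.fstat(r)                        # would raise OSError if r had been closed
    os.close(r)
    return data


print(demo())
```

With the default closefd=True, leaving the with block would close the descriptor itself, which matters if other code still expects to use stdin later.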

Here is a hacky way to read stdin in binary, like a file:

import sys

with open(sys.stdin.fileno(), mode='rb', closefd=False) as stdin_binary:
    raw_input = stdin_binary.read()
try:
    # text is the string formed by decoding raw_input as UTF-8
    text = raw_input.decode('utf-8')
except UnicodeDecodeError:
    # raw_input is not valid UTF-8, do something else with it
    text = None

Use sys.stdin.buffer.raw instead of sys.stdin.
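As a sketch of that idea: sys.stdin is normally a text wrapper whose .buffer attribute is the underlying binary stream, and .buffer.raw is the unbuffered layer beneath it. The example below reads from .buffer rather than .buffer.raw, a small substitution that keeps the buffering. The function name `read_all_bytes` is made up for illustration, and the optional `stream` parameter exists only so that the function can be tried without a real pipe.

```python
import io
import sys


def read_all_bytes(stream=None) -> bytes:
    """Read all remaining input as bytes, bypassing text decoding."""
    if stream is None:
        stream = sys.stdin
    # Text streams expose their binary layer as .buffer; binary streams are used as-is.
    binary = getattr(stream, "buffer", stream)
    return binary.read()


# Stand-in for a piped stdin carrying bytes that are not valid UTF-8.
fake_stdin = io.TextIOWrapper(io.BytesIO(b"\xed\xff\xff\x0b"), encoding="utf-8")
print(read_all_bytes(fake_stdin))
```

In a script, calling read_all_bytes() with no argument would read from the real sys.stdin.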
