How to prevent "UnicodeDecodeError" when reading piped input from sys.stdin?
I am reading some mainly HEX input into a Python3 script. However, the system is set to use UTF-8, and when piping from the Bash shell into the script I keep getting the following UnicodeDecodeError:

UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)
I'm using sys.stdin.read() in Python3 to read the piped input, according to other SO answers, like this:
import sys
...
isPipe = 0
if not sys.stdin.isatty():
    isPipe = 1
    try:
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)
...
It works when piping this way:
# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>
However, using the raw format doesn't:
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"
▒▒▒
▒▒
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)
I also tried other promising SO answers:
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
From what I have learned so far, when your terminal encounters a UTF-8 sequence, it expects the lead byte to be followed by 1-3 other bytes:

UTF-8 is a variable width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes.

So a leading byte (first UTF-8 byte, in the range 0xC2 - 0xF4) is expected to be followed by 1-3 continuation bytes, in the range 0x80 - 0xBF.
However, I cannot always be sure where my input stream comes from; it may very well be raw data and not the ASCII HEX'ed version above. So I need to deal with this raw input somehow.
I've looked at a few alternatives, like:

using codecs.decode
using open("myfile.jpg", "rb", buffering=0) with raw i/o
using bytes.decode(encoding="utf-8", errors="ignore") from bytes
But I don't know if or how they could read a piped input stream, like I want.

How can I make my script also handle a raw byte stream?
PS. Yes, I have read loads of similar SO issues, but none of them adequately deal with this UTF-8 input error. The best one is this one. This is not a duplicate.
I finally managed to work around this issue by not using sys.stdin! Instead I used open(0, 'rb'). Where:

0 is the file descriptor equivalent to stdin.
'rb' means binary mode for reading.

This seems to circumvent the issue of the system trying to interpret your locale characters in the pipe. I got the idea after seeing that the following worked and returned the correct (non-printable) characters:
echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"
▒▒▒
▒▒
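The same round trip can be reproduced non-interactively by piping the bytes through a child interpreter (a sketch of my own, using subprocess instead of the shell):

```python
import subprocess
import sys

# The child reads fd 0 in binary mode and echoes the bytes back
# untouched, so no locale decoding can get in the way.
child = "import sys; data = open(0, 'rb').read(); sys.stdout.buffer.write(data)"
raw = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
result = subprocess.run([sys.executable, "-c", child],
                        input=raw, capture_output=True)
print(result.stdout == raw)  # True: every byte survives the round trip
```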
So to correctly read any pipe data, I used:
if not sys.stdin.isatty():
    try:
        with open(0, 'rb') as f:
            inpipe = f.read()
    except Exception as e:
        err_unknown(e)
    # This can't happen in binary mode:
    #except UnicodeDecodeError as e:
    #    err_unicode(e)
...
That will read your pipe data into a Python byte string.

The next problem was to determine whether the pipe data was coming from a character string (like echo "BADDATA0") or from a binary stream. The latter can be emulated by echo -ne "\xBA\xDD\xAT\xA0", as shown in the OP. In my case I just used a RegEx to look for out-of-bounds non-ASCII characters.
if inpipe:
    rx = re.compile(b'[^0-9a-fA-F ]+')
    r = rx.findall(inpipe.strip())
    if r == []:
        print("is probably a HEX ASCII string")
    else:
        print("is something else, possibly binary")
Surely this could be done better and smarter. (Feel free to comment!)
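One assumed cleaner variant of that check (my own sketch, not the original code): treat the input as HEX ASCII only when every byte is a hex digit or whitespace:

```python
# Bytes allowed in a HEX ASCII string: hex digits plus whitespace.
HEX_CHARS = frozenset(b"0123456789abcdefABCDEF \t\r\n")

def looks_like_hex_ascii(data: bytes) -> bool:
    """Return True when data could plausibly be a HEX ASCII string."""
    stripped = data.strip()
    return bool(stripped) and all(b in HEX_CHARS for b in stripped)

print(looks_like_hex_ascii(b"ed ff ff 0b 04 00 a0 e1"))  # True
print(looks_like_hex_ascii(b"\xed\xff\xff\x0b"))         # False
```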
Addendum: (from here)

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r', which means open for reading in text mode. In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, synonym of 'rt'). For binary read-write access, the mode 'w+b' opens and truncates the file to 0 bytes. 'r+b' opens the file without truncation. ... Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given. If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default), otherwise an error will be raised.
Here is a hacky way to read stdin in binary, like a file:
import sys
with open(sys.stdin.fileno(), mode='rb', closefd=False) as stdin_binary:
    raw_input = stdin_binary.read()
try:
    # text is the string formed by decoding raw_input as unicode
    text = raw_input.decode('utf-8')
except UnicodeDecodeError:
    # raw_input is not valid unicode, do something else with it
    pass
Alternatively, use sys.stdin.buffer.raw instead of sys.stdin.
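A quick demonstration of that suggestion (my own sketch; it uses sys.stdin.buffer, the buffered binary layer, while .raw is the unbuffered layer beneath it):

```python
import subprocess
import sys

# The child copies stdin to stdout through the binary buffer objects,
# so no text decoding happens at any point.
child = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"
raw = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
out = subprocess.run([sys.executable, "-c", child],
                     input=raw, capture_output=True).stdout
print(out == raw)  # True
```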