
Access system input as binary in Python

I am a fan of Python 3's foregrounding of Unicode issues. However, in one place I'm not sure what it is doing.

As I understand it, argv and the environment variables are transmitted from the OS to the python executable as bytes. Python chooses an encoding, and the data is exposed to the user program as Unicode strings in sys.argv and os.environ.

I can't figure out how Python chooses this encoding. I thought it was controlled by the LC_* variables, but that doesn't seem to work.

$ printf -v CENTS '\xC2\xA2' ; export CENTS ; echo "0xC2 0xA2 in UTF-8 is $CENTS"
0xC2 0xA2 in UTF-8 is ¢
$ printf -v LBS '\xC2\xA3' ; echo "0xC2 0xA3 in UTF-8 is $LBS"
0xC2 0xA3 in UTF-8 is £
$ cat <<EOF >test.py
import os, sys
print("0xC2 0xA2 decodes to", *(hex(ord(c)) for c in os.environ.get("CENTS")))
print("0xC2 0xA3 decodes to", *(hex(ord(c)) for c in sys.argv[1]))
EOF
$ python3 test.py $LBS
0xC2 0xA2 decodes to 0xa2
0xC2 0xA3 decodes to 0xa3
$ LC_ALL=es_ES.ISO8859-1 python3 test.py $LBS
0xC2 0xA2 decodes to 0xa2
0xC2 0xA3 decodes to 0xa3

I expected the second run to give 0xc2 0xa2 and 0xc2 0xa3, but it seems LC_ALL made no difference.

Is there any way to bypass the encoding and just see the binary data provided to the executable?

Optionally, how does Python choose an encoding, and where does it expose it? I thought it was exposed in sys.getfilesystemencoding(), but that has very sparse docs which do not clarify anything. Pointers to official documentation would be greatly appreciated.

Per the linked answers and the documentation they reference, here is a short answer:

For os.environ, see os.environb, which is available on non-Windows systems and provides direct access to the underlying bytes.
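As a rough illustration of the os.environb point, here is a minimal sketch, assuming a POSIX system where os.environb exists. The names CHILD and raw_env_hex are made up for this example; the variable name CENTS is borrowed from the question. It starts a child interpreter with CENTS set to arbitrary bytes and has the child report what it finds in os.environb.

```python
import os
import subprocess
import sys

# Hypothetical child script: print the raw bytes of CENTS as hex.
# os.environb is only available where the OS environment is bytes (POSIX).
CHILD = 'import os; print(os.environb[b"CENTS"].hex())'


def raw_env_hex(value: bytes) -> str:
    """Run a child Python with CENTS=value and return the bytes it saw, as hex."""
    env = dict(os.environb)      # copy of the current environment, bytes keys and values
    env[b"CENTS"] = value        # no decoding happens on the way in
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling raw_env_hex(b"\xc2\xa2") should hand back the two bytes unchanged, and the same goes for byte sequences that are not valid UTF-8, since os.environb never decodes the value.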

sys.argv is automatically decoded to str using the filesystem encoding (the one reported by sys.getfilesystemencoding(), which on Unix-like systems is derived from the locale, I think via LANG and the LC_* variables) together with the surrogateescape error handler, and the original bytes are not directly exposed. To recover them, more or less reliably I think, you can use os.fsencode, which reverses that decoding.
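The os.fsencode round trip can be sketched as follows, again assuming a POSIX system. CHILD and raw_argv_hex are hypothetical names for this example. The child re-encodes sys.argv[1] with os.fsencode and prints the result as hex, next to whatever sys.getfilesystemencoding() reports on that machine.

```python
import os
import subprocess
import sys

# Hypothetical child script: show the filesystem encoding in use and the
# bytes recovered from the decoded argument.
CHILD = (
    "import os, sys; "
    "print(sys.getfilesystemencoding(), os.fsencode(sys.argv[1]).hex())"
)


def raw_argv_hex(arg: bytes) -> str:
    """Pass arg as a raw bytes argument and return what os.fsencode recovers, as hex."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, arg],   # bytes arguments are accepted on POSIX
        capture_output=True,
        text=True,
        check=True,
    )
    _encoding, recovered = result.stdout.split()
    return recovered
```

Because undecodable bytes are mapped to lone surrogates on the way in and mapped back by os.fsencode on the way out, raw_argv_hex(b"\xc2\xa3") and even raw_argv_hex(b"\xff") should return the original bytes, whichever filesystem encoding is active.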

I have a feeling this can be gamed, but I will follow up on that later.
