
Is it possible to suppress Python's escape sequence processing on a given string without using the raw specifier?

Conclusion: It is impossible to override or disable Python's built-in escape sequence processing so that you can skip the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone designs objects that work on complex strings (like regex) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!




Original question: I am finding it a bit difficult to force Python not to "change" anything about a user-inputted string, which may contain, among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings and .encode('string-escape') (and its decode counterpart), but I can't find the right approach.

Given an escaped hexadecimal representation of the documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334 and using .encode(), this small script (called x.py):

#!/usr/bin/env python

class foo(object):
    __slots__ = ("_bar",)
    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"

    def _get_bar(self): return self._bar
    bar = property(_get_bar)
#

x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar


will yield the following output when executed:

$ ./x.py
 \x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4


Note that the \x20 got converted to an ASCII space character, along with a few others. This is basically correct, because Python processes the escaped hex sequences in the literal and shows the printable ones as their ASCII characters.


This can be solved if the initializer to foo() is treated as a raw string (and the .encode() call removed), like this:

x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")


However, my end goal is to create a kind of framework, and I want to hide these kinds of "implementation details" from the end user. If they call foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in, without knowing about or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.



Edit: Per this SO question, it seems to be a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines which escape sequences Python processed in a string and converts them back to their original format. This is not going to be fun...


Edit2: Another example, given the sample regex below:

"^.{0}\xcb\x00\x71[\x00-\xff]"


If I assign this to a variable or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed, resulting in this output:

"^.{0}\xcb\x00q[\x00-\xff]"


How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing, or to "revert" it after the fact so that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?

I think you have an understandable confusion about the difference between Python string literals (the source code representation), Python string objects in memory, and how those objects can be printed (in what format they can be represented in the output).

If you read some bytes from a file into a bytestring, you can write them back as is.

r"" exists only in source code; there is no such thing at runtime, i.e., r"\x" and "\\x" are equal, and they may even be the exact same string object in memory.
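A minimal sketch of this point, written for Python 3 (an assumption; the snippets in this thread are Python 2). It shows that the raw prefix only changes how the literal is parsed, and that once an escape has been resolved nothing records how the string was spelled in source, which is why it cannot be "reverted" afterwards:

```python
# The raw prefix affects only how the source literal is parsed.
raw = r"\x71"        # four characters: backslash, x, 7, 1
escaped = "\\x71"    # the same four characters, spelled with an escaped backslash
processed = "\x71"   # one character: the escape was resolved when the literal was compiled

assert raw == escaped
assert len(raw) == 4

# After compilation, "\x71" and "q" are indistinguishable string objects.
assert processed == "q"
assert len(processed) == 1

print(repr(raw), repr(processed))
```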

To see that the input is not corrupted, you could print each byte as an integer:

print " ".join(str(ord(c)) for c in raw_input("input something"))  # join() needs strings, not ints

Or just echo it as is (there could be a difference, but it is unrelated to your "string-escape" issue):

print raw_input("input something")

Identity function:

def identity(obj):
    return obj

If you do nothing to the string, then your users will receive the exact same object back. You can provide examples in the docs of what you consider a concise, readable way to represent input strings as Python literals. If you find it confusing to work with binary strings such as "\x20\x01", then you could accept an ASCII hex representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to the other).
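As a rough illustration of the hex-representation suggestion, here is a sketch using the standard binascii module. It is written for Python 3 (an assumption; the thread itself uses Python 2), and the variable names are only illustrative:

```python
import binascii

# The 16 raw bytes of the IPv6 address from the question.
packed = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"

# bytes -> plain ASCII hex text that contains no backslashes at all.
hex_text = binascii.hexlify(packed).decode("ascii")

# ASCII hex text -> the original bytes.
round_trip = binascii.unhexlify(hex_text)

assert round_trip == packed
print(hex_text)
```

Because the hex text contains no backslashes, users can type it as an ordinary string literal and the raw prefix never comes into play.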


The regex case is more complex because there are two languages involved:

  1. Escape sequences are interpreted by Python according to its string literal syntax
  2. The regex engine interprets the string object as a regex pattern, which also has its own escape sequences
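The two layers can be seen with the standard re module. This is a sketch for Python 3 (an assumption; the thread uses Python 2), using a shortened form of the regex from the question. For \x71 both layers happen to agree, so the pattern matches "q" either way; for \b they disagree:

```python
import re

# Python resolves \x71 in the plain literal, so the regex engine receives "q".
from_plain = re.compile("^.{0}\x71")
# With a raw literal the regex engine receives \x71 and resolves it itself.
from_raw = re.compile(r"^.{0}\x71")

assert from_plain.pattern != from_raw.pattern   # different pattern text...
assert from_plain.match("q") is not None        # ...but the same match behaviour
assert from_raw.match("q") is not None

# The layers diverge when an escape means something different in each language.
backspace = re.compile("\b")   # Python turns \b into the backspace character (0x08)
boundary = re.compile(r"\b")   # the regex engine reads \b as a word boundary

assert backspace.search("a b") is None
assert boundary.search("a b") is not None
```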

I think you will have to go the join route.

Here's an example:

>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
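For readers on Python 3, a possible equivalent of the session above is sketched below. The helper name to_hex_escapes is hypothetical, and the sketch assumes the data is held as a bytes object, since iterating over bytes yields integers in Python 3:

```python
def to_hex_escapes(data):
    """Render every byte of `data` as a literal \\xNN escape (hypothetical helper)."""
    return "".join("\\x{:02x}".format(byte) for byte in data)

x = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
print(to_hex_escapes(x))
```

Note that this produces one uniform spelling rather than recovering the original one: a byte that was typed as a plain q in source would also come out as \x71.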

I'm not entirely sure why you need that, though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format and stick to it.
