
Escaping regex string

I want to use input from a user as a regex pattern for a search over some text. It works, but how can I handle cases where the user enters characters that have a special meaning in regex?

For example, the user wants to search for Word (s): the regex engine will take the (s) as a group. I want it to be treated as the string "(s)". I can run replace on the user input and replace ( with \( and ) with \), but the problem is that I would need to do a replace for every possible regex symbol.

Do you know a better way?

Use the re.escape() function for this:

4.2.3 re Module Contents

escape(string)

Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

A simplistic example: search for any occurrence of the provided string, optionally followed by 's', and return the match object.

import re

def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    # re.search scans the whole text; re.match would only try the start.
    return re.search(word_or_plural, text)
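To make the Word (s) case from the question concrete, here is a minimal sketch that compares the raw pattern with the escaped one. The helper name find_literal and the sample text are made up for illustration.

```python
import re

def find_literal(user_input, text):
    """Search text for user_input taken literally (hypothetical helper)."""
    return re.search(re.escape(user_input), text)

text = "Enter the Word (s) here"

# Unescaped: "(s)" is a capture group, so the pattern means "Word s".
unescaped = re.search("Word (s)", text)

# Escaped: the parentheses are matched as ordinary characters.
escaped = find_literal("Word (s)", text)

print(unescaped)          # None, because the text has no "Word s"
print(escaped.group(0))   # Word (s)
```

The same helper works for any other metacharacter, so no per-symbol replace is needed.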

You can use re.escape():

re.escape(string): Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'

If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.

If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except specifically the underscore ( _ ).
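Since the exact set of escaped characters depends on the Python version, code should not rely on the precise escaped text. A small sketch of the property that should hold on any version that has re.fullmatch (3.4 and later); the sample strings are arbitrary:

```python
import re

samples = ["^a.*$", "Word (s)", "snake_case", "50% off!", "C:\\temp"]

for s in samples:
    pattern = re.escape(s)
    # Whatever extra characters this Python version chooses to escape,
    # the escaped pattern should still match exactly the original string.
    assert re.fullmatch(pattern, s) is not None
    print(repr(s), "->", repr(pattern))
```

The printed patterns may differ between versions, but the assertion should pass on all of them.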

Unfortunately, re.escape() is not suited for the replacement string (the output below comes from a Python version older than 3.3, where the underscore was still escaped):

>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'

A solution is to put the replacement in a lambda:

>>> re.sub('a', lambda _: '_', 'aa')
'__'

because the return value of the lambda is treated by re.sub() as a literal string.
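The effect is easier to see with a replacement that contains a backslash. The sketch below uses a made-up Windows-style path as the replacement text and also shows a second option, doubling the backslashes, which keeps the replacement a plain string:

```python
import re

replacement = r"C:\new"   # backslash followed by n, two separate characters

# As a plain string, the replacement is a template: \n becomes a newline.
as_template = re.sub("X", replacement, "path=X")

# A callable's return value is inserted without any template processing.
via_lambda = re.sub("X", lambda _m: replacement, "path=X")

# Doubling the backslashes keeps the string form but makes it literal.
via_doubling = re.sub("X", replacement.replace("\\", "\\\\"), "path=X")

print(as_template == "path=C:" + "\n" + "ew")   # True
print(via_lambda == r"path=C:\new")             # True
print(via_doubling == via_lambda)               # True
```

Both workarounds give the same literal result; which one reads better is a matter of taste.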

Owen's answer can lead to inconsistencies. A lambda should just be an inline replacement for a function call, but it produces different results, as shown below. If somebody later has to 'upgrade' the lambda to a function call, for instance to build in some extra complexity, this would suddenly break:

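import re

xml = """pre@mytag@123@/mytag@post"""
replacewith = '@mytag@456 \\1@/mytag@'
regexp = re.compile(r'@mytag@(.*?)@/mytag@', re.S|re.M|re.I)

def rw(inp):
    return inp

# The lambda's return value is inserted literally, so \1 is not expanded.
result = regexp.sub(lambda _: replacewith, xml)
print(result) # desired result

# rw(replacewith) is evaluated first and returns a plain string, so \1 is expanded as a backreference.
result = regexp.sub(rw(replacewith), xml)
print(result) # undesired result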
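As far as I can tell, the difference comes from passing a callable versus passing a string, not from lambda versus def. A sketch with two illustrative named functions (literal_repl and expanding_repl are made-up names), reusing the values from the example:

```python
import re

xml = "pre@mytag@123@/mytag@post"
replacewith = "@mytag@456 \\1@/mytag@"
regexp = re.compile(r"@mytag@(.*?)@/mytag@", re.S | re.M | re.I)

def literal_repl(match):
    # Returned text is inserted as-is; \1 stays a backslash and a 1.
    return replacewith

def expanding_repl(match):
    # match.expand() applies the same template rules a string argument gets.
    return match.expand(replacewith)

print(regexp.sub(literal_repl, xml))     # pre@mytag@456 \1@/mytag@post
print(regexp.sub(expanding_repl, xml))   # pre@mytag@456 123@/mytag@post
print(regexp.sub(replacewith, xml))      # pre@mytag@456 123@/mytag@post
```

A function passed to sub() receives the match object, so an upgraded lambda can keep the literal behaviour as long as the function itself, not its return value, is what gets passed.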

Usually you escape the string that you feed into a regex so that the regex treats those characters literally. Remember that you type strings into your computer and the computer inserts the specific characters. When you see \n in your editor, it is not really a newline until a parser decides it is. It is two characters. Once Python parses it as part of an ordinary string literal it becomes a newline, which is what print displays, but in the text you see in the editor it is just the backslash character followed by n. If you write r"\n", then Python will always keep it as the raw thing you typed in (as far as I understand).

To complicate things further, there is another syntax/grammar going on with regexes. The regex parser interprets the strings it receives differently than Python's print would. I believe this is why we are recommended to pass raw strings like r"(\n+)", so that the regex receives what you actually typed. However, the regex will receive a parenthesis and won't match it as a literal parenthesis unless you tell it to explicitly, using the regex's own syntax rules. For that you need something like r"(\fun \( x : nat \) :)": here the first parenthesis won't be matched since, lacking a backslash, it opens a capture group, but the second one will be matched as a literal parenthesis.
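The two layers described above (Python string literals first, then the regex parser) can be checked directly. This is only a small sketch; the sample pattern and text are invented for the demonstration:

```python
import re

# Layer 1: Python string literals.
print(len("\n"))    # 1, a single newline character
print(len(r"\n"))   # 2, a backslash followed by n

# Layer 2: the regex parser. Both patterns below match a newline, because
# the regex engine itself also understands the two-character sequence \n.
print(bool(re.fullmatch("\n", "\n")))    # True
print(bool(re.fullmatch(r"\n", "\n")))   # True

# Parentheses: unescaped they group, escaped they match literally.
m = re.search(r"(fun) \( x : nat \)", "fun ( x : nat )")
print(m.group(0))   # fun ( x : nat )
print(m.group(1))   # fun
```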

Thus we usually call re.escape(regex) on things we want to be interpreted literally, i.e. characters that the regex parser would otherwise treat specially (parentheses, spaces, etc.) will be escaped. For example, code I have in my app:

    # Escape non-alphanumeric characters so the string matches literally. I think this is here to keep the literal text separate from the regex we insert on the next line.
    __ppt = re.escape(_ppt)  # e.g. parentheses are matched literally instead of being read as a group

E.g. see these strings:

_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'

The double backslashes, I believe, are there so that the regex receives a literal backslash.

By the way, I am surprised it printed double backslashes instead of single ones. If anyone can comment on that, it would be appreciated. I'm also curious how to match literal backslashes in the regex now. I assume it's 4 backslashes, but I honestly expected only 2 would be needed due to the raw string r construct.
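On the doubled backslashes: both the Out[...] display and the {name=} f-string form show the repr() of the string, and repr() writes each backslash as two. The string itself contains single backslashes. A short sketch, which also shows one way to match a literal backslash:

```python
import re

escaped = re.escape("(x)")

print(repr(escaped))   # '\\(x\\)'  repr doubles each backslash for display
print(escaped)         # \(x\)      the actual characters in the string
print(len(escaped))    # 5

# Matching one literal backslash: the regex needs the two characters \\,
# which is r"\\" as a raw string or "\\\\" as an ordinary string literal.
text = "a\\b"          # three characters: a, backslash, b
print(bool(re.search(r"\\", text)))      # True
print(r"\\" == "\\\\")                   # True
print(re.escape("\\") == r"\\")          # True
```

So 2 backslashes are enough in a raw string, and 4 are only needed in an ordinary string literal.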

Please give it a try:

\Q and \E as anchors

Put an Or condition to match either a full word or regex.

Ref link: How to match a whole word that includes special characters in regex
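As far as I know, Python's re module does not have \Q...\E, so in Python the closest equivalent is re.escape(). A rough sketch of a whole-word match for input with special characters, under that assumption; the helper name whole_word_pattern and the sample sentences are made up:

```python
import re

def whole_word_pattern(word):
    """Build a pattern for word as a standalone token (hypothetical helper)."""
    # re.escape plays the role that \Q...\E plays in other regex flavours.
    # Lookarounds are used instead of \b, since \b needs a word character
    # on one side and the input may start or end with punctuation.
    return r"(?<!\w)" + re.escape(word) + r"(?!\w)"

pattern = whole_word_pattern("c++")
print(bool(re.search(pattern, "I like c++ a lot")))   # True
print(bool(re.search(pattern, "abc++ is not it")))    # False
```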
