What exactly is a "raw string regex" and how can you use it?

From the Python documentation on regex, regarding the '\' character:

The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.

What is this raw string notation? If you use a raw string format, does that mean "*" is taken as a literal character rather than a zero-or-more indicator? That obviously can't be right, or else regex would completely lose its power. But then if it's a raw string, how does it recognize newline characters if "\n" is literally a backslash and an "n"?

I don't follow.

Edit for bounty:

I'm trying to understand how a raw string regex matches newlines, tabs, and character sets, e.g. \w for words or \d for digits and so on, if raw string patterns don't recognize backslashes as anything more than ordinary characters. I could really use some good examples.

Zarkonnen's response does answer your question, but not directly. Let me try to be more direct, and see if I can grab the bounty from Zarkonnen.

You will perhaps find this easier to understand if you stop using the terms "raw string regex" and "raw string patterns". These terms conflate two separate concepts: the representations of a particular string in Python source code, and what regular expression that string represents.

In fact, it's helpful to think of these as two different programming languages, each with its own syntax. The Python language has source code that, among other things, builds strings with certain contents, and calls the regular expression system. The regular expression system has source code that resides in string objects, and matches strings. Both languages use backslash as an escape character.

First, understand that a string is a sequence of characters (i.e. bytes or Unicode code points; the distinction doesn't much matter here). There are many ways to represent a string in Python source code. A raw string is simply one of these representations. If two representations result in the same sequence of characters, they produce equivalent behaviour.

Imagine a 2-character string, consisting of the backslash character followed by the n character. If you know that the character value for backslash is 92, and for n is 110, then this expression generates our string:

s = chr(92)+chr(110)
print len(s), s

2 \n

The conventional Python string notation "\n" does not generate this string. Instead it generates a one-character string with a newline character. The Python docs, section 2.4.1, "String literals", say: "The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character."

s = "\n"
print len(s), s

1 
 

(Note that the newline isn't visible in this example, but if you look carefully, you'll see a blank line after the "1".)

To get our two-character string, we have to use another backslash character to escape the special meaning of the original backslash character:

s = "\\n"
print len(s), s

2 \n

What if you want to represent strings that have many backslash characters in them? The same section of the Python docs continues: "String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences." Here is our two-character string, using raw string representation:

s = r"\n"
print len(s), s

2 \n

So we have three different string representations, all giving the same string, or sequence of characters:

print chr(92)+chr(110) == "\\n" == r"\n"
True
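As a quick check of the comparison above, here is a minimal sketch written in Python 3 syntax (the examples in this answer use Python 2.7 print statements). The variable names are only illustrative.

```python
# Three spellings of the same two-character string.
built = chr(92) + chr(110)   # backslash (92) followed by "n" (110)
escaped = "\\n"              # conventional literal with an escaped backslash
raw = r"\n"                  # raw literal: the backslash is kept as written

assert built == escaped == raw
print(len(raw), raw)                  # 2 \n

# By contrast, the conventional literal "\n" is a single newline character.
newline = "\n"
print(len(newline), repr(newline))    # 1 '\n'
```

The repr() call is used only so that the newline shows up visibly in the output.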

Now, let's turn to regular expressions. The Python docs, section 7.2, "re - Regular expression operations", say: "Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals..."

If you want a Python regular expression object which matches a newline character, then you need a 2-character string, consisting of the backslash character followed by the n character. The following lines of code all set prog to a regular expression object which recognises a newline character:

import re

prog = re.compile(chr(92)+chr(110))
prog = re.compile("\\n")
prog = re.compile(r"\n")
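To see that the three spellings really do compile to equivalent patterns, one could run something like the following sketch (Python 3 syntax; the list name and loop are just for illustration):

```python
import re

# The same two-character pattern, spelled three ways in Python source.
patterns = [chr(92) + chr(110), "\\n", r"\n"]

# Each compiled pattern should match a string holding one newline character.
results = [re.compile(p).fullmatch("\n") is not None for p in patterns]
print(results)   # [True, True, True]
```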

So why is it that "Usually patterns will be expressed in Python code using this raw string notation"? Because regular expressions are frequently static strings, which are conveniently represented as string literals. And of the different string literal notations available, raw strings are a convenient choice when the regular expression includes a backslash character.

Questions

Q: What about the expression re.compile(r"\s\tWord")?

A: It's easier to understand by separating the string from the regular expression compilation, and understanding them separately.

s = r"\s\tWord"
prog = re.compile(s)

The string s contains eight characters: a backslash, an s, a backslash, a t, and then the four characters Word.

Q: What happens to the tab and space characters?

A: At the Python language level, string s doesn't contain a tab or a space character. It starts with four characters: backslash, s, backslash, t. The regular expression system, meanwhile, treats that string as source code in the regular expression language, where it means "match a string consisting of a whitespace character, a tab character, and the four characters Word".

Q: How do you match those if that's being treated as backslash-s and backslash-t?

A: Maybe the question is clearer if the words "you" and "that" are made more specific: how does the regular expression system match the expressions backslash-s and backslash-t? As "any whitespace character" and as "tab character".
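The two levels can be observed separately. The sketch below (Python 3 syntax, test strings chosen only for illustration) first inspects the string itself, then what the compiled pattern matches:

```python
import re

s = r"\s\tWord"

# Python level: eight characters, none of them a tab or a space.
print(len(s))                   # 8
print("\t" in s, " " in s)      # False False

# Regex level: "\s" means any whitespace character, "\t" means a tab.
prog = re.compile(s)
space_tab = prog.fullmatch(" \tWord") is not None
newline_tab = prog.fullmatch("\n\tWord") is not None
literal_text = prog.fullmatch(s) is not None
print(space_tab, newline_tab, literal_text)   # True True False
```

The last check fails because the pattern does not match its own source text: a backslash is not a whitespace character.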

Q: Or what if you have the 3-character string backslash-n-newline?

A: In the Python language, the 3-character string backslash-n-newline can be represented as the conventional string "\\n\n", or as a raw plus conventional string r"\n" "\n", or in other ways. The regular expression system, given the 3-character string backslash-n-newline as a pattern, matches any two consecutive newline characters.
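A small sketch of that last case, in Python 3 syntax and with illustrative names, assuming no re.VERBOSE flag is in use (under that flag a literal newline in the pattern would be ignored):

```python
import re

three = "\\n\n"                 # backslash, n, newline
assert three == r"\n" "\n"      # adjacent literals are concatenated
assert len(three) == 3

prog = re.compile(three)
two_newlines = prog.fullmatch("\n\n") is not None
one_newline = prog.search("a\nb") is not None
print(two_newlines, one_newline)   # True False
```

The escape \n and the literal newline character each match one newline, so the pattern as a whole needs two in a row.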

NB: All examples and document references are to Python 2.7.

Update: Incorporated clarifications from the answers of @Vladislav Zorov and @m.buettner, and from the follow-up question of @Aerovistae.

Most of these questions have a lot of words in them and maybe it's hard to find the answer to your specific question.

If you use a regular string and you pass in a pattern like "\t" to the RegEx parser, Python will translate that literal into a buffer with the tab byte in it (0x09).

If you use a raw string and you pass in a pattern like r"\t" to the RegEx parser, Python does not do any interpretation, and it creates a buffer with two bytes in it: '\' and 't' (0x5c, 0x74).

The RegEx parser knows what to do with the sequence '\t': it matches that against a tab. It also knows what to do with the 0x09 character: that also matches a tab. For the most part, the results will be indistinguishable.

So the key to understanding what's happening is recognizing that there are two parsers being employed here. The first one is the Python parser, and it translates your string literal (or raw string literal) into a sequence of bytes. The second one is Python's regular expression parser, and it converts a sequence of bytes into a compiled regular expression.
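The two parsers can be watched at work with a short sketch like this one (Python 3 syntax, where the "bytes" above are characters; the names cooked and raw are only illustrative):

```python
import re

cooked = "\t"    # the Python parser produces one character: TAB
raw = r"\t"      # the Python parser produces two characters: backslash, t

print([hex(ord(c)) for c in cooked])   # ['0x9']
print([hex(ord(c)) for c in raw])      # ['0x5c', '0x74']

# The regex parser accepts either form and matches a tab in both cases.
cooked_matches = re.fullmatch(cooked, "\t") is not None
raw_matches = re.fullmatch(raw, "\t") is not None
print(cooked_matches, raw_matches)     # True True
```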

The issue with using a normal string to write regexes that contain a \ is that you end up having to write \\ for every \. So the string literals "stuff\\things" and r"stuff\things" produce the same string. Raw strings get especially useful if you want to write a regular expression that matches against backslashes.

Using normal strings, a regexp that matches the string \ would be "\\\\"!

Why? Because we have to escape \ twice: once for the regular expression syntax, and once for the string syntax.
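As an illustration of the double escaping, here is a minimal sketch in Python 3 syntax; the sample path is made up for the example:

```python
import re

normal = "\\\\"   # four source characters, two characters in the string
raw = r"\\"       # two source characters, two characters in the string
assert normal == raw and len(raw) == 2

# Either spelling is the regex "\\", which matches one literal backslash.
single = re.fullmatch(raw, "\\") is not None
found = re.findall(raw, r"C:\temp\file")
print(single, len(found))   # True 2
```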

You can use triple quotes to include newlines, like this:

r'''stuff\
things'''

Note that usually, Python would treat \-newline as a line continuation, but this is not the case in raw strings. Also note that backslashes still escape quotes in raw strings, but are themselves left in the string. So the raw string literal r"\"" produces the string \". This means you can't end a raw string literal with a single backslash (or any odd number of backslashes).
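These quoting rules can be checked with a short sketch (Python 3 syntax). Using compile() on a piece of source text is just one convenient way to probe the invalid literal without breaking the example itself:

```python
# A backslash before a quote in a raw string stays in the result.
quoted = r"\""
print(len(quoted), quoted)       # 2 \"

# In a raw triple-quoted string, backslash-newline is kept as two characters.
multi = r'''stuff\
things'''
print("\\\n" in multi)           # True

# The source text r"\" is not a complete literal: the backslash
# escapes the closing quote, so the string never terminates.
source = 'r"\\"'
try:
    compile(source, "<demo>", "eval")
    trailing_backslash_ok = True
except SyntaxError:
    trailing_backslash_ok = False
print(trailing_backslash_ok)     # False
```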

See the lexical analysis section of the Python documentation for more information.

You seem to be struggling with the idea that a RegEx isn't part of Python, but instead a different programming language with its own parser and compiler. Raw strings help you get the "source code" of a RegEx safely to the RegEx parser, which will then assign meaning to character sequences like \d, \w, \n, etc.

The issue exists because Python and RegExps both use \ as the escape character, which is, by the way, a coincidence: there are languages with other escape characters (like "`n" for a newline, but even there you have to use "\n" in RegExps). The advantage is that you don't need to differentiate between raw and non-raw strings in these languages; the language and the RegEx engine won't both try to convert the text and butcher it, because they react to different escape sequences.

The relevant Python manual section ("String and Bytes literals") has a clear explanation of raw string literals:

Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and treat backslashes as literal characters. As a result, in string literals, '\U' and '\u' escapes in raw strings are not treated specially. Given that Python 2.x's raw unicode literals behave differently than Python 3.x's the 'ur' syntax is not supported.

New in version 3.3: The 'rb' prefix of raw bytes literals has been added as a synonym of 'br'.

New in version 3.3: Support for the unicode legacy literal (u'value') was reintroduced to simplify the maintenance of dual Python 2.x and 3.x codebases. See PEP 414 for more information.

In triple-quoted strings, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the string. (A "quote" is the character used to open the string, i.e. either ' or ".)

Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:

Escape Sequence   Meaning                           Notes
\newline          Backslash and newline ignored
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo    (1,3)
\xhh              Character with hex value hh       (2,3)

Escape sequences only recognized in string literals are:

Escape Sequence   Meaning                                         Notes
\N{name}          Character named name in the Unicode database    (4)
\uxxxx            Character with 16-bit hex value xxxx            (5)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx        (6)

Notes:

  1. As in Standard C, up to three octal digits are accepted.

  2. Unlike in Standard C, exactly two hex digits are required.

  3. In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.

  4. Changed in version 3.3: Support for name aliases [1] has been added.

  5. Individual code units which form parts of a surrogate pair can be encoded using this escape sequence. Exactly four hex digits are required.

  6. Any Unicode character can be encoded this way, but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default). Exactly eight hex digits are required.

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.) It is also important to note that the escape sequences only recognized in string literals fall into the category of unrecognized escapes for bytes literals.

Even in a raw string, string quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.

\n is an Escape Sequence in Python

\w is a Special Sequence in (Python) Regex

They look like they are in the same family but they are not. Raw string notation will affect Escape Sequences but not Regex Special Sequences.

For more about Escape Sequences, search for "\newline" at https://docs.python.org/3/reference/lexical_analysis.html

For more about Special Sequences, search for "\number" at https://docs.python.org/3/library/re.html

A raw string does not affect special sequences in Python regex such as \w and \d. It only affects escape sequences such as \n. So most of the time it doesn't matter whether we write r in front or not.

I think that is the answer most beginners are looking for.
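One case where the prefix does change the result is \b, which is both a Python escape sequence (backspace) and a regex special sequence (word boundary). A minimal sketch in Python 3 syntax, with made-up sample text:

```python
import re

text = "a cat sat"

# Raw string: the regex parser sees backslash-b, a word boundary.
with_raw = re.findall(r"\bcat\b", text)

# Normal string: Python turns "\b" into a backspace character first,
# so the pattern looks for backspace-cat-backspace.
without_raw = re.findall("\bcat\b", text)

print(with_raw, without_raw)   # ['cat'] []

# With the backslash doubled, the normal literal is the same pattern again.
same_digits = re.findall("\\d+", "room 101") == re.findall(r"\d+", "room 101")
print(same_digits)             # True
```

Writing patterns as raw strings by default avoids having to remember which sequences overlap in this way.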
