How does the following regex work?

Let's say I have a string that I want to parse from an opening double-quote to a closing double-quote:

asdf"pass\"word"asdf

I was lucky enough to discover that the following PCRE matches from the opening double-quote to the closing double-quote while ignoring the escaped double-quote in the middle (so that the logical unit is parsed properly):

".*?(?:(?!\\").)"

Match:

"pass\"word"

However, I have no idea why this PCRE matches the opening and closing double-quotes properly.

I know the following:

" = literal double-quote

.*? = lazy matching of zero or more of any character

(?: = opening of non-capturing group

(?!\\") = asserts that it is impossible to match a literal \" at this position

. = single character

) = closing of non-capturing group

" = literal double-quote

It appears that a single character and a negative lookahead are a part of the same logical group. To me, this means the PCRE is saying "Match from a double-quote to zero or more of any character as long as there is no \" right after the character, then match one more character and one single double-quote."

However, according to that logic the PCRE would not match the string at all.

Could someone help me wrap my head around this?
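For reference, the token-by-token reading in the question can be written as a commented pattern. This is only a sketch using Python's re module (an assumption; the question is about PCRE, but these particular constructs behave the same way there), and the names PATTERN and text are made up for illustration:

```python
import re

# The question's pattern, spelled out with re.VERBOSE so each piece can carry a comment.
PATTERN = re.compile(r'''
    "            # literal double-quote (opening)
    .*?          # lazily match zero or more characters
    (?:          # open non-capturing group
      (?!\\")    #   negative lookahead: the next two characters are not \"
      .          #   any single character
    )            # close non-capturing group
    "            # literal double-quote (closing)
''', re.VERBOSE)

text = r'asdf"pass\"word"asdf'   # the sample string from the question
match = PATTERN.search(text)
print(match.group(0))
```

Note that the lookahead sits inside the group, so it is only checked at the one position where the group's single character is matched.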

It's easier to understand if you change the non-capture group to be a capture group.

Lazy matching generally moves forward one character at a time (vs. greedy matching, which takes everything it can and then gives up what it must). But it "moves forward" only as far as is needed to satisfy the required parts of the pattern after it, which is accomplished by letting the .*? match everything up to r, then letting the negative lookahead + . match the d.
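The "one character at a time" behaviour can be imitated by hand. The following is a simplified sketch in Python, not how a real engine is implemented: it grows the lazy prefix by one character per step and checks whether the rest of the pattern (the group plus the closing quote) matches right after it. The helper lazy_steps and the variable names are hypothetical.

```python
import re

text = r'asdf"pass\"word"asdf'
start = text.index('"') + 1            # position right after the opening quote
tail = re.compile(r'(?:(?!\\").)"')    # the part of the pattern that follows .*?

def lazy_steps(text, start):
    """Yield (prefix, ok) for each prefix length tried, shortest first.

    ok tells whether the rest of the pattern matches immediately after
    the prefix; the loop stops at the first success, as a lazy quantifier would.
    """
    for end in range(start, len(text) + 1):
        prefix = text[start:end]
        ok = tail.match(text, end) is not None
        yield prefix, ok
        if ok:
            return

steps = list(lazy_steps(text, start))
for prefix, ok in steps:
    print(repr(prefix), ok)
```

Every step fails until the prefix reaches the r. The step where the prefix is pass fails because of the lookahead (the next two characters are \"), and the step after it fails simply because the character following the escaped quote is not a closing quote.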

Update: you asked in a comment:

how come it matches up to the r at all? shouldn't the negative lookahead prevent it from getting past the \" in the string? thanks for helping me understand, by the way

No, because it is not the negative lookahead part that matches it. The lookahead is only tested at the position of the single character right before the closing quote. That is why I suggested changing the non-capturing group into a capturing group, so that you can see it is .*? that matches the \" , not (?:(?!\\").)

.*? has the potential to match the entire string, and the regex engine uses that to satisfy the requirement to match the rest of the pattern.
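A quick way to check this, sketched here with Python's re module (an assumption; the same idea applies to PCRE). For illustration a capturing group is also put around .*? so that both parts are visible separately:

```python
import re

text = r'asdf"pass\"word"asdf'

# Group 1 wraps the lazy part; group 2 is the former non-capturing group.
m = re.search(r'"(.*?)((?!\\").)"', text)

lazy_part = m.group(1)   # what .*? consumed
last_char = m.group(2)   # what the lookahead + . consumed
print(repr(lazy_part), repr(last_char))
```

Group 1 holds everything from p up to r, including the escaped quote, while group 2 holds only the final d.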

Update 2:

It is effectively the same as doing this: ".*?[^\\]" which is probably a lot easier to wrap your head around.
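As a rough check of that equivalence, the sketch below (Python's re module, with made-up sample strings) compares the first match of both patterns. It only covers single-line input: by default . does not match a newline while [^\\] does, so the two can differ when the character before the closing quote is a newline.

```python
import re

lookahead = re.compile(r'".*?(?:(?!\\").)"')   # pattern from the question
char_class = re.compile(r'".*?[^\\]"')         # simplified form

samples = [
    r'asdf"pass\"word"asdf',
    r'x = "plain" + "two"',
    r'"a\"b\"c" tail',
]

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

results = [(first_match(lookahead, s), first_match(char_class, s)) for s in samples]
for a, b in results:
    print(a, b, a == b)
```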

A (slightly) better pattern would be to use a negative lookbehind like so: ".*?(?<!\\)" because it allows an empty string "" to be matched (a valid match in many contexts). However, negative lookbehinds aren't supported in all engines/languages (judging from your tags, PCRE supports them, but I don't think you can really do this in bash except via e.g. grep -P '[pattern]', which uses a Perl-compatible regex engine).
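The difference on an empty quoted string can be seen with a small sketch (Python's re module, which supports this fixed-width lookbehind; the sample strings are made up):

```python
import re

lookbehind = re.compile(r'".*?(?<!\\)"')   # closing quote must not follow a backslash
char_class = re.compile(r'".*?[^\\]"')     # needs at least one character between the quotes

empty = r'say "" now'
escaped = r'asdf"pass\"word"asdf'

empty_with_lookbehind = lookbehind.search(empty)
empty_with_char_class = char_class.search(empty)
escaped_with_lookbehind = lookbehind.search(escaped)

print(empty_with_lookbehind.group(0))     # the empty quoted string is found
print(empty_with_char_class)              # no match at all
print(escaped_with_lookbehind.group(0))
```

The character-class version consumes one character before the closing quote, so it cannot match two adjacent quotes; the lookbehind consumes nothing.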

Nothing to add to Crayon Violent's explanation, only a little disambiguation and some ways to match substrings enclosed between double quotes (possibly containing quotes escaped by a backslash).

First, it seems that in your question you use the acronym "PCRE" (Perl Compatible Regular Expressions), which is the name of a particular regex engine (and, by extension and somewhat imprecisely, of its syntax), in place of the word "pattern", which is the regular expression that describes a set of strings (whatever the regex engine used).

With Bash:

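A='asdf"pass\"word"asdf'
# ERE: a quote, then any run of (non-quote, non-backslash char | backslash + any char), then a quote
pattern='"(([^"\\]|\\.)*)"'

# BASH_REMATCH[1] holds the content between the quotes
[[ $A =~ $pattern ]] && printf '%s\n' "${BASH_REMATCH[1]}"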

You can use this pattern too: pattern='"(([^"\\]+|\\.)*)"'

With a PCRE regex engine, you can use the first pattern, but it's better to rewrite it in a more efficient way:

"([^"\\]*+(?:\\.[^"\\]*)*+)"

Note that these three patterns don't need any lookaround. They are able to deal with any number of consecutive backslashes: "abc\\\"def" (a literal backslash and an escaped quote), "abcdef\\\\" (two literal backslashes, the quote is not escaped).
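The behaviour with consecutive backslashes can be checked with a short sketch in Python's re module. Two assumptions: the possessive quantifiers (*+) are written as plain * here, because Python only supports them from version 3.11 and they affect speed rather than the result in this case; and the surrounding text in the second sample is made up. The lookahead pattern from the question is included for comparison.

```python
import re

alternation = re.compile(r'"((?:[^"\\]|\\.)*)"')        # first pattern, non-capturing inner group
unrolled = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')    # efficient form, without possessive quantifiers
lookahead = re.compile(r'".*?(?:(?!\\").)"')            # pattern from the question

escaped_quote = r'"abc\\\"def"'             # escaped backslash, then escaped quote
two_backslashes = r'"abcdef\\\\" and "x"'   # two escaped backslashes, then the real closing quote

for pattern in (alternation, unrolled):
    print(pattern.search(escaped_quote).group(1))
    print(pattern.search(two_backslashes).group(1))

# The lookahead version treats the closing quote as escaped and runs on to the next quote.
overrun = lookahead.search(two_backslashes).group(0)
print(overrun)
```

Because both alternatives consume an escape as a two-character unit (backslash plus the following character), an even number of backslashes before a quote is read as escaped backslashes and the quote is left free to close the string.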
