
Why can't you use repetition quantifiers in zero-width look behind assertions?

I was always under the impression that you couldn't use repetition quantifiers in zero-width assertions (Perl Compatible Regular Expressions [PCRE]). However, it recently dawned on me that you can use them in look ahead assertions.

How does the PCRE regex engine work when searching with zero-width look behinds, such that repetition quantifiers cannot be used?

Here is a simple example using PCRE in R:

# Our string
x <- 'MaaabcccM'

##  Does it contain a 'b', preceded by an 'a' and followed by zero or more 'c',
##  then an 'M'?
grepl( '(?<=a)b(?=c*M)' , x , perl = TRUE )
# [1] TRUE

##  Does it contain a 'b': (1) preceded by an 'M' and then zero or more 'a' and
##                         (2) followed by zero or more 'c' then an 'M'?
grepl( '(?<=Ma*)b(?=c*M)' , x , perl = TRUE )
# Error in grepl("(?<=Ma*)b(?=c*M)", x, perl = TRUE) :
#   invalid regular expression '(?<=Ma*)b(?=c*M)'
# In addition: Warning message:
# In grepl("(?<=Ma*)b(?=c*M)", x, perl = TRUE) : PCRE pattern compilation error
#         'lookbehind assertion is not fixed length'
#         at ')b(?=c*M)'

The ultimate answer to such a question is in the engine's code, and at the bottom of the answer you'll be able to dive into the section of the PCRE engine's code responsible for ensuring fixed length in lookbehinds, if you're interested in the finest details. In the meantime, let's gradually zoom into the question from higher levels.

Variable-Width Lookbehind vs. Infinite-Width Lookbehind

First off, a quick clarification on terms. A growing number of engines (including PCRE) support some form of variable-width lookbehind, where the variation falls within a determined range, for instance:

  • the engine knows that the width of what precedes must be between 5 and 10 characters (not supported in PCRE)
  • the engine knows that the width of what precedes must be either 5 or 10 characters (supported in PCRE)

In contrast, in infinite-width lookbehind, you can use quantified tokens such as a+ .

Engines that Support Infinite-Width Lookbehind

For the record, these engines support infinite lookbehind:

  • .NET (C#, VB.NET etc.)
  • Matthew Barnett's regex module for Python
  • JGSoft (EditPad etc.; not available in a programming language).

As far as I know, they are the only ones.

Variable Lookbehind in PCRE

In PCRE, the most relevant section in the documentation is this:

The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several top-level alternatives, they do not all have to have the same fixed length.

Therefore, the following lookbehind is valid:

(?<=a |big )cat

However, none of these are:

  • (?<=a\s?|big )cat (the first alternative, a\s? , does not have a fixed width)
  • (?<=@{1,10})cat (variable width)
  • (?<=\R)cat ( \R does not have a fixed width, as it can match \n , \r\n , etc.)
  • (?<=\X)cat ( \X does not have a fixed width, as a Unicode grapheme cluster can contain a variable number of code points.)
  • (?<=a+)cat (clearly not fixed)

Lookbehind with Zero-Width Match but Infinite Repetition

Now consider this:

(?<=(?=@+))(cat#+)

On the face of it, this is a fixed-width lookbehind, because it can only ever find a zero-width match (defined by the lookahead (?=@+) ). Is that a trick to get around the infinite lookbehind limitation?

No. PCRE will choke on this. Even though the content of the lookbehind is zero-width, PCRE will not allow infinite repetition in the lookbehind. Anywhere. When the documentation says all the strings it matches must have a fixed length, it should really be:

All the strings that any of its components matches must have a fixed length.

Workarounds: Life without Infinite Lookbehind

In PCRE, the two main solutions to problems where infinite lookbehinds would help are \K and capture groups.

Workaround #1: \K

The \K assertion tells the engine to drop what was matched so far from the final match it returns.

Suppose you want (?<=@+)cat#+ , which is not legal in PCRE. Instead, you can use:

@+\Kcat#+

Workaround #2: Capture Groups

Another way to proceed is to match whatever you would have placed in a lookbehind, and to capture the content of interest in a capture group. You then retrieve the match from the capture group.

For instance, instead of the illegal (?<=@+)cat#+ , you would use:

@+(cat#+)

In R, this could look like this:

# Example input (illustrative only)
subject <- "@@@cat###"
matches <- regexpr("@+(cat#+)", subject, perl=TRUE)
result <- attr(matches, "capture.start")[,1]
attr(result, "match.length") <- attr(matches, "capture.length")[,1]
regmatches(subject, result)

In languages that don't support \K , this is often the only solution.
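Python's standard re module is one such case: it has no \K and it rejects (?<=@+) at compile time. A minimal sketch of the capture-group workaround there, assuming a made-up subject string that is not part of the original answer:

```python
import re

# Hypothetical sample text, chosen only for illustration.
subject = "xx@@@cat### and cat#"

# Match the "@+" prefix normally, but capture only the part of interest.
pattern = re.compile(r"@+(cat#+)")

match = pattern.search(subject)
result = match.group(1) if match else None
print(result)  # cat###
```

The overall match still includes the @ characters; only group 1 carries the text a lookbehind would have isolated.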

Engine Internals: What Does the PCRE Code Say?

The ultimate answer is to be found in pcre_compile.c . If you examine the code block that starts with this comment:

If lookbehind, check that this branch matches a fixed-length string

You find that the grunt work is done by the find_fixedlength() function.

I reproduce it here for anyone who would like to dive into further details.

static int
find_fixedlength(pcre_uchar *code, BOOL utf, BOOL atend, compile_data *cd)
{
int length = -1;

register int branchlength = 0;
register pcre_uchar *cc = code + 1 + LINK_SIZE;

/* Scan along the opcodes for this branch. If we get to the end of the
branch, check the length against that of the other branches. */

for (;;)
  {
  int d;
  pcre_uchar *ce, *cs;
  register pcre_uchar op = *cc;

  switch (op)
    {
    /* We only need to continue for OP_CBRA (normal capturing bracket) and
    OP_BRA (normal non-capturing bracket) because the other variants of these
    opcodes are all concerned with unlimited repeated groups, which of course
    are not of fixed length. */

    case OP_CBRA:
    case OP_BRA:
    case OP_ONCE:
    case OP_ONCE_NC:
    case OP_COND:
    d = find_fixedlength(cc + ((op == OP_CBRA)? IMM2_SIZE : 0), utf, atend, cd);
    if (d < 0) return d;
    branchlength += d;
    do cc += GET(cc, 1); while (*cc == OP_ALT);
    cc += 1 + LINK_SIZE;
    break;

    /* Reached end of a branch; if it's a ket it is the end of a nested call.
    If it's ALT it is an alternation in a nested call. An ACCEPT is effectively
    an ALT. If it is END it's the end of the outer call. All can be handled by
    the same code. Note that we must not include the OP_KETRxxx opcodes here,
    because they all imply an unlimited repeat. */

    case OP_ALT:
    case OP_KET:
    case OP_END:
    case OP_ACCEPT:
    case OP_ASSERT_ACCEPT:
    if (length < 0) length = branchlength;
      else if (length != branchlength) return -1;
    if (*cc != OP_ALT) return length;
    cc += 1 + LINK_SIZE;
    branchlength = 0;
    break;

    /* A true recursion implies not fixed length, but a subroutine call may
    be OK. If the subroutine is a forward reference, we can't deal with
    it until the end of the pattern, so return -3. */

    case OP_RECURSE:
    if (!atend) return -3;
    cs = ce = (pcre_uchar *)cd->start_code + GET(cc, 1);  /* Start subpattern */
    do ce += GET(ce, 1); while (*ce == OP_ALT);           /* End subpattern */
    if (cc > cs && cc < ce) return -1;                    /* Recursion */
    d = find_fixedlength(cs + IMM2_SIZE, utf, atend, cd);
    if (d < 0) return d;
    branchlength += d;
    cc += 1 + LINK_SIZE;
    break;

    /* Skip over assertive subpatterns */

    case OP_ASSERT:
    case OP_ASSERT_NOT:
    case OP_ASSERTBACK:
    case OP_ASSERTBACK_NOT:
    do cc += GET(cc, 1); while (*cc == OP_ALT);
    cc += PRIV(OP_lengths)[*cc];
    break;

    /* Skip over things that don't match chars */

    case OP_MARK:
    case OP_PRUNE_ARG:
    case OP_SKIP_ARG:
    case OP_THEN_ARG:
    cc += cc[1] + PRIV(OP_lengths)[*cc];
    break;

    case OP_CALLOUT:
    case OP_CIRC:
    case OP_CIRCM:
    case OP_CLOSE:
    case OP_COMMIT:
    case OP_CREF:
    case OP_DEF:
    case OP_DNCREF:
    case OP_DNRREF:
    case OP_DOLL:
    case OP_DOLLM:
    case OP_EOD:
    case OP_EODN:
    case OP_FAIL:
    case OP_NOT_WORD_BOUNDARY:
    case OP_PRUNE:
    case OP_REVERSE:
    case OP_RREF:
    case OP_SET_SOM:
    case OP_SKIP:
    case OP_SOD:
    case OP_SOM:
    case OP_THEN:
    case OP_WORD_BOUNDARY:
    cc += PRIV(OP_lengths)[*cc];
    break;

    /* Handle literal characters */

    case OP_CHAR:
    case OP_CHARI:
    case OP_NOT:
    case OP_NOTI:
    branchlength++;
    cc += 2;
#ifdef SUPPORT_UTF
    if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
    break;

    /* Handle exact repetitions. The count is already in characters, but we
    need to skip over a multibyte character in UTF8 mode.  */

    case OP_EXACT:
    case OP_EXACTI:
    case OP_NOTEXACT:
    case OP_NOTEXACTI:
    branchlength += (int)GET2(cc,1);
    cc += 2 + IMM2_SIZE;
#ifdef SUPPORT_UTF
    if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
    break;

    case OP_TYPEEXACT:
    branchlength += GET2(cc,1);
    if (cc[1 + IMM2_SIZE] == OP_PROP || cc[1 + IMM2_SIZE] == OP_NOTPROP)
      cc += 2;
    cc += 1 + IMM2_SIZE + 1;
    break;

    /* Handle single-char matchers */

    case OP_PROP:
    case OP_NOTPROP:
    cc += 2;
    /* Fall through */

    case OP_HSPACE:
    case OP_VSPACE:
    case OP_NOT_HSPACE:
    case OP_NOT_VSPACE:
    case OP_NOT_DIGIT:
    case OP_DIGIT:
    case OP_NOT_WHITESPACE:
    case OP_WHITESPACE:
    case OP_NOT_WORDCHAR:
    case OP_WORDCHAR:
    case OP_ANY:
    case OP_ALLANY:
    branchlength++;
    cc++;
    break;

    /* The single-byte matcher isn't allowed. This only happens in UTF-8 mode;
    otherwise \C is coded as OP_ALLANY. */

    case OP_ANYBYTE:
    return -2;

    /* Check a class for variable quantification */

    case OP_CLASS:
    case OP_NCLASS:
#if defined SUPPORT_UTF || defined COMPILE_PCRE16 || defined COMPILE_PCRE32
    case OP_XCLASS:
    /* The original code caused an unsigned overflow in 64 bit systems,
    so now we use a conditional statement. */
    if (op == OP_XCLASS)
      cc += GET(cc, 1);
    else
      cc += PRIV(OP_lengths)[OP_CLASS];
#else
    cc += PRIV(OP_lengths)[OP_CLASS];
#endif

    switch (*cc)
      {
      case OP_CRSTAR:
      case OP_CRMINSTAR:
      case OP_CRPLUS:
      case OP_CRMINPLUS:
      case OP_CRQUERY:
      case OP_CRMINQUERY:
      case OP_CRPOSSTAR:
      case OP_CRPOSPLUS:
      case OP_CRPOSQUERY:
      return -1;

      case OP_CRRANGE:
      case OP_CRMINRANGE:
      case OP_CRPOSRANGE:
      if (GET2(cc,1) != GET2(cc,1+IMM2_SIZE)) return -1;
      branchlength += (int)GET2(cc,1);
      cc += 1 + 2 * IMM2_SIZE;
      break;

      default:
      branchlength++;
      }
    break;

    /* Anything else is variable length */

    case OP_ANYNL:
    case OP_BRAMINZERO:
    case OP_BRAPOS:
    case OP_BRAPOSZERO:
    case OP_BRAZERO:
    case OP_CBRAPOS:
    case OP_EXTUNI:
    case OP_KETRMAX:
    case OP_KETRMIN:
    case OP_KETRPOS:
    case OP_MINPLUS:
    case OP_MINPLUSI:
    case OP_MINQUERY:
    case OP_MINQUERYI:
    case OP_MINSTAR:
    case OP_MINSTARI:
    case OP_MINUPTO:
    case OP_MINUPTOI:
    case OP_NOTMINPLUS:
    case OP_NOTMINPLUSI:
    case OP_NOTMINQUERY:
    case OP_NOTMINQUERYI:
    case OP_NOTMINSTAR:
    case OP_NOTMINSTARI:
    case OP_NOTMINUPTO:
    case OP_NOTMINUPTOI:
    case OP_NOTPLUS:
    case OP_NOTPLUSI:
    case OP_NOTPOSPLUS:
    case OP_NOTPOSPLUSI:
    case OP_NOTPOSQUERY:
    case OP_NOTPOSQUERYI:
    case OP_NOTPOSSTAR:
    case OP_NOTPOSSTARI:
    case OP_NOTPOSUPTO:
    case OP_NOTPOSUPTOI:
    case OP_NOTQUERY:
    case OP_NOTQUERYI:
    case OP_NOTSTAR:
    case OP_NOTSTARI:
    case OP_NOTUPTO:
    case OP_NOTUPTOI:
    case OP_PLUS:
    case OP_PLUSI:
    case OP_POSPLUS:
    case OP_POSPLUSI:
    case OP_POSQUERY:
    case OP_POSQUERYI:
    case OP_POSSTAR:
    case OP_POSSTARI:
    case OP_POSUPTO:
    case OP_POSUPTOI:
    case OP_QUERY:
    case OP_QUERYI:
    case OP_REF:
    case OP_REFI:
    case OP_DNREF:
    case OP_DNREFI:
    case OP_SBRA:
    case OP_SBRAPOS:
    case OP_SCBRA:
    case OP_SCBRAPOS:
    case OP_SCOND:
    case OP_SKIPZERO:
    case OP_STAR:
    case OP_STARI:
    case OP_TYPEMINPLUS:
    case OP_TYPEMINQUERY:
    case OP_TYPEMINSTAR:
    case OP_TYPEMINUPTO:
    case OP_TYPEPLUS:
    case OP_TYPEPOSPLUS:
    case OP_TYPEPOSQUERY:
    case OP_TYPEPOSSTAR:
    case OP_TYPEPOSUPTO:
    case OP_TYPEQUERY:
    case OP_TYPESTAR:
    case OP_TYPEUPTO:
    case OP_UPTO:
    case OP_UPTOI:
    return -1;

    /* Catch unrecognized opcodes so that when new ones are added they
    are not forgotten, as has happened in the past. */

    default:
    return -4;
    }
  }
/* Control never gets here */
}
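The core idea of that function can be sketched in a few lines. The following is a heavily simplified illustration in Python, not a port of the PCRE code: the tuple-based pattern tree and the names fixed_length and lookbehind_widths are invented for this sketch, whereas PCRE walks compiled opcodes. It only shows the rule that every piece must have one known length, while top-level alternatives may each have their own.

```python
from typing import Optional

# Toy pattern tree (an assumption for illustration, not PCRE's bytecode):
#   ("lit", "abc")          literal text
#   ("seq", [node, ...])    concatenation
#   ("alt", [node, ...])    alternation nested inside a group
#   ("rep", node, lo, hi)   repetition; hi=None means unbounded

def fixed_length(node) -> Optional[int]:
    """Return the fixed match length of node, or None if it can vary."""
    kind = node[0]
    if kind == "lit":
        return len(node[1])
    if kind == "seq":
        total = 0
        for child in node[1]:
            n = fixed_length(child)
            if n is None:
                return None
            total += n
        return total
    if kind == "alt":
        # Nested alternatives must all agree on a single length.
        lengths = {fixed_length(child) for child in node[1]}
        if None in lengths or len(lengths) != 1:
            return None
        return lengths.pop()
    if kind == "rep":
        _, child, lo, hi = node
        n = fixed_length(child)
        # Only an exact count such as {3} keeps the length fixed.
        if n is None or hi is None or lo != hi:
            return None
        return n * lo
    raise ValueError(f"unknown node kind: {kind}")

def lookbehind_widths(top_level_branches):
    """Each top-level alternative may have its own fixed width."""
    widths = [fixed_length(branch) for branch in top_level_branches]
    return None if None in widths else widths

print(lookbehind_widths([("lit", "a "), ("lit", "big ")]))  # [2, 4]
print(lookbehind_widths([("rep", ("lit", "a"), 1, None)]))  # None
```

The first call corresponds to (?<=a |big ) and the second to (?<=a+) ; returning None plays the role of the negative return values in the C function.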

Regex engines are designed to work from left to right.

For lookaheads, the engine matches against the text to the right of the current position. However, for lookbehinds, the regex engine determines the length of string to step back and then checks for the match (again left to right).

So, if you provide unbounded quantifiers like * or + , lookbehind won't work because the engine does not know how many steps to go backward.
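The same asymmetry shows up in other engines that step back a fixed distance. A small sketch using Python's standard re module (a different engine from PCRE, used here only because it applies a similar fixed-width rule), with the string from the question:

```python
import re

text = "MaaabcccM"

# A quantifier inside a lookahead is fine: the engine just scans forward.
ahead = re.search(r"(?<=a)b(?=c*M)", text)
print(ahead.group(0))  # b

# A quantifier inside a lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=Ma*)b(?=c*M)")
    compiled = True
except re.error as exc:
    compiled = False
    print("compile error:", exc)
```

Note that the failure happens at compile time, before any input is examined, just like the PCRE error in the question.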

I'll give an example of how lookbehind works (the example is pretty silly though).

Suppose you want to match the last name Panta , only if the first name is 5-7 characters long.

Let's take the string:

Full name is Subigya Panta.

Consider the regex:

(?<=\b\w{5,7}\b)\sPanta

How the engine works

The engine acknowledges the existence of a positive lookbehind and so it first searches for the word Panta (with a whitespace character before it). It is a match.

Now, the engine tries to match the regex inside the lookbehind. It steps backward 7 characters (as the quantifier is greedy). The word boundary matches the position between the space and S . Then it matches all 7 characters, and then the next word boundary matches the position between a and the space.

The regex inside the lookbehind is a match and thus the whole regex returns true because the matched string contains Panta . (Note that lookaround assertions are zero-width, and do not consume any characters.)
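In an engine that insists on fixed widths, a range such as {5,7} inside a lookbehind is itself rejected, so the example above has to be spelled out one width at a time. A minimal sketch in Python's standard re module (again not PCRE; splitting into one lookbehind per width is a choice made for this sketch, since re also requires all alternatives inside a single lookbehind to have the same width):

```python
import re

text = "Full name is Subigya Panta."

# One fixed-width lookbehind per allowed first-name length (5, 6 or 7).
pattern = re.compile(
    r"(?:(?<=\b\w{5}\b)|(?<=\b\w{6}\b)|(?<=\b\w{7}\b))\sPanta"
)

match = pattern.search(text)
print(repr(match.group(0)))  # ' Panta'
```

Only the 7-character alternative succeeds here: stepping back 5 or 6 characters lands in the middle of Subigya , where the leading \b cannot match.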

The pcrepattern man page documents the restriction that lookbehind assertions must either be fixed-width or consist of several fixed-width patterns separated by | , and then explains that this is because:

The implementation of lookbehind assertions is, for each alternative, to temporarily move the current position back by the fixed length and then try to match. If there are insufficient characters before the current position, the assertion fails.
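The quoted procedure can be sketched as follows. This is an illustration of the described behaviour in Python, not PCRE's implementation; the function name lookbehind_matches and the (width, pattern) pairs are assumptions, and the widths are passed in by hand where a real engine would compute them at compile time.

```python
import re

def lookbehind_matches(text, pos, alternatives):
    """Return True if any fixed-width alternative matches ending at pos.

    alternatives is a list of (width, pattern) pairs supplied by the caller.
    """
    for width, pattern in alternatives:
        start = pos - width
        if start < 0:
            continue  # not enough characters before pos: alternative fails
        m = re.compile(pattern).match(text, start)
        if m is not None and m.end() == pos:
            return True
    return False

# Emulates (?<=a |big )cat at the position where "cat" starts.
alts = [(2, r"a "), (4, r"big ")]
text = "one big cat"
pos = text.index("cat")
print(lookbehind_matches(text, pos, alts))  # True
```

With an unbounded quantifier there would be no single width to subtract from pos , which is exactly the situation the restriction rules out.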

I'm not sure why they do it this way, but my guess is that they spent a lot of time writing a good backtracking RE-matching engine that runs forward, and they didn't want to duplicate all that effort to write another that runs backwards. The obvious approach would be to run over the string backwards (that's easy) while matching a "reverse" version of your lookbehind assertion. Reversing a "real" (DFA-matchable) RE is possible, since the reverse of a regular language is a regular language, but PCRE's "extended" REs are, IIRC, Turing complete, and it may not even be possible to flip one around to run backwards efficiently in general. And even if it were, probably no one has actually cared enough to bother. After all, lookbehind assertions are a pretty minor feature in the grand scheme of things.
