
Invalid pattern in look-behind

Why does this regex work in Python but not in Ruby:

/(?<!([0-1\b][0-9]|[2][0-3]))/

It would be great to hear an explanation, and also how to get around it in Ruby.

EDIT w/ the whole line of code:

re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)

Basically, I'm trying to add '\n' when there is a colon and it is not a time.

According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. This restriction is common among regex engines, but not all of them treat it as an error, which is why you see a difference between the re and Onigmo regex libraries.

Now, as for your regex, it does not work correctly in either Ruby or Python: \b inside a character class in a Python or Ruby regex matches a BACKSPACE ( \x08 ) char, not a word boundary. Moreover, when you put a word boundary after an optional non-word char (here, the trailing dot), then if that char is present in the string, a word char must appear immediately to its right for the boundary to match. The word boundary must be moved to right after m , before \.? .
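Both points can be checked quickly with Python's re module. The snippet below is only a minimal sketch; the variable names and the sample string are made up for illustration.

```python
import re

# Inside a character class, \b is the backspace character (\x08), not a word boundary.
backspace_class = re.compile(r'[0-1\b]')
assert backspace_class.fullmatch('\x08') is not None   # a literal backspace matches
assert backspace_class.fullmatch('1') is not None      # so do the digits 0 and 1
assert backspace_class.search('a b') is None           # no word-boundary behaviour at all

# Word boundary placed after the optional trailing dot vs. before it.
boundary_after_dot = re.compile(r'a\.?m\.?\b')
boundary_before_dot = re.compile(r'a\.?m\b\.?')

sample = '10:43 a.m. sharp'
after_match = boundary_after_dot.search(sample).group()    # the trailing dot is dropped
before_match = boundary_before_dot.search(sample).group()  # the trailing dot is kept
```

With the boundary after the dot, the engine has to give the dot back, because there is no word boundary between "." and a space.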

Another flaw in the current approach is that lookbehinds are not the best tool for excluding contexts like these. E.g., you can't account for a variable amount of whitespace between the time digits and am / pm . It is better to match and capture the contexts you do not want to touch, and to simply match those you want to modify. So, we need two main alternatives here, one matching am / pm in time strings and another matching them in all other contexts.

Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
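As a rough illustration of that merging, the sketch below compares the six spelled-out alternatives with a single character-class form. The names LONG_FORM, MERGED and spellings are assumptions made for this example. Note that the merged form is slightly more permissive: it also accepts "am." and "pm.".

```python
import re

# The six alternatives from the question, and one merged form.
LONG_FORM = re.compile(r'am|pm|a\.m|p\.m|a\.m\.|p\.m\.', re.IGNORECASE)
MERGED = re.compile(r'[ap]\.?m\.?', re.IGNORECASE)

spellings = ['am', 'pm', 'a.m', 'p.m', 'a.m.', 'p.m.', 'AM', 'P.M.']

# Every spelling accepted by the long form is also accepted by the merged form.
agree = all(
    bool(LONG_FORM.fullmatch(s)) and bool(MERGED.fullmatch(s))
    for s in spellings
)
```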

Regex demo

  • \b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) :
    • \b - word boundary
    • ((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
      • (?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit, or 2 and then a digit from 0 to 3
      • :[0-5][0-9] - : and then a number from 00 to 59
      • \s* - 0+ whitespaces
      • [pa]\.?m\b\.? - a or p , an optional dot, m , a word boundary, an optional dot
  • | - or
  • \b[ap]\.?m\b\.? - word boundary, a or p , an optional dot, m , a word boundary, an optional dot
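The same breakdown can be kept next to the pattern by using Python's re.VERBOSE flag. This is just a sketch of one way to lay it out; the name TIME_OR_MERIDIEM is invented here.

```python
import re

TIME_OR_MERIDIEM = re.compile(r"""
    \b
    (                             # group 1: a full time string, to be kept as-is
        (?:[01]?[0-9]|2[0-3])     # hour: optional 0/1 plus a digit, or 20-23
        :[0-5][0-9]               # colon and minutes 00-59
        \s*                       # 0+ whitespaces
        [pa]\.?m\b\.?             # a or p, optional dot, m, word boundary, optional dot
    )
    |                             # or
    \b[ap]\.?m\b\.?               # a stand-alone am / pm marker
""", re.VERBOSE | re.IGNORECASE)

text = 'am pm  P.M.  10:56pm 10:43 a.m.'
# Keep group 1 when it matched; otherwise replace the stand-alone marker.
result = TIME_OR_MERIDIEM.sub(lambda m: m.group(1) or "\n", text)
```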

Python fixed solution :

import re
text = 'am pm  P.M.  10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)

Ruby solution :

text = 'am pm  P.M.  10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }

Output:

"\n \n  \n  10:56pm 10:43 a.m."
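The same keep-or-replace technique could also be pointed at the colons mentioned in the question. The following is a sketch under assumptions, not the asker's or the answerer's code: the pattern, the name TIME_OR_COLON and the helper break_after_non_time_colons are all hypothetical, and the am/pm suffix is treated as optional.

```python
import re

# Group 1 captures a whole time string (kept as-is); the bare ":" alternative
# matches every other colon (replaced with ":\n").
TIME_OR_COLON = re.compile(
    r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9](?:\s*[ap]\.?m\b\.?)?)|:',
    re.IGNORECASE,
)

def break_after_non_time_colons(s: str) -> str:
    """Hypothetical helper: add a newline after each colon that is not part of a time."""
    return TIME_OR_COLON.sub(lambda m: m.group(1) if m.group(1) else ':\n', s)
```

Because the time alternative comes first and consumes its own colon, the second alternative only ever sees colons that are not inside a time.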

The Ruby regex engine doesn't allow capturing groups in negative look-behinds. If you need grouping, you can use a non-capturing group, (?:...) , instead:

[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/

Docs:

 (?<!subexp)        negative look-behind

                     Subexp of look-behind must be fixed-width.
                     But top-level alternatives can be of various lengths.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, capturing group isn't allowed,
                     but non-capturing group (?:) is allowed.

Learned from this answer .
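For comparison, here is a small sketch of how Python's re treats the same constructs. The helper colon_positions and the sample string are invented for illustration. Python accepts the capturing group inside the negative look-behind, but it is stricter than the Onigmo excerpt above about width: alternatives of different lengths are rejected.

```python
import re

# Python's re accepts a capturing group inside a negative look-behind ...
capturing = re.compile(r'(?<!([0-1\b][0-9]|[2][0-3])):')
# ... and the non-capturing form, which is the one Ruby also accepts.
non_capturing = re.compile(r'(?<!(?:[0-1\b][0-9]|[2][0-3])):')

def colon_positions(pattern, s):
    """Return the start index of every match of pattern in s."""
    return [m.start() for m in pattern.finditer(s)]

sample = 'Note: at 10:30 and 23:15'
same_positions = colon_positions(capturing, sample) == colon_positions(non_capturing, sample)

# Alternatives of different widths inside a look-behind raise re.error in Python.
try:
    re.compile(r'(?<=a|bc)')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
```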

@mrzasa has certainly found the problem.

But, taking a guess at your intent to replace each non-time colon with ':\n',
it could be done like this. It trims a little whitespace as well.

(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)

PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n

Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n

Readable version

 (?i)
 (?<!
      \b [01] [0-9] 
 )
 (?<!
      \b [2] [0-3] 
 )
 (                             # (1 start)
      [^\S\r\n]* 
      :
 )                             # (1 end)
 [^\S\r\n]* 
 (?!
      [0-5] [0-9] 
      (?: [ap] \.? m \b \.? )?
 )
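In Python, the pattern above might be wired up roughly like this. It is a sketch only: NON_TIME_COLON and split_after_colons are names made up here, and the replacement string follows the \1\n shown for the Python demo.

```python
import re

# The one-line pattern from above, split across two literals for readability.
NON_TIME_COLON = re.compile(
    r'(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*'
    r'(?![0-5][0-9](?:[ap]\.?m\b\.?)?)'
)

def split_after_colons(s: str) -> str:
    """Hypothetical helper: put a newline after each colon that is not part of a time."""
    # \1 keeps the colon (and any horizontal whitespace before it);
    # horizontal whitespace after the colon is consumed and dropped.
    return NON_TIME_COLON.sub(r'\1\n', s)
```

For example, split_after_colons('Note: call at 10:30pm') leaves the time untouched and turns "Note: " into "Note:" followed by a newline.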

