
Ruby gsub regex unexpected behaviour

I thought I knew regexes pretty well, but this has me puzzled:

irb(main):016:0> source = "/foo/bar"
=> "/foo/bar"
irb(main):017:0> source.gsub( /[^\/]*\Z/, "fubar" )
=> "/foo/fubarfubar"

As far as I can tell, /[^\/]*\Z/ has a unique expansion to match bar and therefore should result in /foo/fubar. I can't see at all why I get fubarfubar as the replacement.

The replacement works if I call sub rather than gsub, so this is not a question of working around the problem but of uncovering my misunderstanding of gsub.

You need to use sub, as you only need to replace once at the end of the string:

source.sub( /[^\/]*\Z/, "fubar" )
       ^^^

See the IDEONE demo

The problem is most probably with the way the matches are collected: since your pattern can match an empty string, the empty string left at the very end of the input is treated as a second match. It is not only a Ruby issue, a similar bug is present in many other languages.

So, actually, this is what is happening:

  • The [^\/]*\Z pattern matches bar and replaces it with fubar
  • The regex index is now at the end of the string. Only an empty (zero-length) string is left there, but Ruby still sees it as a valid "string" to process, and
  • [^\/]*\Z matches that empty string and adds another fubar.
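
The steps above can be observed with the block form of gsub, which is called once per match. This is only a minimal sketch: matches and result are illustrative names, and the start offsets are read from Regexp.last_match inside the block.

```ruby
source = "/foo/bar"
matches = []

# The block runs once for every match that gsub finds.
result = source.gsub(/[^\/]*\Z/) do |m|
  matches << [m, Regexp.last_match.begin(0)]  # matched text and its start offset
  "fubar"
end

p matches  # [["bar", 5], ["", 8]]
p result   # "/foo/fubarfubar"
```

The second entry is the zero-length match at offset 8, the end of the string, which is where the extra fubar comes from.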

If you need to use gsub, replace the * quantifier, which allows matching 0 characters, with +, which requires at least 1 occurrence of the quantified subpattern, so that 0-length strings are never matched:

source.gsub( /[^\/]+\Z/, "fubar" )
                   ^
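
As a quick side-by-side check, the three variants discussed so far can be run on the same input. The variable names here are illustrative only.

```ruby
source = "/foo/bar"

star_gsub = source.gsub(/[^\/]*\Z/, "fubar")  # * also matches the empty string at the end
plus_gsub = source.gsub(/[^\/]+\Z/, "fubar")  # + needs at least one character
star_sub  = source.sub(/[^\/]*\Z/, "fubar")   # sub stops after the first match

p star_gsub  # "/foo/fubarfubar"
p plus_gsub  # "/foo/fubar"
p star_sub   # "/foo/fubar"
```

Note that the last two are not interchangeable for every input: on a string with a trailing slash such as "/foo/bar/", the + version finds no match and leaves the string unchanged, while sub with * matches the empty string at the end and appends fubar.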

Rule of thumb: avoid regexps that can match empty strings inside regex replace methods!

I don't think this is a bug at all. Regular expressions can and will match zero-width positions.

Therefore, the regex engine sees the string "xox" more like this:

"" "x" "" "o" "" "x" ""

(fun fact: in Ruby, the above actually evaluates to "xox", because adjacent string literals are concatenated)

If we gsub a single x with a _, everything works as expected:

"xox".gsub(/x/, "_") #=> "_o_"

But if we match x*, things get weird:

"xox".gsub(/x*/, "_") #=> "__o__"

This is because * matches zero or more times:

"" "x" "" "o" "" "x" ""
^^^^^^ ^^     ^^^^^^ ^^

It may be clearer if we reduce "zero or more" to just zero:

"xox".gsub(/x{0}/, "_") #=> "_x_o_x_"

The matches are:

"" "x" "" "o" "" "x" ""
^^     ^^     ^^     ^^
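
The caret diagrams can be reproduced by collecting the [start, end] offsets of every match. A rough sketch, assuming a small helper that is not part of Ruby itself (match_spans is a made-up name), built on String#scan and Regexp.last_match:

```ruby
# Hypothetical helper: returns the [start, end] offsets of every match.
def match_spans(string, pattern)
  spans = []
  # scan with a block sets the last match data for each match it finds
  string.scan(pattern) { spans << Regexp.last_match.offset(0) }
  spans
end

p match_spans("xox", /x/)     # [[0, 1], [2, 3]]
p match_spans("xox", /x*/)    # [[0, 1], [1, 1], [2, 3], [3, 3]]
p match_spans("xox", /x{0}/)  # [[0, 0], [1, 1], [2, 2], [3, 3]]
```

A span whose start equals its end is a zero-width match, one of the "" slots in the diagrams.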

The same happens in your example. You match [^\/] zero or more times. The regex engine matches bar at the end of the string ([^\/] 3 times) and the void afterwards ([^\/] 0 times):

"/" "" "b" "" "a" "" "r" ""
    ^^^^^^^^^^^^^^^^^^^^ ^^
