Ruby regex multiple repeating captures

I'm trying to parse part of a web page with a regex, just for fun. It was fun until I ran into the following problem. I have a paragraph like this:

foo: 1, 2, 3, 4 and 5.
bar: 1, 2 and 3.

What I'm trying to do is get the numbers in the first line of the paragraph, the one starting with foo:, by applying the following regex:

foo:(?:\s(\d)(?:,|\sand|\.))+

This matches the string above, but it captures only the last occurrence of the capture group, which is 5.
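A minimal Ruby sketch that reproduces the behaviour described above; the variable names are only illustrative:

```ruby
text = "foo: 1, 2, 3, 4 and 5.\nbar: 1, 2 and 3."
pattern = /foo:(?:\s(\d)(?:,|\sand|\.))+/

m = text.match(pattern)
whole = m[0]  # the entire matched first line
last  = m[1]  # the group is overwritten on every repetition, so only "5" is left
```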

How can I capture all the numbers in the paragraph starting with foo:, up to the first occurrence of ., using a single regex pattern?

In most regex engines, a repeated capturing group does not store each iteration separately; only the last iteration is kept, so you can't refer to the earlier ones individually. This is a good reason to use the \G anchor. \G makes a match start where the previous match ended; on the first attempt it matches at the beginning of the string, just like \A.

Here we need the first of those two behaviours:

(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?

Breakdown:

  • (?: Start a non-capturing group
    • foo: Match foo:
    • | Or
    • \G(?!\A) Continue from where the previous match ended (but not at the start of the string)
  • ) End of the non-capturing group
  • \s* Any number of whitespace characters
  • (\d+) Match and capture one or more digits
  • \s* Any number of whitespace characters
  • (?:,|and)? An optional , or and

This regex begins matching when it meets foo: in the input string. It then looks for a run of digits optionally followed by a comma or and (whitespace is allowed around the digits). Each subsequent match must start exactly where the previous one ended, so matching stops at the first character that fits neither case, such as the period.
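As a sketch of how the pattern above might be used in Ruby, String#scan can apply it repeatedly and collect the single capture group from each match. The sample strings come from the question and from the other answer; the variable names are just for illustration:

```ruby
pattern = /(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?/

text = "foo: 1, 2, 3, 4 and 5.\nbar: 1, 2 and 3."
# scan returns one array per match (one element per group), hence flatten
numbers = text.scan(pattern).flatten
#=> ["1", "2", "3", "4", "5"]

# The chain breaks at the first period, so "6" is not picked up
other = "foo: 1, 2, 34, 4 and 5. and 6.".scan(pattern).flatten
#=> ["1", "2", "34", "4", "5"]
```

The numbers on the bar: line are skipped because no match there can begin with foo: or continue directly from a previous match.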

The \K token resets the start of the match. It tells the engine to discard whatever has been matched so far (while keeping anything already captured) and to report the match as starting at the current position.

I used \K in the Rubular regex so that the result set would contain only the captured digits rather than the whole matched strings. However, Rubular seems to work differently and didn't need \K. It isn't required at all.
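To illustrate what \K does, here is a hedged variant of the pattern. It is my own rearrangement, not the exact regex used on Rubular: the separator is consumed before the digits so that \K can drop everything except the number, and no capture group is needed.

```ruby
text = "foo: 1, 2, 3, 4 and 5.\nbar: 1, 2 and 3."

# Assumed rearrangement: prefix and separators come first, then \K discards
# them, so each reported match is just the run of digits.
pattern = /(?:foo:|\G(?!\A))\s*(?:,|and)?\s*\K\d+/

numbers = text.scan(pattern)
#=> ["1", "2", "3", "4", "5"]
```

Without a capture group, scan returns the matched text itself, which after \K is only the digits.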

This answer uses just one regex but admittedly does a bit of pre- and post-processing. (Please allow me a bit of fun. I do think there may be some instructional value here.)

str = "foo: 1, 2, 34, 4 and 5. and 6."

r = /
    \d+             # match one or more digits
    (?=[^.]+:oof\z) # positive lookahead: one or more characters other than a
                    # period, followed by ":oof" at the end of the string
    /x              # free-spacing regex definition mode

str.reverse.scan(r).join(' ').reverse.split
  #=> ["1", "2", "34", "4", "5"]

The steps are as follows.

s = str.reverse
  #=> ".6 dna .5 dna 4 ,43 ,2 ,1 :oof"
a  = s.scan r
  #=> ["5", "4", "43", "2", "1"]
b  = a.join(' ')
  #=> "5 4 43 2 1"
c  = b.reverse
  #=> "1 2 34 4 5"
c.split
  #=> ["1", "2", "34", "4", "5"]

An empty array is returned if there is no match.
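A small sketch of both cases, wrapping the same chain of calls in a lambda. The name extract is only an assumption for illustration, and the regex is the one above written on a single line:

```ruby
r = /\d+(?=[^.]+:oof\z)/

# Hypothetical helper: reverse, scan, then undo the reversal
extract = ->(s) { s.reverse.scan(r).join(' ').reverse.split }

with_foo    = extract.call("foo: 1, 2, 34, 4 and 5. and 6.")
#=> ["1", "2", "34", "4", "5"]
without_foo = extract.call("bar: 1, 2 and 3.")
#=> []
```

When the string does not start with foo:, the reversed string does not end with ":oof", so the lookahead never succeeds and scan finds nothing.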

So, why all the reversing? It allows me to use a positive lookahead, which, unlike a positive lookbehind in Ruby, permits variable-length patterns.
