Ruby regex multiple repeating captures
I'm trying to parse a subset of a webpage with regex, just for fun. It was fun until I ran into the following problem. I have a paragraph like the one below:
foo: 1, 2, 3, 4 and 5.
bar: 1, 2 and 3.
What I am trying to do is get the numbers in the first line of the paragraph, the one starting with foo:, by applying the following regex:
foo:(?:\s(\d)(?:,|\sand|\.))+
This matches the above string, but it captures only the last occurrence of the capture group, which is 5.
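To make the symptom concrete, here is a minimal Ruby sketch of the behaviour described above. The variable names are illustrative only and are not part of the original question.

```ruby
line = "foo: 1, 2, 3, 4 and 5."
pattern = /foo:(?:\s(\d)(?:,|\sand|\.))+/

m = line.match(pattern)
whole_match  = m[0]        # the entire matched text, "foo: 1, 2, 3, 4 and 5."
last_capture = m[1]        # only the final repetition of the group survives: "5"
all_captures = m.captures  # a single-element array, ["5"]
```

Each repetition of the group overwrites the previous capture, so only the last digit is available once the match completes.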
How can I capture all the numbers in the paragraph starting with foo: up to the first occurrence of . using a single regex pattern?
The data from a repeated capturing group isn't stored separately in most programming languages, so you can't refer to each repetition individually. This is a valid reason to use the \G anchor. \G causes a match to start where the previous match ended, or it matches the beginning of the string, the same as \A.
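As a small illustration of the two behaviours of \G, the following Ruby sketch compares an anchored and an unanchored scan. The sample string and variable names are made up for this example.

```ruby
text = "aaa baa"

# With \G, each match must begin exactly where the previous one ended
# (or at the start of the string for the first attempt).
anchored = text.scan(/\Ga/)    #=> ["a", "a", "a"]

# Without \G, the engine is free to skip ahead past the space and the "b".
unanchored = text.scan(/a/)    #=> ["a", "a", "a", "a", "a"]
```

The anchored scan stops at the space because no match can start at the position where the previous one ended.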
So we need its first capability:
(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?
Breakdown:
(?:           Start a non-capturing group
foo:          Match foo:
|             Or
\G(?!\A)      Continue matching from where the previous match ended
)             End of the non-capturing group
\s*           Any number of whitespace characters
(\d+)         Match and capture digits
\s*           Any number of whitespace characters
(?:,|and)?    Optional , or and
This regex begins a match when it meets foo: in the input string. It then tries to find a following run of digits, optionally followed by a comma or and (whitespace is allowed around the digits).
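A minimal sketch of how this pattern might be applied in Ruby with String#scan, using the paragraph from the question. The variable names are assumptions made for illustration.

```ruby
paragraph = "foo: 1, 2, 3, 4 and 5.\nbar: 1, 2 and 3."
pattern = /(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?/

# scan returns one array per match when the pattern has a capture group,
# so flatten turns [["1"], ["2"], ...] into a flat list of strings.
numbers = paragraph.scan(pattern).flatten
#=> ["1", "2", "3", "4", "5"]
```

The chain of matches breaks at the first period, because \G can only continue from the end of the previous match and the bar: line never restarts it.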
The \K token resets the match: it signals the engine to forget whatever has been matched so far (while keeping whatever has been captured) and leaves the cursor at that position. I used \K in the Rubular regex so that the result set would contain only the captured digits, not the matched strings. However, Rubular seems to work differently and didn't need \K. It's not a must at all.
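For readers unfamiliar with \K, here is a small Ruby sketch of its effect, assuming Ruby 2.0 or later. The sample strings and variable names are made up for this example and are not taken from the Rubular link.

```ruby
# Everything matched before \K is dropped from the reported match.
single = "foo: 42"[/foo:\s*\K\d+/]           #=> "42"

# With scan and no capture group, only the text after \K is returned.
with_reset = "a1 b2 a3".scan(/a\K\d/)        #=> ["1", "3"]

# The capture-group equivalent returns nested arrays instead.
with_group = "a1 b2 a3".scan(/a(\d)/)        #=> [["1"], ["3"]]
```

Because scan already returns only the captures when a group is present, \K is optional there, which is consistent with the remark above that it is not required.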
This answer uses just one regex, but admittedly does a bit of pre- and post-processing. (Please allow me a bit of fun. I do think there may be some instructional value here.)
str = "foo: 1, 2, 34, 4 and 5. and 6."
r = /
    \d+              # match one or more digits
    (?=[^.]+:oof\z)  # match one or more characters other than a period, followed
                     # by ":oof" at the end of the string, in a positive lookahead
    /x               # free-spacing regex definition mode
str.reverse.scan(r).join(' ').reverse.split
#=> ["1", "2", "34", "4", "5"]
The steps are as follows.
s = str.reverse
#=> ".6 dna .5 dna 4 ,43 ,2 ,1 :oof"
a = s.scan r
#=> ["5", "4", "43", "2", "1"]
b = a.join(' ')
#=> "5 4 43 2 1"
c = b.reverse
#=> "1 2 34 4 5"
c.split
#=> ["1", "2", "34", "4", "5"]
An empty array is returned if there is no match.
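As a sketch, the steps above could be wrapped in a small helper to check both the matching and the non-matching case. The method name foo_numbers is hypothetical and does not appear in the original answer.

```ruby
# Hypothetical helper: reverse the string, scan with the lookahead,
# then undo the reversal on the joined digits.
def foo_numbers(str)
  str.reverse.scan(/\d+(?=[^.]+:oof\z)/).join(' ').reverse.split
end

matched   = foo_numbers("foo: 1, 2, 34, 4 and 5. and 6.")
#=> ["1", "2", "34", "4", "5"]
unmatched = foo_numbers("bar: 1, 2 and 3.")
#=> []
```

In the non-matching case scan returns an empty array, join yields an empty string, and split on an empty string gives an empty array again.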
So, why all the reversing? It's to allow me to use a positive lookahead, which, unlike a positive lookbehind, permits variable-length matches.