Ruby regex: capturing the same capture group multiple times

Question

I've got these strings in our authentication system, and I'm trying to develop the right regex to capture specific information from them.

STRING = CR*reduced*downsized*U*reduced*D*own_only

Now I need to be able to extract the capital letter (one of C, R, U, D) in the first capture group, plus the immediately following 'attributes' delimited by asterisks (e.g. downsized). That works fine for most cases with the following regex:

(C)\*?([a-z_]+)?\*?    --> Capture Group 1: "C", Capture Group2: empty
(U)\*?([a-z_]+)?\*?    --> Capture Group 1: "U", Capture Group2: 'reduced'
(D)\*?([a-z_]+)?\*?    --> Capture Group 1: "D", Capture Group2: 'own_only'

For R, I would need both attributes returned, so Capture Group 2 should be 'reduced' and Capture Group 3 'downsized'. But with the same regex, I only get the following result:

(R)\*?([a-z_]+)?\*?    --> Capture Group 1: "R", Capture Group2: 'reduced'

Any recommendations regarding the regex?

Answer 1

Since this is a scenario involving repeated capturing groups, you can use a multi-step solution like this:

text = 'CR*reduced*downsized*U*reduced*D*own_only'
rx = /([CRUD])((?:\*[a-z_]+(?:\*[a-z_]+)*(?:\*|$))?)/
matches = text.scan(rx)
p matches.map { |x| [x[0], x[1].split("*").reject(&:empty?)] }
# => [["C", []], ["R", ["reduced", "downsized"]], ["U", ["reduced"]], ["D", ["own_only"]]]

See the Ruby demo and the regex demo.

Details:

([CRUD]) - Group 1: one of the 4 letters
((?:\*[a-z_]+(?:\*[a-z_]+)*(?:\*|$))?) - Group 2: an optional sequence of
- \* - a * char
- [a-z_]+ - one or more ASCII lowercase letters or underscores
- (?:\*[a-z_]+)* - zero or more sequences of * followed by one or more ASCII lowercase letters or underscores
- (?:\*|$) - * or end of a line (use \z to match the end of the whole string).
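The difference between $ and \z only shows up when the input contains a newline. Here is a small sketch of that, assuming a made-up two-line input; the names rx_line and rx_string and the sample text are illustrative and not part of the question:

```ruby
# Same pattern as above, ending the attribute run at end of line ($) ...
rx_line   = /([CRUD])((?:\*[a-z_]+(?:\*[a-z_]+)*(?:\*|$))?)/
# ... versus only at the end of the whole string (\z).
rx_string = /([CRUD])((?:\*[a-z_]+(?:\*[a-z_]+)*(?:\*|\z))?)/

# Hypothetical input with a newline in the middle.
text = "D*own_only\nC"

line_result   = text.scan(rx_line)
string_result = text.scan(rx_string)

p line_result    # => [["D", "*own_only"], ["C", ""]]
p string_result  # => [["D", ""], ["C", ""]]
```

With \z, the attribute after D is not accepted because it is followed by a newline rather than * or the end of the string, so group 2 falls back to the empty match.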

With .map { |x| [x[0], x[1].split("*").reject(&:empty?)] }, you can split the second group value on * and remove empty items.
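To illustrate why the extra split step is used, here is a minimal sketch comparing a quantified capture group with the capture-then-split approach. The variable names are only for illustration:

```ruby
# A capture group inside a quantifier is overwritten on each iteration,
# so only the last attribute survives in group 2.
m = "R*reduced*downsized*".match(/(R)(?:\*([a-z_]+))+/)
last_only = m[2]

# Capturing the whole run and splitting afterwards keeps every attribute.
raw   = "*reduced*downsized*"
parts = raw.split("*")          # leading "" is kept, trailing "" is dropped
attrs = parts.reject(&:empty?)  # remove the leading empty string

p last_only  # => "downsized"
p parts      # => ["", "reduced", "downsized"]
p attrs      # => ["reduced", "downsized"]
```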

Answer 2

One way to extract the desired information is as follows.

def breakup(str)
  str.scan(/[A-Z][a-z_*]*/).map { |s| [s[0], s.scan(/[a-z_]+/)]}
end

The respective regular expressions read, "match an uppercase letter followed by zero or more (the final '*') lowercase letters, underscores and asterisks" and "match one or more (the '+') lowercase letters and underscores".

str1 = "CR*reduced*downsized*U*reduced*D*own_only"
breakup(str1)
  #=> [["C", []], ["R", ["reduced", "downsized"]], ["U", ["reduced"]],
  #    ["D", ["own_only"]]]

str2 = "CR*reduced*downsized*U*reduced*DE"    
breakup(str2)
  #=> [["C", []], ["R", ["reduced", "downsized"]], ["U", ["reduced"]],
  #    ["D", []], ["E", []]]

Note that

str1.scan(/[A-Z][a-z_*]*/)
  #=> ["C", "R*reduced*downsized*", "U*reduced*", "D*own_only"]
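If lookups by letter are more convenient than a list of pairs, the result can be turned into a hash. This is only a sketch, and it assumes each capital letter occurs at most once in the string (a repeated letter would overwrite the earlier entry); the variable name permissions is made up for the example:

```ruby
def breakup(str)
  # First pass: one chunk per capital letter; second pass: its attributes.
  str.scan(/[A-Z][a-z_*]*/).map { |s| [s[0], s.scan(/[a-z_]+/)] }
end

str1 = "CR*reduced*downsized*U*reduced*D*own_only"

# Array of [letter, attributes] pairs -> Hash keyed by letter.
permissions = breakup(str1).to_h

p permissions.keys        # => ["C", "R", "U", "D"]
p permissions.fetch("R")  # => ["reduced", "downsized"]
p permissions.fetch("C")  # => []
```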

Should it be necessary to test if the string has a valid construction, one could attempt to match it against the following regular expression (which I've expressed in free-spacing mode to make it self-documenting).

RGX =
  /
  \A              # match beginning of the string
  (?:             # begin a non-capture group
    [A-Z]         # match one upcase letter
    (?:           # begin a non-capture group
      (?:         # begin a non-capture group
        \*        # match '*'
        [a-z_]+   # match one or more of the characters indicated
      )+          # end non-capture group and execute one or more times 
      (?:         # begin a non-capture group
        \*        # match '*'
        (?=[A-Z]) # pos lookahead asserts next char is an upcase letter
        |         # or     
        \z        # at end of string
      )           # end non-capture group
    )?            # end non-capture group and optionally match it
  )+              # end non-capture group and execute it one or more times
  \z              # match end of string
  /x              # invoke free-spacing regex definition mode

str3 = "CR*reduced*downsizedU*reduced*D*own_only"
str4 = "CR*reduced*downsized*U*reduced*D*own_only*"
str5 = "*CR*reduced*downsized*U*reduced*D*own_only"

[str1, str2, str3, str4, str5].map { |s| s.match?(RGX) }
  #=> [true, true, false, false, false]
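Validation and extraction can be combined so that malformed strings are rejected before they are split up. The following is a rough sketch under that assumption. The helper names valid_permission_string? and parse_permissions are hypothetical, and the pattern is the same validation regex written on one line:

```ruby
def breakup(str)
  str.scan(/[A-Z][a-z_*]*/).map { |s| [s[0], s.scan(/[a-z_]+/)] }
end

# Hypothetical helper: compact, single-line form of the validation regex.
def valid_permission_string?(str)
  str.match?(/\A(?:[A-Z](?:(?:\*[a-z_]+)+(?:\*(?=[A-Z])|\z))?)+\z/)
end

# Hypothetical helper: validate first, then extract.
def parse_permissions(str)
  unless valid_permission_string?(str)
    raise ArgumentError, "malformed permission string: #{str.inspect}"
  end
  breakup(str)
end

good = parse_permissions("CR*reduced*downsized*U*reduced*D*own_only")
p good
# => [["C", []], ["R", ["reduced", "downsized"]], ["U", ["reduced"]], ["D", ["own_only"]]]

rejected = begin
  parse_permissions("*CR*reduced")
  false
rescue ArgumentError
  true
end
p rejected  # => true
```

Raising an exception is one possible design choice here; returning nil for invalid input would work as well if the caller prefers to check the result.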
