
Ruby Regex: Capture the same capture groups multiple times

I've got these strings in our authentication system, and I'm trying to develop the right REGEX to capture specific information from them.

STRING = CR*reduced*downsized*U*reduced*D*own_only

Now I need to extract each capital letter (one of C, R, U, D) into the first capture group, plus the 'attributes' that immediately follow it, delimited by asterisks (e.g. downsized). That works fine for most cases with the following regex:

(C)\*?([a-z_]+)?\*?    --> Capture Group 1: "C", Capture Group2: empty
(U)\*?([a-z_]+)?\*?    --> Capture Group 1: "U", Capture Group2: 'reduced'
(D)\*?([a-z_]+)?\*?    --> Capture Group 1: "D", Capture Group2: 'own_only'

For R, I would need both attributes returned, hence Capture Group2 should be 'reduced' and Capture Group3 'downsized'. But with the same REGEX, I only get the following result

(R)\*?([a-z_]+)?\*?    --> Capture Group 1: "R", Capture Group2: 'reduced'

Any recommendation regarding Regex?

Since this is a scenario involving repeated capturing groups, you can use a multi-step solution like

text = 'CR*reduced*downsized*U*reduced*D*own_only'
rx = /([CRUD])((?:\*[a-z_]+(?:\*[a-z_]+)*(?:\*|$))?)/
matches = text.scan(rx)
p matches.map { |x| [x[0], x[1].split("*").reject(&:empty?)] }
# => [["C", []], ["R", ["reduced", "downsized"]], ["U", ["reduced"]], ["D", ["own_only"]]]

See the Ruby demo and the regex demo.

Details:

  • ([CRUD]) - Group 1: one of the 4 letters
  • ((?:\*[a-z_]+(?:\*[a-z_]+)*(?:\*|$))?) - Group 2: an optional sequence of
    • \* - a * char
    • [a-z_]+ - one or more ASCII lowercase letters or underscores
    • (?:\*[a-z_]+)* - zero or more sequences of * and one or more ASCII lowercase letters or underscores
    • (?:\*|$) - * or end of a line (use \z to match end of whole string).
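The difference between $ and \z only shows up when the input contains a newline. A minimal sketch, using a hypothetical two-line string that is not part of the question:

```ruby
# Hypothetical input: a valid-looking entry followed by a second line.
multi = "D*own_only\nextra"

# In Ruby, $ matches at the end of every line, so it succeeds before "\n".
p multi.match?(/own_only$/)
# => true

# \z matches only at the very end of the string, so it fails here.
p multi.match?(/own_only\z/)
# => false
```

If the strings are guaranteed to be single-line, both anchors behave the same.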

With .map { |x| [x[0], x[1].split("*").reject(&:empty?)] }, you split the second group value on * and remove the empty items.
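To see why a single pattern cannot return both attributes for R, it may help to look at what Ruby does with a quantified capture group: each repetition overwrites the previous capture, so only the last one survives. The sketch below illustrates this with a hypothetical sample string and pattern (neither is from the question), then prints the raw scan output that the split step works on.

```ruby
# A quantified capture group keeps only its last iteration.
sample = 'R*reduced*downsized*'              # hypothetical sample input
m = sample.match(/(R)(?:\*([a-z_]+))+/)      # hypothetical pattern
p m.captures
# => ["R", "downsized"]   ("reduced" was overwritten)

# Capturing the whole run of attributes as one group keeps everything;
# the individual attributes are recovered afterwards with split.
text = 'CR*reduced*downsized*U*reduced*D*own_only'
rx = /([CRUD])((?:\*[a-z_]+(?:\*[a-z_]+)*(?:\*|$))?)/
raw = text.scan(rx)
p raw
# => [["C", ""], ["R", "*reduced*downsized*"], ["U", "*reduced*"], ["D", "*own_only"]]
```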

One way to extract the desired information is as follows.

def breakup(str)
  str.scan(/[A-Z][a-z_*]*/).map { |s| [s[0], s.scan(/[a-z_]+/)]}
end

The respective regular expressions read, "match an uppercase letter followed by zero or more (the final '*') lowercase letters, underscores and asterisks" and "match one or more (the '+') lowercase letters and underscores".


str1 = "CR*reduced*downsized*U*reduced*D*own_only"
breakup(str1)
  #=> [["C", []], ["R", ["reduced", "downsized"]], ["U", ["reduced"]],
  #    ["D", ["own_only"]]]
str2 = "CR*reduced*downsized*U*reduced*DE"    
breakup(str2)
  #=> [["C", []], ["R", ["reduced", "downsized"]], ["U", ["reduced"]],
  #    ["D", []], ["E", []]]

Note that

str1.scan(/[A-Z][a-z_*]*/)
  #=> ["C", "R*reduced*downsized*", "U*reduced*", "D*own_only"]
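If the attributes need to be looked up by letter, the pairs returned by breakup can be converted to a Hash. This is an optional extension rather than part of the approach above, and it assumes each letter occurs at most once in the string (a repeated letter would overwrite the earlier entry).

```ruby
def breakup(str)
  str.scan(/[A-Z][a-z_*]*/).map { |s| [s[0], s.scan(/[a-z_]+/)] }
end

# Array of [letter, attributes] pairs -> Hash keyed by letter.
perms = breakup("CR*reduced*downsized*U*reduced*D*own_only").to_h

p perms["R"]
# => ["reduced", "downsized"]
p perms["C"]
# => []
p perms.key?("X")
# => false
```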

Should it be necessary to test whether the string is validly constructed, one could attempt to match it against the following regular expression (which I've expressed in free-spacing mode to make it self-documenting).

RGX =
  /
  \A              # match beginning of the string
  (?:             # begin a non-capture group
    [A-Z]         # match one upcase letter
    (?:           # begin a non-capture group
      (?:         # begin a non-capture group
        \*        # match '*'
        [a-z_]+   # match one or more of the characters indicated
      )+          # end non-capture group and execute one or more times 
      (?:         # begin a non-capture group
        \*        # match '*'
        (?=[A-Z]) # pos lookahead asserts next char is an upcase letter
        |         # or     
        \z        # at end of string
      )           # end non-capture group
    )?            # end non-capture group and optionally match it
  )+              # end non-capture group and execute it one or more times
  \z              # match end of string
  /x              # invoke free-spacing regex definition mode

str3 = "CR*reduced*downsizedU*reduced*D*own_only"
str4 = "CR*reduced*downsized*U*reduced*D*own_only*"
str5 = "*CR*reduced*downsized*U*reduced*D*own_only"

[str1, str2, str3, str4, str5].map { |s| s.match?(RGX) }
  #=> [true, true, false, false, false]
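The validation and extraction steps can be combined so that malformed strings are rejected before parsing. The following is a minimal sketch: VALID_RGX is the same pattern as above written on one line, and parse_permissions is a hypothetical helper name, not something from the original answer.

```ruby
# Same logic as the free-spacing pattern above, written compactly.
VALID_RGX = /\A(?:[A-Z](?:(?:\*[a-z_]+)+(?:\*(?=[A-Z])|\z))?)+\z/

# Hypothetical helper: returns the [letter, attributes] pairs,
# or nil when the string is not validly constructed.
def parse_permissions(str)
  return nil unless str.match?(VALID_RGX)
  str.scan(/[A-Z][a-z_*]*/).map { |s| [s[0], s.scan(/[a-z_]+/)] }
end

p parse_permissions("CR*reduced*downsized*U*reduced*D*own_only")
# => [["C", []], ["R", ["reduced", "downsized"]], ["U", ["reduced"]], ["D", ["own_only"]]]
p parse_permissions("CR*reduced*downsizedU*reduced*D*own_only")
# => nil
```

Returning nil keeps the helper easy to chain; raising an ArgumentError instead would be a reasonable alternative if malformed input should never be silently ignored.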
