Regex for getting multiple words after a delimiter

Question

I have been trying to split the string below into separate groups using a PCRE regex:

drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)

The groups I am trying to get are:

1. drop = blah blah blah something
2. keep = bar foo nlah aaaa
3. rename = (a=b d=e)
4. obs=4
5. where = (foo > 45 and bar == 35)

I have written a regex using recursion, but the recursion only partially works: after drop it selects just the first three words (blah blah blah) and not the fourth. I have looked through various Stack Overflow questions and also tried a positive lookahead, but this is the closest I could get, and I am stuck because I cannot see what I am doing wrong.

The same can be seen here: RegEx Demo.

Any help on this or understanding what I am doing wrong is appreciated.

Answer 1

You could use the newer regex module (a third-party Python package) with a DEFINE block:

(?(DEFINE)
    (?<key>\w+)
    (?<sep>\s*=\s*)
    (?<value>(?:(?!(?&key)(?&sep))[^()=])+)
    (?<par>\((?:[^()]+|(?&par))+\))
)
(?P<k>(?&key))(?&sep)(?P<v>(?:(?&value)|(?&par)))

See a demo on regex101.com.

In Python this could be:

import regex as re

data = """
drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)
"""

rx = re.compile(r'''
(?(DEFINE)
    (?<key>\w+)
    (?<sep>\s*=\s*)
    (?<value>(?:(?!(?&key)(?&sep))[^()=])+)
    (?<par>\((?:[^()]+|(?&par))+\))
)

(?P<k>(?&key))(?&sep)(?P<v>(?:(?&value)|(?&par)))''', re.X)

result = {m.group('k').strip(): m.group('v').strip()
          for m in rx.finditer(data)}

print(result)

This yields:

{'drop': 'blah blah blah something', 'keep': 'bar foo nlah aaaa', 'rename': '(a=b d=e)', 'obs': '4', 'where': '(foo > 45 and bar == 35)'}
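
If installing the third-party regex module is not an option, a similar split can be approximated with the standard library. The sketch below is not the pattern above: it swaps the recursive (?&par) group for a hand-written depth counter, and the names parse_options, KEY and NEXT_KEY are made up for illustration. It assumes every unparenthesised value ends right before the next "word =".

```python
import re

# Hypothetical helpers, not part of the answer above.
KEY = re.compile(r"\s*(\w+)\s*=\s*")   # a key and its separator
NEXT_KEY = re.compile(r"\s+\w+\s*=")   # where the next key starts


def parse_options(text):
    """Split 'key = value' pairs; parenthesised values may nest."""
    text = text.strip()
    result = {}
    pos = 0
    while pos < len(text):
        m = KEY.match(text, pos)
        if m is None:
            raise ValueError(f"expected 'key =' at offset {pos}")
        key, pos = m.group(1), m.end()
        if pos < len(text) and text[pos] == "(":
            # Walk forward, counting parentheses until they balance.
            depth = 0
            for end in range(pos, len(text)):
                if text[end] == "(":
                    depth += 1
                elif text[end] == ")":
                    depth -= 1
                    if depth == 0:
                        break
            else:
                raise ValueError("unbalanced parentheses")
            value, pos = text[pos:end + 1], end + 1
        else:
            # A plain value runs up to the next key, or to the end.
            nxt = NEXT_KEY.search(text, pos)
            end = nxt.start() if nxt else len(text)
            value, pos = text[pos:end], end
        result[key] = value.strip()
    return result


data = ("drop = blah blah blah something keep = bar foo nlah aaaa "
        "rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)")
result = parse_options(data)
nested = parse_options("where = (a and (b or c)) obs=4")
print(result)
print(nested)
```

The second call shows the one case where the depth counter matters: a parenthesised value that itself contains parentheses.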

Answer 2

You can use a branch reset group solution:

(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))

See the PCRE regex demo.

Details

(?i) - case-insensitive mode on
\b - a word boundary
(drop|keep|where|rename|obs) - Group 1: any of the listed keywords
\s*=\s* - a = char enclosed with 0+ whitespace chars
(?| - start of a branch reset group:
  - (\w+(?:\s+\w+)*) - Group 2: one or more word chars followed by zero or more repetitions of one or more whitespace chars and one or more word chars
  - (?=\s+\w+\s+=|$) - a positive lookahead requiring that what follows is one or more whitespace chars, one or more word chars, one or more whitespace chars and = , or the end of the string
  - | - or
  - \((.*?)\) - ( , then Group 2 capturing any zero or more chars other than line break chars, as few as possible, and then )
) - end of the branch reset group.
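
The role of the lookahead is easiest to see in isolation. The sketch below uses the standard-library re module (this fragment needs no branch reset) on a shortened input; the variable names are chosen for illustration only. Without the lookahead the greedy word group swallows the next keyword, and with it the engine backtracks until a "word =" follows.

```python
import re

text = "drop = blah blah blah something keep = bar foo"

# One or more words separated by whitespace, matched greedily.
words = r"\w+(?:\s+\w+)*"

# No lookahead: the group also consumes the next key, "keep".
without_lookahead = re.match(r"drop\s*=\s*(" + words + r")", text)

# Lookahead: the group must be followed by "<word> =" or end of string,
# so the engine gives back " keep" and stops after "something".
with_lookahead = re.match(
    r"drop\s*=\s*(" + words + r")(?=\s+\w+\s+=|$)", text
)

print(without_lookahead.group(1))  # blah blah blah something keep
print(with_lookahead.group(1))     # blah blah blah something
```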

See the Python demo:

import regex
pattern = r"(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))"
text = "drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)"
print( [x.group() for x in regex.finditer(pattern, text)] )
# => ['drop = blah blah blah something', 'keep = bar foo nlah aaaa', 'rename = (a=b d=e)', 'obs=4', 'where = (foo > 45 and bar == 35)']
print( regex.findall(pattern, text) )
# => [('drop', 'blah blah blah something'), ('keep', 'bar foo nlah aaaa'), ('rename', 'a=b d=e'), ('obs', '4'), ('where', 'foo > 45 and bar == 35')]
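
The standard-library re module has no branch reset group (?|...). A rough port, offered as a sketch rather than as the answer's own code, gives each alternative its own capturing group and picks whichever one took part in the match. The name pairs is an assumption made for this example.

```python
import re

# Same pattern as above, but the two alternatives capture into
# group 2 (bare words) and group 3 (parenthesised value) separately.
pattern = re.compile(
    r"\b(drop|keep|where|rename|obs)\s*=\s*"
    r"(?:(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))",
    re.IGNORECASE,
)
text = ("drop = blah blah blah something keep = bar foo nlah aaaa "
        "rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)")

# Exactly one of group 2 / group 3 is set for each match.
pairs = [
    (m.group(1), m.group(2) if m.group(2) is not None else m.group(3))
    for m in pattern.finditer(text)
]
print(pairs)
```

Like the original, the lazy (.*?) stops at the first closing parenthesis, so this variant does not handle nested parentheses.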

2 answers

solution1
2 2020-09-15 08:02:21

solution2
1 ACCEPTED 2020-09-16 13:39:14
