
Python regex: match until a certain word after indentation

Given the following string or similar:

baz: bar
key: >
   lorem ipsum 1213 __ ^123   
   lorem ipsum

foo:bar
anotherkey: >
   lorem ipsum 1213 __ ^123   
   lorem ipsum

I am trying to build a regex which captures all values after a key followed by a > sign.

So for the above example, I want to match from key up to (but excluding) foo, and then from anotherkey to the end. I managed to come up with a regex which does the job, but only if I know the name of foo:

\w+:\s>\n\s+[\S+\s+]+(?=foo)

But this is not really a good solution. If I remove ?=foo then the match will include everything to the end of the string. How can I fix this regex to match the values after > as described?

(As per request ;)

You could use something like

^\w+:\s*>\n(?:[ \t].*\n?)+

(This is without the groups. If you decide you want them, see the comments to the question.)

It matches the start of a line ( ^ ) followed by at least one word character ( \w : A-Z, a-z, 0-9 or '_'. Could be changed to [a-z] if only lowercase letters should be allowed) and a colon.

Then it matches optional whitespace ( \s* ) followed by the > key-terminator and a line feed ( \n ).

Then a non-capturing group ( (?: ) matching:

  • a space or a tab
  • followed by any character up to a line feed
  • an optional line feed

This group (matching an indented line) can be repeated any number of times, but must occur at least once (hence the + after the closing parenthesis).

See it here at regex101.
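As a minimal sketch of how this pattern could be used from Python: the sample text is copied from the question (with a trailing newline assumed), and the names text, block_re and blocks are only illustrative. The re.MULTILINE flag is needed so that ^ matches at the start of each line.

```python
import re

# Sample input from the question (a trailing newline is assumed here).
text = (
    "baz: bar\n"
    "key: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
    "\n"
    "foo:bar\n"
    "anotherkey: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
)

# re.MULTILINE makes ^ match at the start of every line,
# not only at the start of the whole string.
block_re = re.compile(r"^\w+:\s*>\n(?:[ \t].*\n?)+", re.MULTILINE)

# Each match is the "key: >" line plus all indented lines that follow it.
blocks = [m.group(0) for m in block_re.finditer(text)]
for block in blocks:
    print(repr(block))
```

With this input, two blocks are found (key and anotherkey). Each one stops at the first line that does not start with a space or tab, so the name of the next key never has to be known.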

You can tweak your regex to this:

(\w+:\s+>\n\s+[\S\s]+?)(?=\n\w+:\w+\n|\Z)

RegEx Demo

The lookahead (?=\n\w+:\w+\n|\Z) asserts the presence of a key:value line or the end of input ( \Z ) after your non-greedy match.
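A small Python sketch of the lookahead version, run against the sample from the question (trailing newline assumed; the names text, lookahead_re and blocks are illustrative). No flags are needed because the pattern does not use ^ or $.

```python
import re

# Sample input from the question (a trailing newline is assumed here).
text = (
    "baz: bar\n"
    "key: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
    "\n"
    "foo:bar\n"
    "anotherkey: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
)

# The lazy [\S\s]+? grows one character at a time until the lookahead
# sees either a "key:value" line or the end of the string (\Z).
lookahead_re = re.compile(r"(\w+:\s+>\n\s+[\S\s]+?)(?=\n\w+:\w+\n|\Z)")

# With a single capturing group, findall returns the group contents.
blocks = lookahead_re.findall(text)
for block in blocks:
    print(repr(block))
```

Note that the lookahead only recognises a terminating line of the form key:value with no whitespace after the colon (like foo:bar in the sample). A following line such as baz: bar would not stop the match, so the lookahead may need adjusting for other inputs.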

Alternatively, this better-performing regex can be used (thanks to Wiktor for the helpful comments below):

\w+:\s+>\n(.*(?:\n(?!\n\w+:\w+\n).*)+)

RegEx Demo 2
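The same sample can be run through this second pattern in Python. This is only a sketch (trailing newline assumed; text, values_re and values are illustrative names). Here the capturing group holds just the indented value lines, without the key line.

```python
import re

# Sample input from the question (a trailing newline is assumed here).
text = (
    "baz: bar\n"
    "key: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
    "\n"
    "foo:bar\n"
    "anotherkey: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
)

# Each repetition consumes a newline plus the rest of the next line, unless
# that newline is directly followed by a blank line and a "key:value" line.
values_re = re.compile(r"\w+:\s+>\n(.*(?:\n(?!\n\w+:\w+\n).*)+)")

# findall returns only group 1, i.e. the value lines after "key: >".
values = values_re.findall(text)
for value in values:
    print(repr(value))
```

Because the pattern consumes whole lines instead of growing a lazy match character by character, it needs fewer steps, which is presumably what "better performing" refers to.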

One

If you are not sure whether the indentation is always present, this is the simplest way to achieve the desired result:

^\w+:\s+>(?:\s?[^:]*$)*

Live demo

Explanation:

^               # Start of line
\w+:\s+>        # Match specific block
(?:             # Start of non-capturing group (a)
    \s?             # Match an optional whitespace character (here, the newline)
    [^:]*$          # Match up to the end of a line, as long as no : is encountered
)*              # End of non-capturing group (a) (zero or more times - greedy)

You need the m (multiline) flag to be on, as demonstrated in the live demo.
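In Python, the m flag corresponds to re.MULTILINE. A minimal sketch against the sample from the question (trailing newline assumed; text, no_colon_re and blocks are illustrative names):

```python
import re

# Sample input from the question (a trailing newline is assumed here).
text = (
    "baz: bar\n"
    "key: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
    "\n"
    "foo:bar\n"
    "anotherkey: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
)

# re.MULTILINE is the Python equivalent of the m flag: ^ and $ then
# match at the start and end of every line.
no_colon_re = re.compile(r"^\w+:\s+>(?:\s?[^:]*$)*", re.MULTILINE)

blocks = [m.group(0) for m in no_colon_re.finditer(text)]
for block in blocks:
    print(repr(block))
```

Since [^:] also matches newlines, the match runs across lines until it would hit a colon and then backs up to the nearest line end. One consequence is that a value line which itself contains a colon would end the match early.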

Two - the simplest

If leading whitespace is always present, then you can go with this safer regex:

^\w+:\s+>(?:\s?[\t ]+.*)*

Live demo

The m modifier should be set here as well.
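A short Python sketch of this indentation-based variant, again using re.MULTILINE for the m modifier (sample text from the question with a trailing newline assumed; text, indented_re and blocks are illustrative names):

```python
import re

# Sample input from the question (a trailing newline is assumed here).
text = (
    "baz: bar\n"
    "key: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
    "\n"
    "foo:bar\n"
    "anotherkey: >\n"
    "   lorem ipsum 1213 __ ^123   \n"
    "   lorem ipsum\n"
)

# Each repetition takes an optional newline, then at least one space or tab,
# then the rest of that line; an unindented or blank line ends the block.
indented_re = re.compile(r"^\w+:\s+>(?:\s?[\t ]+.*)*", re.MULTILINE)

blocks = [m.group(0) for m in indented_re.finditer(text)]
for block in blocks:
    print(repr(block))
```

Unlike the previous variant, colons inside the indented lines do not matter here, because only the leading whitespace decides whether a line belongs to the block.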

