Python regex match until certain word after indentation
Given the following string or similar:
baz: bar
key: >
    lorem ipsum 1213 __ ^123
    lorem ipsum
foo:bar
anotherkey: >
    lorem ipsum 1213 __ ^123
    lorem ipsum
I am trying to build a regex that captures all values after a key followed by a > sign.
So for the above example, I want to match from key to foo (excluding) and then from anotherkey to the end. I managed to come up with a regex that does the job, but only if I know the name of foo:
\w+:\s>\n\s+[\S+\s+]+(?=foo)
But this is not really a good solution. If I remove (?=foo), the match includes everything to the end of the string. How can I fix this regex so that it matches the values after > as described?
(As per request ;)
You could use something like
^\w+:\s*>\n(?:[ \t].*\n?)+
(This is without the groups. If you decide you want them, see the comments to the question.)
It matches the start of a line (^) followed by at least one word character (\w+, that is A-Z, a-z, 0-9 or '_'; could be changed to [a-z]+ if only lowercase letters should be allowed) and a colon.
Then it matches optional whitespace (\s*) followed by the > key terminator and a line feed (\n).
Then a non-capturing group ((?:...)) matching a space or tab ([ \t]), the rest of the line (.*) and an optional line feed (\n?).
This group (matching an indented line) can be repeated any number of times, but must occur at least once ()+).
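As a rough illustration of how this pattern behaves in Python, the sketch below runs it over the sample from the question. The four-space indentation of the sample and the names text, pattern and blocks are assumptions made for the example; re.MULTILINE is needed so that ^ matches at the start of every line rather than only at the start of the string.

```python
import re

# Sample from the question; the four-space indentation is an assumption.
text = "\n".join([
    "baz: bar",
    "key: >",
    "    lorem ipsum 1213 __ ^123",
    "    lorem ipsum",
    "foo:bar",
    "anotherkey: >",
    "    lorem ipsum 1213 __ ^123",
    "    lorem ipsum",
])

# re.MULTILINE lets ^ anchor at the start of each line, not just the string.
pattern = re.compile(r"^\w+:\s*>\n(?:[ \t].*\n?)+", re.MULTILINE)

# The pattern has no capturing groups, so findall returns each whole match.
blocks = pattern.findall(text)

for block in blocks:
    print(repr(block))
```

Because the group consumes the optional trailing \n, a block that is followed by another line ends with a newline; calling rstrip("\n") on each block would remove it if that is not wanted.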
You can tweak your regex to this:
(\w+:\s+>\n\s+[\S\s]+?)(?=\n\w+:\w+\n|\Z)
The lookahead (?=\n\w+:\w+\n|\Z) asserts the presence of a key:value line or the end of input (\Z) after your non-greedy match.
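A minimal sketch of the lazy match with lookahead, run in Python against the sample from the question. The indentation of the sample and the variable names are assumptions made for the example.

```python
import re

# Sample from the question; the four-space indentation is an assumption.
text = "\n".join([
    "baz: bar",
    "key: >",
    "    lorem ipsum 1213 __ ^123",
    "    lorem ipsum",
    "foo:bar",
    "anotherkey: >",
    "    lorem ipsum 1213 __ ^123",
    "    lorem ipsum",
])

# The lazy [\S\s]+? grows one character at a time until the lookahead
# succeeds: either a "key:value" line follows, or the string ends (\Z).
pattern = re.compile(r"(\w+:\s+>\n\s+[\S\s]+?)(?=\n\w+:\w+\n|\Z)")

# One capturing group, so findall returns the contents of that group.
blocks = pattern.findall(text)

for block in blocks:
    print(repr(block))
```

Note that the lookahead only recognises lines of the form key:value with no space after the colon, like foo:bar in the sample; a following line such as baz: bar would not end the block.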
Alternatively, this better-performing regex can be used (thanks to Wiktor for the helpful comments below):
\w+:\s+>\n(.*(?:\n(?!\w+:\w+\n).*)+)
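The unrolled form can be sketched in Python as follows. In this sketch the negative lookahead sits directly after the consumed \n, so it inspects the start of the next line; that placement, the sample indentation and the variable names are assumptions made for the example.

```python
import re

# Sample from the question; the four-space indentation is an assumption.
text = "\n".join([
    "baz: bar",
    "key: >",
    "    lorem ipsum 1213 __ ^123",
    "    lorem ipsum",
    "foo:bar",
    "anotherkey: >",
    "    lorem ipsum 1213 __ ^123",
    "    lorem ipsum",
])

# Each repetition consumes a newline plus one whole line, but only when
# that line is not of the form "key:value". This avoids the per-character
# lookahead checks of a lazy [\S\s]+? match.
pattern = re.compile(r"\w+:\s+>\n(.*(?:\n(?!\w+:\w+\n).*)+)")

# The capturing group holds only the value lines, not the "key: >" line.
values = pattern.findall(text)

for value in values:
    print(repr(value))
```

The trailing + means that at least two value lines are required for a match; replacing it with * would also accept single-line values.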
If you are not sure whether the indentation is present, this is the simplest way to achieve the desired result:
^\w+:\s+>(?:\s?[^:]*$)*
Explanation:
^            # Start of line
\w+:\s+>     # Match the key, a colon, whitespace and the > sign
(?:          # Start of non-capturing group (a)
    \s?      # Match an optional whitespace character (here, the newline)
    [^:]*$   # Match up to the end of a line, as long as no : is crossed
)*           # End of non-capturing group (a) (zero or more times, greedy)
You need the m (multiline) flag to be on, as demonstrated in the live demo. In Python this is re.MULTILINE.
If leading whitespace is always present, you can go with this safer regex:
^\w+:\s+>(?:\s?[\t ]+.*)*
The m modifier should be set here as well.
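To compare the two patterns from this answer, here is a small Python sketch that applies both to the sample from the question with re.MULTILINE set. The sample indentation and the names loose and strict are assumptions made for the example.

```python
import re

# Sample from the question; the four-space indentation is an assumption.
text = "\n".join([
    "baz: bar",
    "key: >",
    "    lorem ipsum 1213 __ ^123",
    "    lorem ipsum",
    "foo:bar",
    "anotherkey: >",
    "    lorem ipsum 1213 __ ^123",
    "    lorem ipsum",
])

# Variant 1: keep going as long as no colon is crossed before a line end.
loose = re.compile(r"^\w+:\s+>(?:\s?[^:]*$)*", re.MULTILINE)

# Variant 2: keep going only while each following line starts with a
# space or tab.
strict = re.compile(r"^\w+:\s+>(?:\s?[\t ]+.*)*", re.MULTILINE)

loose_blocks = loose.findall(text)
strict_blocks = strict.findall(text)

for block in strict_blocks:
    print(repr(block))
```

On this sample both patterns return the same two blocks. They differ when a value line contains a colon: the first pattern then ends the block early, while the second relies on indentation alone.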