
Non-greedy Python regex from the end of a string

I need to search a string in Python 3, and I'm having trouble implementing non-greedy logic that starts from the end of the string.

Let me explain with an example.

The input can be any of the following:

test1 = 'AB_x-y-z_XX1234567890_84481.xml' 
test2 = 'x-y-z_XX1234567890_84481.xml'
test3 = 'XX1234567890_84481.xml'

I need to find the last part of the string, which has the form:

somestring_otherstring.xml

In all of the cases above, the regex should return XX1234567890_84481.xml.

My best attempt is:

import re

result = re.search(r'(_.+)?\.xml$', test1, re.I).group()
print(result)

Here I used:

(_.+)? to match "_anystring" in non-greedy mode

\.xml$ to match ".xml" at the end of the string

The output I get is not correct:

_x-y-z_XX1234567890_84481.xml

I found some SO questions (link) explaining that a regex starts matching from the left even with a non-greedy quantifier.

Could anyone explain how to implement a non-greedy regex from the right?

You need to use this regex to capture what you want:

[^_]*_[^_]*\.xml


Check out this Python code:

import re

arr = ['AB_x-y-z_XX1234567890_84481.xml','x-y-z_XX1234567890_84481.xml','XX1234567890_84481.xml']

for s in arr:
    m = re.search(r'[^_]*_[^_]*\.xml', s)
    if m:
        print(m.group(0))

This prints:

XX1234567890_84481.xml
XX1234567890_84481.xml
XX1234567890_84481.xml

The problem with your regex (_.+)?\.xml$ is that the (_.+)? part starts matching at the first _ and matches everything up to the literal .xml at the end. The ? that follows the group only makes the whole group optional; it does not make .+ non-greedy. As a result, for the text _x-y-z_XX1234567890_84481.xml the group matches _x-y-z_XX1234567890_84481, which is not the behavior you wanted.
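A minimal sketch of the point above, using the first test string from the question. It suggests that even an explicitly lazy quantifier (.+?) gives the same result, because the engine still reports the leftmost position at which a match is possible. The variable names are illustrative only.

```python
import re

s = 'AB_x-y-z_XX1234567890_84481.xml'

# The asker's pattern: greedy .+ inside an optional group.
greedy = re.search(r'(_.+)?\.xml$', s).group()

# Same pattern with a lazy quantifier; the match still starts at the first "_".
lazy = re.search(r'(_.+?)?\.xml$', s).group()

print(greedy)
print(lazy)
```

Both calls return the same text, so laziness alone does not move the start of the match to the right.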

Your pattern (_.+)?\.xml$ captures, in an optional group, everything from the first underscore up to the .xml at the end of the string. It does not take into account how many underscores should occur in between.

To match only the last part, you can omit the capturing group. You could use a negated character class and the anchor $ to assert the end of the string, since this is the last part:

[^_]+_[^_]+\.xml$


That will match:

  • [^_]+ Match 1 or more characters that are not _
  • _ Match a literal _
  • [^_]+ Match 1 or more characters that are not _
  • \.xml$ Match .xml at the end of the string

For example:

import re

test1 = 'AB_x-y-z_XX1234567890_84481.xml'
result = re.search(r'[^_]+_[^_]+\.xml$', test1, re.I)
if result:
    print(result.group())
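As a rough check, the sketch below runs the anchored pattern against all three inputs from the question and compares it with an unanchored variant that uses * instead of +. The string 'a_b.xml.bak' and the names strict and loose are hypothetical, chosen only to show where the two patterns differ.

```python
import re

# Anchored pattern: both fields must be non-empty and .xml must end the string.
strict = re.compile(r'[^_]+_[^_]+\.xml$', re.I)
# Unanchored variant with * instead of +, shown only for comparison.
loose = re.compile(r'[^_]*_[^_]*\.xml')

names = [
    'AB_x-y-z_XX1234567890_84481.xml',
    'x-y-z_XX1234567890_84481.xml',
    'XX1234567890_84481.xml',
]
results = [strict.search(n).group() for n in names]
print(results)

# Hypothetical input where .xml is not at the end of the string.
trailing = 'a_b.xml.bak'
strict_hit = strict.search(trailing)          # no match because of the $ anchor
loose_hit = loose.search(trailing).group()    # matches the 'a_b.xml' prefix
print(strict_hit, loose_hit)
```

Whether the stricter or the looser behavior is preferable depends on how much the file names can vary.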

I'm not sure whether this matches what you mean conceptually by "non greedy from the right", but this pattern yields the correct answer:

r'[^_]+_[^_]+\.xml$'

[^_] is a character class that matches any character other than an underscore.
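If a regex is not strictly required, splitting from the right is another way to think about "from the end". This is a different technique from the pattern above, not something proposed in the question, and the helper name last_two_fields is hypothetical. It assumes the wanted part is always the last two underscore-separated fields, and it does not check for the .xml extension.

```python
def last_two_fields(name):
    """Return the last two underscore-separated fields of name, rejoined with '_'."""
    # rsplit with maxsplit=2 splits at most twice, starting from the right.
    return '_'.join(name.rsplit('_', 2)[-2:])


names = [
    'AB_x-y-z_XX1234567890_84481.xml',
    'x-y-z_XX1234567890_84481.xml',
    'XX1234567890_84481.xml',
]
for n in names:
    print(last_two_fields(n))
```

Unlike the regex, a name without any underscore comes back unchanged from this helper, so a separate validity check may still be needed.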


 