Python Regex: How to extract string between parentheses and quotes if they exist

I am trying to extract the value/argument of each trigger in Jenkinsfiles, i.e. the text between the parentheses and between the quotes if they exist.

For example, given the following:

upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)  # just parentheses
pollSCM('H * * * *')     # single quotes and parentheses

Desired result respectively:

upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS
H * * * *

My current result:

upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS
H * * * *'        # Notice the trailing single quote

So far I have been successful with the first trigger (the upstream one), but not with the second one (pollSCM), because there is still a trailing single quote.

After the group (.+) , the pattern does not match the trailing single quote with \'* , but it does match the closing parenthesis with \) . I could simply use .replace() or .strip() to remove it, but what is wrong with my regex pattern? How can I improve it? Here's my code:

import re

pattern = r"[A-Za-z]*\(\'*\"*(.+)\'*\"*\)"
text1 = r"upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)"
text2 = r"pollSCM('H * * * *')"
trigger_value1 = re.search(pattern, text1).group(1)
trigger_value2 = re.search(pattern, text2).group(1)  # still ends with a single quote

import re

# Lazily capture everything between each pair of parentheses
s = """upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)  # just parentheses
pollSCM('H * * * *')"""
print(re.findall(r"\((.*?)\)", s))

Output:

["upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS", "'H * * * *'"]

The \'* part of your pattern means 0 or more matches of the single quote. Because .+ is greedy, it grabs the last ' itself, and \'* then matches zero characters. You need to add ? to (.+) so that it is not greedy. Basically, it then grabs everything only until it comes across the closing ' .

This pattern will work for you: [A-Za-z]*\(\'*\"*(.+?)\'*\"*\)
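As a quick check of the explanation above, here is a minimal sketch that runs the question's greedy pattern and the lazy variant side by side on the two sample triggers. The variable names (greedy, lazy, and so on) are only for illustration.

```python
import re

# Sample inputs copied from the question
text1 = "upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)"
text2 = "pollSCM('H * * * *')"

greedy = r"[A-Za-z]*\(\'*\"*(.+)\'*\"*\)"   # original pattern from the question
lazy = r"[A-Za-z]*\(\'*\"*(.+?)\'*\"*\)"    # same pattern with a lazy group

greedy_value = re.search(greedy, text2).group(1)
lazy_value = re.search(lazy, text2).group(1)
lazy_value1 = re.search(lazy, text1).group(1)

print(repr(greedy_value))  # "H * * * *'"  (trailing quote kept)
print(repr(lazy_value))    # 'H * * * *'
print(lazy_value1)         # the full argument list of upstream(...)
```

The lazy group stops as soon as the rest of the pattern (optional quotes, then the closing parenthesis) can match, so the inner quotes around upstreamJob are left untouched.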

[UPDATE]

To answer your question from the comments, I'll just add it here.

So the ? will make it not greedy up until the next character indicated in the pattern?

Yes, it basically changes repetition operators to not be greedy (a lazy quantifier), because they are greedy by default. So .*?a will match everything up to and including the first a , while .*a will match everything up to and including the last a in the string. So if your string is aaaaaaaa and your regex is .*?a , it will actually match every a separately. As an example, if you use .*?a on the string aaaaaaaa and substitute b for every match, you will get the string bbbbbbbb . With .*a , however, the same substitution on aaaaaaaa returns a single b .

Here's a link that explains the different quantifier types (greedy, lazy, possessive): http://www.rexegg.com/regex-quantifiers.html
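The substitution example above can be reproduced with re.sub; this is a small sketch, and the variable names are only illustrative.

```python
import re

s = "aaaaaaaa"

# Lazy: each match stops at the first "a", so there are eight one-character matches
lazy_matches = re.findall(r".*?a", s)
lazy_result = re.sub(r".*?a", "b", s)

# Greedy: a single match runs to the last "a" and covers the whole string
greedy_matches = re.findall(r".*a", s)
greedy_result = re.sub(r".*a", "b", s)

print(lazy_matches, lazy_result)      # ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'] bbbbbbbb
print(greedy_matches, greedy_result)  # ['aaaaaaaa'] b
```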

For your example data you could make the ' optional with '? , capture your values in a group, and then loop through the captured groups.

\('?(.*?)'?\)

import re

regex = r"\('?(.*?)'?\)"

test_str = ("upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)  # just parentheses\n"
    "pollSCM('H * * * *')     # single quotes and parentheses")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match in matches:
    # Print every captured group; group numbers start at 1
    for group_num in range(1, len(match.groups()) + 1):
        print(match.group(group_num))

That would give you:

upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS
H * * * *

To get a stricter match you could use an alternation that matches between () or ('') but not with a single ' as in ('H * * * *) , and then loop through the captured groups. Because you now capture 2 groups, of which only 1 takes part in each match (the other one is None in Python), you should check that you only retrieve the group that actually matched.

\((?:'(.*?)'|([^'].*?[^']))\)
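A minimal sketch of how the stricter pattern might be used, picking whichever of the two groups took part in the match. The helper name trigger_value is hypothetical and not part of the original answer.

```python
import re

# Group 1: content wrapped in single quotes; group 2: unquoted content
strict = re.compile(r"\((?:'(.*?)'|([^'].*?[^']))\)")

def trigger_value(line):
    """Return the text inside the parentheses, or None if the line does not match."""
    match = strict.search(line)
    if match is None:
        return None
    # Only one alternative matches; the group of the other one is None
    if match.group(1) is not None:
        return match.group(1)
    return match.group(2)

lines = [
    "upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)  # just parentheses",
    "pollSCM('H * * * *')     # single quotes and parentheses",
    "pollSCM('H * * * *)",  # unbalanced quote, rejected by the strict pattern
]
for line in lines:
    print(trigger_value(line))
```

Note that the unquoted alternative needs at least two characters between the parentheses, so a one-character argument such as (x) would not match; whether that matters depends on your Jenkinsfiles.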
