python regex: multiline and non-greedy

I have some text like this:

cc.Action = {
};

cc.FiniteTimeAction = {

};

cc.Speed = {

};

And the result (a list) I want is:

['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']

And here's what I have tried:

input = codecs.open(self.input_file, "r", "utf-8")
content = input.read()
result = re.findall('cc\..*= {.*};', content, re.S)
for r in result:
    print r
    print '---------------'

And the result is:

[
'cc.Action = {
};

cc.FiniteTimeAction = {

};

cc.Speed = {

};'
]

Any suggestions will be appreciated, thanks :)

The beginning of the match seems to be cc. and the end of the match seems to be ; so we can use the pattern:

'cc\.[^;]+'

Meaning, we match cc. and then match every character which is not ; ( [] encloses a character class, ^ negates the class).

You could also use the non-greedy repeat *? , but in this case I would say it's overkill. The simpler the regex, the better.

To get the desired output you would also have to get rid of the newlines. Together I would propose:

result = re.findall(r'cc\.[^;]+', content.replace('\n', ''))
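As a self-contained sketch of the negated-character-class approach (written for Python 3; content is a stand-in for the text read from the file, and extract_blocks is a hypothetical helper name, not something from the question):

```python
import re

# Sample text standing in for the file contents from the question.
content = "cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};\n"


def extract_blocks(text):
    """Return each 'cc.' statement without newlines or the trailing ';'."""
    flattened = text.replace("\n", "")
    # [^;]+ consumes everything up to, but not including, the next ';'.
    return re.findall(r"cc\.[^;]+", flattened)


blocks = extract_blocks(content)
```

Because the character class can never cross a ; , no DOTALL flag or lazy quantifier is needed here.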

The problem is that you are using a greedy search. You need a non-greedy search, which you get by appending ? to the quantifier.

import re
print [i.replace("\n", "") for i in re.findall(r"cc\..*?{.*?}", data, re.DOTALL)]
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']

If you don't use .*? , then .*{ matches up to the last { in the string, so all the blocks are treated as a single match. With a non-greedy match, it matches only up to the first { after the current position.
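The difference is easiest to see on a short string. This is a minimal illustration in Python 3; the sample string is made up for the example, and the brace is escaped only for readability:

```python
import re

sample = "x = {1}; y = {2};"

# Greedy .* grabs as much as possible, so the '{' it ends on is the last one.
greedy = re.match(r".*\{", sample).group()

# Lazy .*? grabs as little as possible, so it ends on the first '{'.
lazy = re.match(r".*?\{", sample).group()
```

Here greedy ends at the second { while lazy ends at the first one.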

Also, this can be done without using a regex, like this:

print [item.replace("\n", "") for item in data.split(";") if item]
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']

Just split the string on ; and, if the current piece is not empty, replace all the \n (newline characters) with empty strings.
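One caveat: if the text ends with a newline after the last ; , the final piece is "\n" rather than an empty string, so it passes the emptiness check. A possible variant (Python 3; split_statements is a hypothetical helper name) filters after removing the newlines instead:

```python
def split_statements(text):
    """Split on ';', drop newlines, and skip pieces that end up empty."""
    pieces = (piece.replace("\n", "") for piece in text.split(";"))
    return [piece for piece in pieces if piece.strip()]


# Same text as in the question, but with a trailing newline.
data = "cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};\n"
statements = split_statements(data)
```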

As your title suggests, the issue is greediness: cc\..*= matches from the beginning of the string to the last = .

You can avoid this behavior by using a lazy quantifier, which stops at the earliest occurrence of the following character:

cc\..*?= {.*?};

Demo here: http://regex101.com/r/oL4yG7
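Applied to the text from the question, the two patterns can be compared roughly like this (a Python 3 sketch; the variable names are illustrative, and the final clean-up step is an addition to reach the list the question asks for):

```python
import re

text = "cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};"

# Greedy pattern from the question: a single match spanning all three blocks.
greedy = re.findall(r"cc\..*= {.*};", text, re.S)

# Lazy pattern: each quantifier stops at the earliest point that still matches.
lazy = re.findall(r"cc\..*?= {.*?};", text, re.S)

# The lazy matches still contain the newlines and the trailing ';'.
cleaned = [m.replace("\n", "").rstrip(";") for m in lazy]
```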

If you remove the newlines and split on ; :

codes.split(';')

Output:

['cc.Action = {}', ' cc.FiniteTimeAction = {}', 'cc.Speed = {}', '']
>>> 'cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};'.replace('\n','').split(";")
['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}', '']

This will work for you.
