Python regex: multiline and non-greedy
I have some text like this:
cc.Action = {
};
cc.FiniteTimeAction = {
};
cc.Speed = {
};
And the result (list) I want is:
['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
And here's what I have tried:
input = codecs.open(self.input_file, "r", "utf-8")
content = input.read()
result = re.findall(r'cc\..*= {.*};', content, re.S)
for r in result:
    print r
    print '---------------'
And the result is:
[
'cc.Action = {
};
cc.FiniteTimeAction = {
};
cc.Speed = {
};'
]
Any suggestion will be appreciated, thanks :)
The match seems to begin with cc. and end with ; so we can use the pattern:
'cc\.[^;]+'
Meaning, we match cc. and then match every character which is not ; ([] encloses a character class, ^ negates the class).
You could also use the non-greedy repeat *?, but in this case I would say it's overkill. The simpler the regex, the better.
To get the desired output you would also have to get rid of the newlines. Together I would propose:
result = re.findall(r'cc\.[^;]*;', content.replace('\n', ''))
Note that this pattern keeps the trailing ; in each match; the 'cc\.[^;]+' form leaves it out.
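A minimal sketch of the negated-character-class approach, assuming Python 3 and a hypothetical content string that mirrors the sample text from the question (the variable names are illustrative, not from the original post):

```python
import re

# Hypothetical sample input, mirroring the text in the question.
content = "cc.Action = {\n};\ncc.FiniteTimeAction = {\n};\ncc.Speed = {\n};\n"

# Remove the newlines first, then match 'cc.' followed by every
# character up to (but not including) the next ';'.
flattened = content.replace("\n", "")
result = re.findall(r"cc\.[^;]+", flattened)

print(result)
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
```

Because [^;] also matches newline characters, no re.S flag is needed here; the newlines are removed only so that they do not end up in the results.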
The problem is that you are using a greedy search. You need a non-greedy search, using the ? operator:
import re
print [i.replace("\n", "") for i in re.findall(r"cc\..*?{.*?}", data, re.DOTALL)]
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
If you don't use .*?, then .*{ will match up to the last { in the string, so all the entries are treated as a single string. With a non-greedy match, it matches only up to the first { after the current character.
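To see the difference side by side, here is a small sketch, assuming Python 3 and a hypothetical data string shaped like the question's input. The braces are escaped here only to make it explicit that they are literals; that is a stylistic choice, not part of the original answer.

```python
import re

# Hypothetical sample input, mirroring the text in the question.
data = "cc.Action = {\n};\ncc.FiniteTimeAction = {\n};\ncc.Speed = {\n};\n"

# Greedy: '.*' grabs as much as possible, so the first match runs
# from the first 'cc.' all the way to the last '}'.
greedy = re.findall(r"cc\..*\{.*\}", data, re.DOTALL)

# Non-greedy: '.*?' stops at the first '{' and the first '}' it can.
lazy = re.findall(r"cc\..*?\{.*?\}", data, re.DOTALL)

print(len(greedy))  # 1
print(len(lazy))    # 3
print([m.replace("\n", "") for m in lazy])
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
```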
Also, this can be done without using a regex, like this:
print [item.replace("\n", "") for item in data.split(";") if item]
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
Just split the string on ; and, if the current string is not empty, replace all the \n (newline characters) with empty strings.
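One caveat, sketched below under the assumption that the input ends with a newline after the last ; (as a file read from disk typically would): the final piece is then "\n" rather than an empty string, so it passes the emptiness check before the newline is removed. Filtering after the replacement avoids this. The data string is a hypothetical stand-in for the question's input, in Python 3 syntax.

```python
# Hypothetical sample input with a trailing newline, as read from a file.
data = "cc.Action = {\n};\ncc.FiniteTimeAction = {\n};\ncc.Speed = {\n};\n"

# Strip the newlines from every piece first ...
pieces = [part.replace("\n", "") for part in data.split(";")]

# ... and only then drop the pieces that ended up empty.
result = [piece for piece in pieces if piece]

print(result)
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
```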
As your title suggests, the issue is greediness: cc\..*= matches from the beginning of the string to the last =.
You can avoid this behavior by using a lazy quantifier, which will try to stop at the earliest occurrence of the following character:
cc\..*?= {.*?};
Demo here: http://regex101.com/r/oL4yG7
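In Python the lazy pattern still needs the re.S (re.DOTALL) flag from the question, because the braces sit on different lines and . does not match a newline by default. A short sketch, assuming Python 3 and a hypothetical text string shaped like the question's input (braces escaped here for clarity):

```python
import re

# Hypothetical sample input, mirroring the text in the question.
text = "cc.Action = {\n};\ncc.FiniteTimeAction = {\n};\ncc.Speed = {\n};\n"
pattern = r"cc\..*?= \{.*?\};"

# Without re.S, '.' cannot cross the newline between '{' and '}'.
without_flag = re.findall(pattern, text)

# With re.S, '.' matches newlines too, and the lazy quantifiers
# keep each match confined to a single block.
with_flag = re.findall(pattern, text, re.S)

# Drop the newlines and the trailing ';' to get the desired list.
cleaned = [m.replace("\n", "").rstrip(";") for m in with_flag]

print(without_flag)  # []
print(cleaned)
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
```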
If you split on the ; character:
codes.split(';')
Output:
['cc.Action = {}', ' cc.FiniteTimeAction = {}', 'cc.Speed = {}', '']
>>> 'cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};'.replace('\n','').split(";")
['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}', '']
This will work for you.