
Python regular expression not matching line

My example log file is big and contains the lines below.

<6>[16495.700255]

Memory - START UC1

<4>16495.723327 C0  Memory - START UC1

<4>[16495.723327] C0 [             sh] Memory - START UC1

I am looking for Memory - START UC1. The regular expression below gets the first two lines but not the third.

re.compile("(Memory - +(.*)$)")

Use re.MULTILINE as a flag to re.compile, or add (?m) to the start of the regex. Unless MULTILINE mode is on, $ matches only at the end of the string (or just before a trailing newline); with it on, $ also matches at the end of every line.
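A minimal sketch of the difference, assuming the whole log has been read into one string (the text variable below is assembled from two of the sample lines in the question and is only an illustration):

```python
import re

# Two of the sample log lines joined into one string, as file.read() would
# return them (an assumption about how the data is loaded).
text = (
    "<4>16495.723327 C0  Memory - START UC1\n"
    "<4>[16495.723327] C0 [             sh] Memory - START UC1\n"
)

plain = re.compile(r"(Memory - +(.*)$)")
multi = re.compile(r"(Memory - +(.*)$)", re.MULTILINE)

# Without MULTILINE, $ can only match at the very end of the string,
# so only the last occurrence is found.
print(len(plain.findall(text)))  # 1

# With MULTILINE, $ also matches before each newline.
print(len(multi.findall(text)))  # 2
```

If the file is processed one line at a time instead, each line is its own string and the flag makes no difference.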

I copied the original regex from your question, re.compile("(Memory - +(.*)$)"), into the code from your follow-up answer, ran that against the sample text from your question, and got all three matches.

@Smac89's suggestion of re.compile("(.*?Memory - START UC1)") is only necessary if you are calling the regex with event_regex.match(line), which is implicitly anchored to the beginning of the string ( ^ ). If you use search(line) or findall(line), the .*? is not needed to find the line, because those methods scan the whole string on their own. Its only effect there is to pull everything before "Memory" into the match (the scan starts at the beginning of the string, and the non-greedy .*? expands until the rest of the pattern matches), which makes the regex harder to read and the captured text longer.
And I'm afraid that the suggestion of [^.* ]? makes even less sense, unless I'm terribly mistaken (which happens far too often). That says: match zero or one characters from the character group that consists of all characters except a literal . , a literal * , or a space. Again, if you're not anchored to the beginning of the string, that part of the regex will most likely end up matching zero characters anyway.
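A quick way to see how match() and search() treat the two patterns, using the third sample line from the question (the names bare and lazy are only illustrative):

```python
import re

line = "<4>[16495.723327] C0 [             sh] Memory - START UC1"

bare = re.compile(r"Memory - START UC1")
lazy = re.compile(r".*?Memory - START UC1")

# match() only tries at position 0, so the bare pattern fails there.
print(bare.match(line))            # None

# search() scans forward and finds the target on its own.
print(bare.search(line).group(0))  # Memory - START UC1

# The lazy prefix lets match() succeed by consuming the leading text.
print(lazy.match(line).group(0) == line)   # True

# With search() a match is found either way; the lazy prefix only
# changes how much text the match covers.
print(lazy.search(line).group(0) == line)  # True
```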

Honestly, if you know that you want to match the exact string Memory - START UC1, then you're probably better off with a simple 'Memory - START UC1' in line rather than a regex.
But your initial regex contained " +" (that's 'space plus'), meaning one or more spaces, and if the number of spaces can vary, then yes, you do want a regex. You might also consider \s+ in that case, which matches both spaces and tabs (and a few other rarer whitespace characters). If there's a possibility of trailing spaces, then you should put \s* just before your $ end-of-string anchor. (I actually suspect that trailing space was the reason your initial regex was not matching that third occurrence of your target string.)
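As a rough sketch of the whitespace-tolerant variant: the lines below are made-up examples with irregular spacing, not taken from the real log, and the capture group is made non-greedy so that the trailing \s* keeps the whitespace out of the captured text.

```python
import re

# Hypothetical lines with extra spaces, a tab, and trailing whitespace.
lines = [
    "C0 Memory -   START UC1",
    "C0 Memory -\tSTART UC1   ",
    "C0 Memory - START UC1\t\n",
]

# \s+ accepts any run of whitespace after the dash; \s*$ absorbs
# trailing whitespace so it does not end up in group 1.
pattern = re.compile(r"Memory -\s+(.*?)\s*$")

captured = [pattern.search(line).group(1) for line in lines]
print(captured)  # ['START UC1', 'START UC1', 'START UC1']
```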

A couple of other tips:
In your initial regex, "(Memory - +(.*)$)", you have two capture groups (i.e. sets of parentheses), but I suspect that you only actually want one, depending on whether you're interested only in the "START UC1" part or in all of "Memory - START UC1".
Also, your if not line: clause never fires, because blank lines still have a linebreak. You could do line.strip(); since you already do a line.strip() later, I would just put a line = line.strip() at the top of the loop and then use line thereafter, rather than repeating the function call. It's a good thought to early-out, but in this case I'm not sure that it really saves you anything, since it doesn't take the regex engine long to figure out that there's no match on a blank line.
Final thought: it looks like you are expecting at most one match on a given line. If that's the case, then use search(...) rather than findall(...). No need to keep looking after you've found what you wanted.
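Putting these tips together, the loop might look roughly like the sketch below. The original loop from the follow-up is not shown here, so the function name find_events and the sample list are assumptions made for illustration; only event_regex echoes a name used above.

```python
import re

# One capture group: only the text after "Memory - " is kept.
event_regex = re.compile(r"Memory - +(.*)$")

def find_events(lines):
    """Return the captured text for every line that contains the marker."""
    events = []
    for line in lines:
        line = line.strip()   # strip once, then reuse the result
        if not line:          # now actually true for blank lines
            continue
        match = event_regex.search(line)  # at most one hit per line
        if match:
            events.append(match.group(1))
    return events

# Sample input shaped like lines read from a file (assumed data).
sample = [
    "<6>[16495.700255]\n",
    "\n",
    "<4>16495.723327 C0  Memory - START UC1\n",
    "<4>[16495.723327] C0 [             sh] Memory - START UC1\n",
]
print(find_events(sample))  # ['START UC1', 'START UC1']
```

Dropping the outer parentheses means match.group(1) is the only group to think about; keep them instead if the whole "Memory - START UC1" text is what you want.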

Regexes involve a bit of a learning curve, but they are amazingly powerful once you grok them. Keep at it!

Change your compile to:

re.compile("(.*?Memory - START UC1)")

See if that helps.

It seems to work on ideone.

If you just want to get the word, replace the regex with:

regex = compile(r'([^.* ]?Memory - START UC1)')
