
Python regex negative lookbehind

We parse logs created by automated scripts. A typical thing we care about is a version string such as '1.10.07-SNAPSHOT (1.10.07-20110303.024749-7)', taken from a line like the following:

15:28:02.115 - INFO   - TestLib: Successfully retrieved build version: '1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)'

The trouble is that some logs are created manually, with users entering this information themselves. To remind users of the format, a dialog with a template has been added:

02:24:50.655 - INFO   - gui: Step Dialog: For test results management purposes, specify the build in which the test is executed in the following format, build version: 'specify version here'
02:25:04.905 - INFO   - gui:     Response: OK
02:25:04.905 - INFO   - gui:     Comments: 'build version: '1.11.11''

My regex for this is currently `.*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`. The `(?!.*<)` was my first attempt to avoid this problem, because some users would put angle brackets inside the quotes. That doesn't catch the case above, though. I think the correct fix is a negative lookbehind that refuses to match if 'Step Dialog' is present on the line, but my attempts to write one are failing, according to regexr (for some reason it won't let me share the link to my saved form). I thought a negative lookbehind would look like `(?<!Step Dialog)`, giving:

`(?<!Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

but for some reason that matches both the first and third lines of the excerpt above.

Edit:
The `[Bb]` and `:*\s*` parts are for users who entered the information in not quite the right format, using multiple colons and spaces or a capitalized 'Build'. General suggestions for cleaning this up are appreciated; I'm relatively new to regexes.

You are close, but it still matches because the engine can find a string that satisfies `.*` without being preceded by `Step Dialog`. Lookaround assertions only constrain the position in the pattern where they appear. Thus, you have to force a check at every character that `Step Dialog` does not start there.
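To see why the lookbehind has no effect, here is a minimal sketch that runs the question's pattern against the template line from the log excerpt. The variable names are only illustrative. The lookbehind is tested once, at the position where the match starts, and at the start of the line nothing precedes it.

```python
import re

# The "Step Dialog" template line from the question's log excerpt.
template_line = (
    "02:24:50.655 - INFO   - gui: Step Dialog: For test results management "
    "purposes, specify the build in which the test is executed in the "
    "following format, build version: 'specify version here'"
)

# The question's attempt: the lookbehind sits in front of a greedy `.*`.
pattern = re.compile(
    r"(?<!Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'"
)

match = pattern.search(template_line)

# The match begins at index 0, where no text precedes it, so the
# lookbehind succeeds and `.*` then swallows "Step Dialog" itself.
print(match.start())   # 0
print(match.group(1))  # specify version here
```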

Try this:

`^(?:(?!Step Dialog).)*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

Now it ensures that `Step Dialog` does not start at any position between `^` (the beginning of the line) and `[Bb]uild [Vv]ersion`.

You'll notice I also changed the lookbehind to a lookahead, because it's easier to understand what's going on.
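As a quick check, the pattern above can be run over the four log lines quoted in the question. This is only a sketch: the list name `log_lines` and the loop are assumptions for illustration, not part of the original parsing script.

```python
import re

# The four log lines quoted in the question.
log_lines = [
    "15:28:02.115 - INFO   - TestLib: Successfully retrieved build version: "
    "'1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)'",
    "02:24:50.655 - INFO   - gui: Step Dialog: For test results management "
    "purposes, specify the build in which the test is executed in the "
    "following format, build version: 'specify version here'",
    "02:25:04.905 - INFO   - gui:     Response: OK",
    "02:25:04.905 - INFO   - gui:     Comments: 'build version: '1.11.11''",
]

# Each character consumed before "build version" is first checked by the
# negative lookahead, so "Step Dialog" cannot appear in that prefix.
pattern = re.compile(
    r"^(?:(?!Step Dialog).)*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'"
)

versions = []
for line in log_lines:
    match = pattern.search(line)
    if match:
        versions.append(match.group(1))

print(versions)
# ['1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)', '1.11.11']
```

The template line and the `Response: OK` line are skipped, while the automated line and the user's comment both yield a version.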

There are a couple of ways to do this, but you're pretty close.

`.*(?<!Step Dialog.*)[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`
`^(?!.*Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

Chriszuma's pattern should work, too. Use whichever you like best. If performance is a consideration, you could benchmark the three patterns and see which is fastest. My feeling is that it'll be the one starting with `.*(?<!`, but I can't say for sure.

Edit: As ekhumoro points out, the Python regex engine requires fixed-length lookbehinds, so the first one won't work in Python. The second one should be fine, though.
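The sketch below illustrates both points with Python's standard `re` module: compiling the variable-length lookbehind raises `re.error`, while the whole-line lookahead compiles and skips the template line. It also times the two patterns that do compile, using `timeit`. The variable names and the repetition count are arbitrary choices for illustration, and the timings will vary by machine.

```python
import re
import timeit

sample = [
    "15:28:02.115 - INFO   - TestLib: Successfully retrieved build version: "
    "'1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)'",
    "02:24:50.655 - INFO   - gui: Step Dialog: For test results management "
    "purposes, specify the build in which the test is executed in the "
    "following format, build version: 'specify version here'",
]

# Shared tail of all three patterns discussed in the answers.
tail = r"[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'"

# 1. Variable-length lookbehind: rejected by Python's re module.
try:
    re.compile(r".*(?<!Step Dialog.*)" + tail)
    lookbehind_compiles = True
except re.error as exc:
    lookbehind_compiles = False
    print("rejected:", exc)

# 2. Whole-line negative lookahead, anchored at the start of the line.
lookahead = re.compile(r"^(?!.*Step Dialog).*" + tail)
# 3. Per-character negative lookahead from the first answer.
tempered = re.compile(r"^(?:(?!Step Dialog).)*" + tail)

results = [lookahead.search(line) for line in sample]
print(results[0].group(1))  # 1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)
print(results[1])           # None

# Rough timing of the two patterns that compile; numbers are machine-dependent.
timings = {}
for name, pat in (("lookahead", lookahead), ("tempered", tempered)):
    timings[name] = timeit.timeit(
        lambda: [pat.search(line) for line in sample], number=20000
    )
    print(name, round(timings[name], 4), "seconds")
```

One behavioral difference to keep in mind: the whole-line lookahead rejects a line that contains `Step Dialog` anywhere, whereas the per-character version only rejects it when `Step Dialog` appears before the build-version text.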
