
Inconsistency between sed and Python regular expressions

I apologize if this is published somewhere, but my cursory searching didn't find anything.

While doing some Python programming I noticed that the following command:

re.sub("a*((ab)*)b", r"\1", "aabb")

returns the empty string. But an equivalent command in sed:

echo "aabb" | sed "s/a*\(\(ab\)*\)b/\1/"

returns ab.

It makes sense to me that the "a*" at the beginning of the Python regex would match both a's, causing "(ab)*" to match zero times, but I have no idea how sed comes up with ab. Does anybody know what difference between the two regex engines causes this? I believe they both match stars greedily by default, but it occurred to me that sed might match from the right rather than the left. Any insight would be greatly appreciated.

Both Python and sed are greedy by default, but Python's regex engine always evaluates from left to right, even though it must eventually backtrack to a previous state if the branch being tried cannot continue to match. Sed regexes, on the contrary, are optimized before evaluation in order to prevent unnecessary backtracking, by rewriting the regex into a more deterministic form. Therefore the combined optional pattern "aab" is probably tested before the plain "a", because the most specific possible string is tried first.

The Python pattern matches the string "aabb" twice, as "aab" + "b" (each match is marked with "<>"):

>>> re.sub("a*((ab)*)b", r"<\1>", "aabb")
'<><>'

while sed matches the whole "aabb" in a single substitution:

$ echo "aabb" | sed "s/a*\(\(ab\)*\)b/<\1>/"
<ab>
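To see where each Python match starts and ends, the matches can be listed with re.finditer. This is a small inspection sketch; the variable names are illustrative and not part of the original commands.

```python
import re

pattern = re.compile(r"a*((ab)*)b")

# Collect (span, whole match, group 1) for every non-overlapping match.
matches = [(m.span(), m.group(0), m.group(1)) for m in pattern.finditer("aabb")]

for span, whole, group1 in matches:
    print(span, repr(whole), repr(group1))
```

Group 1 is empty in both matches, which is why the "<>" markers above enclose nothing.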

Python's regex backtracking algorithm is explained well in the Regex HOWTO, section "Repeating Things", in the two paragraphs introduced by the words "A step-by-step example...". IMO it does exactly what the regex docs describe: "As the target string is scanned, REs separated by '|' are tried from left to right."

Demonstration

The order of the alternatives in "(|a|aa)" versus "(aa|a|)" is respected by Python:

>>> re.sub("(?:|a|aa)((ab)*)b", r"<\1>", "aabb")
'<ab>'
>>> re.sub("(?:aa|a|)((ab)*)b", r"<\1>", "aabb")
'<><>'

but this order is ignored by sed, because sed optimizes regular expressions. The "aab" + "b" matching can be reproduced by removing the "a" option from the pattern:

$ echo "aabb" | sed "s/\(\|a\|aa\)\(\(ab\)*\)b/<\2>/g"
<ab>
$ echo "aabb" | sed "s/\(aa\|a\|\)\(\(ab\)*\)b/<\2>/g"
<ab>
$ echo "aabb" | sed "s/\(aa\|\)\(\(ab\)*\)b/<\2>/g"
<><>
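The same ordering effect can be seen in Python without spelling out the alternation. On this input, greedy "a*" tries the longest run of a's first, much like "(?:aa|a|)", while the lazy form "a*?" tries the shortest run first, much like "(?:|a|aa)". The following is a sketch for comparison only; the lazy quantifier does not appear in the original question.

```python
import re

text = "aabb"

# Greedy a* consumes both a's first, leaving nothing for (ab)* to capture.
greedy = re.sub(r"a*((ab)*)b", r"<\1>", text)

# Lazy a*? starts with zero a's and grows only when the rest fails to match.
lazy = re.sub(r"a*?((ab)*)b", r"<\1>", text)

print(greedy)
print(lazy)
```

With the lazy quantifier, Python arrives at the same capture that sed reports, although it gets there by trying alternatives in a different order rather than by searching for the longest match.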

Edit: I removed everything about DFA/NFA because I cannot prove it from the current texts.

Interesting puzzle you've constructed. From what I've read, the regexp engines of both Python and sed are based on Henry Spencer's regex library (as is Perl's), which relies on backtracking. (Unfortunately I can't find the article I'm basing this on.)

Anyway, this is not something that's supposed to be an implementation detail: Python's behavior goes against the POSIX standard, which requires REs to (a) match at the earliest possible point, and (b) match the longest possible string that starts at that point. (See man 7 regex (on Linux) for this and a whole lot more.)

To find the longest match, a backtracking ("NFA-type") regex engine must continue examining alternatives after it finds one match. So it's not surprising that the implementers cut corners. Python's behavior is non-conforming with respect to POSIX, since it fails to find the longest match. According to the sed manual page, sed doesn't always conform either, "for performance reasons". But it gets this case right.
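The leftmost-longest rule can be approximated in Python by brute force: for each start position, try the longest candidate substring first and accept the first one the pattern matches in full. This is only a sketch of rules (a) and (b) for the overall match; the helper name leftmost_longest is hypothetical, the search is quadratic in the input length, and POSIX's additional rules for subexpressions are not modeled.

```python
import re

def leftmost_longest(pattern, text):
    """Return the earliest-starting, longest match of pattern in text, or None."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        # Try the longest candidate end first, so the first hit is the longest.
        for end in range(len(text), start - 1, -1):
            m = compiled.fullmatch(text, start, end)
            if m is not None:
                return m
    return None

m = leftmost_longest(r"a*((ab)*)b", "aabb")
print(m.span(), repr(m.group(1)))
```

Forcing the match to cover all of "aabb" makes the backtracking engine give up one a from "a*", so group 1 captures ab, which agrees with the sed output.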

Incidentally, your commands are not fully equivalent: re.sub will perform a substitution as many times as possible, while sed's s/a/b/ will only perform it once. The sed version should have been:

echo "aabb" | sed "s/a*\(\(ab\)*\)b/\1/g"

This explains why we get the empty string in Python: the RE matches aab the first time and the remaining b the second time, removing each part (since it's all matched by a* and the final b of the regexp). You can see this with the following variant:

>>> re.sub("a*((ab)*)b", r"X\1Y", "aabb")
'XYXY'
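Going the other way, the Python call can be limited to a single substitution with the count argument of re.sub, which is closer to the original sed command without the g flag. A minimal sketch:

```python
import re

# count=1 stops re.sub after the first match, like sed's s/// without g.
once = re.sub(r"a*((ab)*)b", r"\1", "aabb", count=1)
print(repr(once))
```

Only the first match (aab) is replaced, so the trailing b survives. The result still differs from sed's ab, so the number of substitutions alone does not account for the discrepancy.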
