Can DFA regex engines handle atomic groups?

According to this page (and some others), DFA regex engines can deal with capturing groups rather well. I'm curious about atomic groups (or possessive quantifiers), as I recently used them a lot and can't imagine how this could be done.


I disagree with the first part of the answer:

A DFA does not need to deal with constructs like atomic grouping.... Atomic Grouping is a way to help the engine finish a match, that would otherwise cause endless backtracking

Atomic groups matter not only for the speed of NFA engines; they also make it possible to write simpler and less error-prone regexes. Let's say I needed to find all C-style multiline comments in a program. The exact regex would be something like:

  • start with the literal /*
  • eat anything of the following
    • any char except *
    • a * followed by anything but /
  • repeat this as much as possible
  • end with the literal */

This sounds a bit complicated, and the regex

/\* ( [^*] | \*[^/] )+ \*/

is complicated and wrong (it doesn't handle /* foo **/ correctly). Using a reluctant (lazy) quantifier is better

/\* .*? \*/

but it is also wrong, as it can eat the whole line

/* foo */ @#$!!**@#$ /* bar */

when a later sub-expression fails on the garbage and the engine backtracks. Putting the above in an atomic group solves the problem nicely:

(?> /\* .*? \*/ )

This always works (I hope) and is as fast as possible (for an NFA). So I wonder if a DFA engine could somehow handle it.
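The three patterns above can be compared in a short Python sketch. Two things in it are assumptions and not part of the question: a literal ; after the comment stands in for the "later sub-expression" that triggers the backtracking, and the atomic group is emulated with a lookahead plus backreference, (?=(...))\1, because Python's re module only accepts (?>...) from version 3.11 on.

```python
import re

# First attempt: "any char except *" or "a * followed by anything but /".
# re.VERBOSE lets the pattern keep the spaces used in the question.
naive = re.compile(r"/\* ( [^*] | \*[^/] )+ \*/", re.VERBOSE)

# Lazy version followed by a hypothetical later sub-expression (a literal ';').
lazy_then_semicolon = re.compile(r"/\*.*?\*/;")

# Atomic-group emulation: a lookahead never gives back what it matched,
# so (?=(X))\1 behaves like (?>X) in this pattern.
atomic_then_semicolon = re.compile(r"(?=(/\*.*?\*/))\1;")

line = "/* foo */ @#$!!**@#$ /* bar */;"

# The naive pattern cannot match a comment that ends in "**/".
naive_result = naive.search("/* foo **/")

# The lazy pattern stretches over the garbage until it finds "*/;".
lazy_result = lazy_then_semicolon.search(line).group(0)

# The atomic variant refuses to stretch and finds only the second comment.
atomic_result = atomic_then_semicolon.search(line).group(0)

print(naive_result)
print(lazy_result)
print(atomic_result)
```

With the lazy pattern the match spans the entire line, while the atomic variant gives up on the first comment (no ; follows it) and matches only the second comment together with its semicolon.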

A DFA does not need to deal with constructs like atomic grouping. A DFA is "text-directed", unlike an NFA, which is "regex-directed". In other words, atomic grouping is a way to help the engine finish a match that would otherwise cause endless backtracking, as the (NFA) engine tries every possible permutation to find a match at a position where no match is even possible.

Atomic grouping, simply put, throws away backtracking positions. Since a DFA does not backtrack (the text to be matched is checked against the regex, not the regex against the text as in an NFA; the DFA opens a branch for each decision), throwing away something that is not there is pointless.

I suggest Jeffrey E. F. Friedl's Mastering Regular Expressions (Google Books); he explains the general idea of a DFA:

DFA Engine: Text-Directed

Contrast the regex-directed NFA engine with an engine that, while scanning the string, keeps track of all matches “currently in the works.” In the tonight example, the moment the engine hits t, it adds a potential match to its list of those currently in progress:

[...]

Each subsequent character scanned updates the list of possible matches. After a few more characters are matched, the situation becomes

[...]

with two possible matches in the works (and one alternative, knight, ruled out). With the g that follows, only the third alternative remains viable. Once the h and t are scanned as well, the engine realizes it has a complete match and can return success.

I call this “text-directed” matching because each character scanned from the text controls the engine. As in the example, a partial match might be the start of any number of different, yet possible, matches. Matches that are no longer viable are pruned as subsequent characters are scanned. There are even situations where a “partial match in progress” is also a full match. If the regex were ⌈to(…)?⌋, for example, the parenthesized expression becomes optional, but it's still greedy, so it's always attempted. All the time that a partial match is in progress inside those parentheses, a full match (of 'to') is already confirmed and in reserve in case the longer matches don't pan out.

(Source: http://my.safaribooksonline.com/book/programming/regular-expressions/0596528124/regex-directed-versus-text-directed/i87)
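The bookkeeping described in the excerpt can be sketched in a few lines of Python. This is only an illustration of the "matches in progress" idea, not a real DFA construction: it assumes the regex in the tonight example is to(nite|knight|night), which the excerpt does not spell out, and it expands that regex by hand into three literal candidates. The function name scan is hypothetical.

```python
def scan(text, candidates):
    """Read text once, left to right, pruning candidates that stop fitting."""
    in_progress = list(candidates)  # every candidate starts out viable
    trace = []
    for pos, ch in enumerate(text):
        # Keep only the candidates whose character at this position agrees
        # with the text; nothing is ever revisited, so there is no backtracking.
        in_progress = [c for c in in_progress if pos < len(c) and c[pos] == ch]
        trace.append((ch, tuple(in_progress)))
    # A candidate is a complete match if it was consumed exactly.
    complete = [c for c in in_progress if len(c) == len(text)]
    return trace, complete


# Assumed expansion of to(nite|knight|night) into literal alternatives.
candidates = ["tonite", "toknight", "tonight"]
trace, complete = scan("tonight", candidates)

for ch, alive in trace:
    print(ch, alive)
print("complete:", complete)
```

Each character of the text is examined exactly once and the set of live candidates only shrinks, which is why there are no saved backtracking positions for an atomic group to discard.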

Concerning capturing groups and DFAs: as far as I was able to understand from your link, these approaches are not pure DFA engines but hybrids of DFA and NFA.
