
Is there a regular language to represent regular expressions?

Original post 2013-10-23 05:51:57 | tags: regex, context-free-grammar, regular-language

Specifically, I noticed that the language of regular expressions itself isn't regular. So I can't use a regular expression to parse a given regular expression. I need to use a parser, since the language of regular expressions is itself context-free.

Is there any way to represent regular expressions such that the resulting string can be parsed using a regular expression?

Note: My question isn't about whether there is a regexp that matches the current syntax of regexes, but whether there exists a "representation" of regular expressions as we know them today (maybe not as neat as the current one) that can be parsed using regular expressions. Also, could someone please remove the dup mark, since it isn't a dup. I'm asking something completely different. I already know that the current language of regular expressions isn't regular (that is how I started my original question).

2 answers

Depending on what you mean by "represent", the answer is "yes" or "no":

If you want a language that (homomorphically) maps 1:1 to the usual basic regular expression language, the answer is no, because a regular language cannot be isomorphic to a non-regular language, and the standard regular expression language is non-regular. This is because the syntax requires matching opening and closing parentheses nested to arbitrary depth.

If "represent" only means another method of specifying regular languages, the answer is yes, and right now I can think of at least three ways to achieve this:

The "dumbest" and easiest way is to define some surjective mapping f : ℕ -> RegEx from the natural numbers onto the set of all valid standard regular expressions. You can define the natural numbers using the regular expression 0|1[01]* , and the regular language denoted by (a string representing) the natural number n is the regular language denoted by f(n).
Of course, the meaning attached to a natural number would not be obvious to a human reader at all, so this "regular expression language" would be utterly useless.
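
The numbering idea above can be sketched as follows. This is only an illustration under stated assumptions: the regex syntax is reduced to the symbols a, b, |, *, ( and ), the mapping enumerates all strings in length-then-alphabet order and keeps the syntactically valid ones, and the names `is_regex`, `f` and `denoted` are hypothetical helpers, not part of any library.

```python
import re
from itertools import count, product

ALPHABET = "ab|*()"  # assumed toy alphabet for "standard" regexes

def is_regex(s: str) -> bool:
    """Recursive-descent check for: alt := cat ('|' cat)*, cat := rep+,
    rep := atom '*'*, atom := 'a' | 'b' | '(' alt ')'."""
    pos = 0

    def alt() -> bool:
        nonlocal pos
        if not cat():
            return False
        while pos < len(s) and s[pos] == "|":
            pos += 1
            if not cat():
                return False
        return True

    def cat() -> bool:
        if not rep():
            return False
        while pos < len(s) and s[pos] in "ab(":
            if not rep():
                return False
        return True

    def rep() -> bool:
        nonlocal pos
        if not atom():
            return False
        while pos < len(s) and s[pos] == "*":
            pos += 1
        return True

    def atom() -> bool:
        nonlocal pos
        if pos >= len(s):
            return False
        if s[pos] in "ab":
            pos += 1
            return True
        if s[pos] == "(":
            pos += 1
            if alt() and pos < len(s) and s[pos] == ")":
                pos += 1
                return True
        return False

    return alt() and pos == len(s)

def f(n: int) -> str:
    """Return the n-th valid regex (0-based) in length-then-alphabet order."""
    seen = 0
    for length in count(1):
        for chars in product(ALPHABET, repeat=length):
            candidate = "".join(chars)
            if is_regex(candidate):
                if seen == n:
                    return candidate
                seen += 1

def denoted(numeral: str) -> str:
    """Map a binary numeral (itself matched by a regex) to the regex it names."""
    if not re.fullmatch(r"0|1[01]*", numeral):
        raise ValueError("not a natural number in binary notation")
    return f(int(numeral, 2))
```

With this ordering, `denoted("0")` yields `a` and `denoted("100")` yields `a*`. Every valid regex of the toy syntax appears at some index, so the mapping is surjective, and the set of inputs (binary numerals) is regular.
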
As parentheses are the only non-regular part of simple regular expressions, the easiest human-interpretable method would be to extend the standard simple regular expression syntax to allow dangling parentheses and to define semantics for them.
The obvious choice would be to ignore non-matching opening parentheses and to interpret non-matching closing parentheses as matching the beginning of the regex. This essentially amounts to implicitly inserting as many opening parentheses at the beginning and as many closing parentheses at the end of the regex as necessary. Additionally, (* would have to be interpreted as repetition of the empty string. If I didn't miss anything, this definition should turn any string into a "regular expression" with a specified meaning, so .* defines this "regular expression language".
This variant even has the same abstract syntax as standard regular expressions.
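
The implicit insertion of parentheses described for this variant can be sketched like this. It is a minimal illustration, assuming that parentheses are never escaped and leaving aside the special treatment of (* mentioned above; `balance` is a hypothetical helper name.

```python
def balance(regex: str) -> str:
    """Add the parentheses that the dangling-parenthesis semantics implies."""
    depth = 0         # opening parentheses still waiting for a partner
    missing_open = 0  # closing parentheses with no partner to their left
    for ch in regex:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth > 0:
                depth -= 1
            else:
                missing_open += 1
    # Unmatched ')' pair with the beginning, unmatched '(' with the end.
    return "(" * missing_open + regex + ")" * depth
```

For example, `balance("a)b")` gives `(a)b` and `balance("(a")` gives `(a)`, while an already balanced input such as `(a|b)*` is returned unchanged.
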
Another variant would be to specify the NFA that recognizes the language directly, using a regular language, e.g.: ([a-z]+,([^,]|\\,|\\\\)+,[a-z]+\$?;)* .
The idea is that [a-z]+ is used as a label for states, and the expression is a list of transition triples (s, c, t) from source state s to target state t consuming character c, with a $ indicating accepting transitions (cf. note below). In c, backslashes are used to escape commas or backslashes. I assumed that you use the same alphabet as for standard regular expressions, but of course you can replace the middle component with any other regular language of symbols denoting characters of any alphabet you wish. The first source state mentioned is the (single) initial state. An empty expression defines the empty language.
Above, I wrote "accepting transition", not "accepting state", because that would make the regex above a bit more complex. You can interpret a triple containing a $ as two transitions, namely one transition consuming c from s to a new, unique state, and an ε-transition from that state to t. This should allow any NFA to be represented, by replacing each transition to an accepting state with a $ triple and each transition to a non-accepting state with a non-$ triple.
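
A transition list of this kind can be validated with a regular expression and then simulated directly. The sketch below is only an illustration: it assumes that the middle component of each triple is a single, possibly backslash-escaped character, and the names `parse_nfa` and `accepts` are hypothetical.

```python
import re

# One triple: source label, one (possibly escaped) symbol, target label,
# optional '$' marking an accepting transition, terminated by ';'.
TRIPLE = re.compile(r"([a-z]+),(\\[,\\]|[^,\\]),([a-z]+)(\$?);")
# The whole specification is just a sequence of such triples.
WHOLE = re.compile(r"(?:[a-z]+,(?:\\[,\\]|[^,\\]),[a-z]+\$?;)*")

def parse_nfa(spec: str):
    """Return a list of (source, symbol, target, is_accepting) tuples."""
    if not WHOLE.fullmatch(spec):
        raise ValueError("not a valid transition list")
    triples = []
    for m in TRIPLE.finditer(spec):
        src, sym, dst, mark = m.groups()
        # sym[-1] drops the escaping backslash, if there is one.
        triples.append((src, sym[-1], dst, mark == "$"))
    return triples

def accepts(spec: str, text: str) -> bool:
    """Simulate the NFA; accept if the last step can use a '$' transition."""
    triples = parse_nfa(spec)
    if not triples:
        return False  # the empty expression defines the empty language
    current = {triples[0][0]}  # first source state is the initial state
    accepting = False
    for ch in text:
        nxt = set()
        accepting = False
        for src, sym, dst, marked in triples:
            if src in current and sym == ch:
                nxt.add(dst)
                accepting = accepting or marked
        current = nxt
    return accepting
```

For instance, the specification `p,a,q$;q,b,q$;` describes the language of an a followed by any number of b's, so `accepts("p,a,q$;q,b,q$;", "abb")` is true and `accepts("p,a,q$;q,b,q$;", "ba")` is false. Parsing needs nothing beyond regular expressions because the format has no nesting.
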

One note that might make the "yes" part look more intuitive: assembly languages are regular, and those are even Turing-complete, so it would be unexpected if it weren't possible to specify "mere" regular languages using a regular language.

The answer is probably NO.

As you have pointed out, the set of all possible regular expressions is itself not a regular set. Any TRUE regular expression (not the extended kind) can be converted into a finite automaton (FA). If regular expressions could be represented in a form that regular expressions themselves can parse, then FAs could be parsed by regular expressions as well.

But that's not possible as far as I know. REs can be reduced to three basic operations (according to the Dragon Book):

concatenation: e.g. ab
alternation: e.g. a|b
Kleene closure: e.g. a*

The Kleene closure can match an unbounded number of characters, but it cannot know how many characters to match. Consider this case: you want to match 3 consecutive a's. Then the corresponding regular expression is /aaa/. But what if you want to match 4, 5, 6... a's? A parser with only one RE cannot know the exact number of a's, so it fails to give the right match for arbitrary expressions. However, an RE parser has to match infinitely many different forms of REs. As you stated, a regular expression cannot match all the possibilities.

Well, the only difference with an RE parser is that it does not need a tokenizer (probably that's why REs are used in lexical analysis). Every character in an RE is a token (excluding escaped characters). But to parse an RE, whatever it is converted to, one has to deal with NFAs, DFAs, trees... all equivalent structures that cannot be parsed by an RE itself.