简体   繁体   中英

Is there a regular language to represent regular expressions?

Specifically, I noticed that the language of regular expressions itself isn't regular. So, I can't use a regular expression to parse a given regular expression. I need to use a parser since the language of the regular expression itself is context free.

Is there any way regular expressions can be represented in a way that the resulting string can be parsed using a regular expression?

Note: My question isn't about whether there is a regexp to match the current syntax of regexes, but whether there exists a "representation" for regular expressions as we know it today (maybe not a neat as what we know them as today) that can be parsed using regular expressions. Also, please could someone remove the dup since it isn't a dup. I'm asking something completely different. I already know that the current language of regular expressions isn't regular (it is how I started my original question).

Depending on what you mean by "represent", the answer is "yes" or "no":

If you want a language that (homomorphically) maps 1:1 to the usual basic regular expression language, the answer is no, because a regular language cannot be isomorphic to a non-regular language, and the standard regular expression language is non-regular. This is because the syntax requires matching opening and closing parentheses of arbitrary depth.

If "represent" only means another method of specifying regular languages, the answer is yes, and right now I can think of at least three ways to achieve this:

  1. The "dumbest" and easiest way is to define some surjective mapping f : ℕ -> RegEx from the natural numbers onto the set of all valid standard regular expressions. You can define the natural numbers using the regular expression 0|1[01]* , and the regular language denoted by a (string representing the) natural number n is the regular language denoted by f(n) .

    Of course, the meaning attached to a natural number would not be obvious to a human reader at all, so this "regular expression language" would be utterly useless.

  2. As parentheses are the only non-regular part in simple regular expressions, the easiest human-interpretable method would be to extend the standard simple regular expression syntax to allow dangling parentheses and defining semantics for dangling parentheses.

    The obvious choice would be to ignore non-matching opening parentheses and interpreting non-matching closing parentheses as matching the beginning of the regex. This essentially amounts to implicitly inserting as many opening parentheses at the beginning and as many closing parentheses at the end of the regex as necessary. Additionally, (* would have to be interpreted as repetition of the empty string. If I didn't miss anything, this definition should turn any string into a "regular expression" with a specified meaning, so .* defines this "regular expression language".

    This variant even has the same abstract syntax as standard regular expressions.

  3. Another variant would be to specify the NFA that recognizes the language directly using a regular language, eg: ([az]+,([^,]|\\\\,|\\\\\\\\)+,[az]+\\$?;)* .

    The idea is that [az]+ is used as a label for states, and the expression is a list of transition triples (s, c, t) from source state s to target state t consuming character c , and a $ indicating accepting transitions (cf. note below). In c , backslashes are used to escape commas or backslashes - I assumed that you use the same alphabet for standard regular expressions, but of course you can replace the middle component with any other regular language of symbols denotating characters of any alphabet you wish. The first source state mentioned is the (single) initial state. An empty expression defines the empty language.

    Above, I wrote "accepting transition", not "accepting state" because that would make the regex above a bit more complex. You can interpret a triple containing a $ as two transitions, namely one transition consuming c from s to a new, unique state, and an ε-transition from that state to t . This should allow any NFA to be represented, by replacing each transition to an accepting state with a $ triple and each transition to a non-accepting state with a non- $ triple.

One note that might make the "yes" part look more intuitive: Assembly languages are regular, and those are even Turing-complete, so it would be unexpected if it wasn't possible to specify "mere" regular languages using a regular language.

The answer is probably NO.

As you have pointed out, set of all possible regular expressions itself is not a regular set. Any TRUE regular expression (not those extended) can be converted into finite automata (FA). If regular expression can be represented in a form that can be parsed by itself, then FA can be parsed by regular expression as well.

But that's not possible as far as I know. RE itself can be reduced into three basic operation(According to the Dragon Book):

  1. concatenation: eg ab
  2. alternation: eg a|b
  3. kleen closure: eg a*

The kleen closure can match infinite number of characters, but it cannot know how many characters to match. Just think such case: you want to match 3 consecutive a s. Then the corresponding regular expression is /aaa/ . But what if you want match 4, 5, 6... a s? Parser with only one RE cannot know the exact number of a s. So it fails to give the right matching to arbitrary expressions. However, the RE parser has to match infinite different forms of REs. According to your expression, a regular expression cannot match all the possibilities.

Well, the only difference of a RE parser is that it does not need a tokenizer.(probably that's why RE is used in lexical analysis) Every character in RE is a token (excluding those escape charcters). But to parse RE, whatever it is converted,one has to face up with NFA/DFA/TREE... all equivalent structures that cannot be parsed by RE itself.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM