
Regular expression formal documentation

Is there any formal documentation of how to implement your own regular expression library? What formal documentation, if any, did the makers of the existing regular expression libraries base their code on?

I have written (and abandoned) a JavaScript parser, including regular expression support. I based it on the ECMAScript definition, which by its own account models its regular expressions on Perl 5. This is ECMA-262; I used the 3rd edition, from December 1999. (There is a newer one by now; I don't know if it is as complete in its definition of regular expressions.)

Any good textbook on automata theory and/or compiler construction, e.g. Hopcroft and Ullman, covers regular expressions and their relation to finite-state automata, to which they can be compiled. So do several textbooks on natural language processing, where finite-state methods are commonly used, e.g. Jurafsky and Martin.

(There was even a course by Ullman himself on Coursera, but a new session is yet to be announced.)

As for the question of what documentation current RE libraries are based on: textbooks like the ones I cited, and existing implementations. The first RE implementation that I'm aware of is the one in Ken Thompson's version of QED, ca. 1967. Unfortunately, the tech report on the QED website cites very few references, and none related to RE/FA theory. I'm sure the ideas ultimately trace back to Kleene's theory of regular languages, which was developed in the 1950s.

Regular expressions are called regular because they describe regular languages, which are exactly the languages a finite state machine can recognize. Simply put, a possible implementation might use state machines, which are just tables. The regex parser creates a number of states and transitions for a regex; executing it steps through the states according to the transitions.

E.g. /ab+/ generates something like:

state \ next char:  a       b       $       *
[initial state]     goto 1  fail    fail    fail
1                   fail    goto 2  fail    fail
2                   fail    goto 2  match   fail

(where $ is the end of the string, * is any other character)
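
To make the table concrete, here is a minimal Python sketch that executes it directly. The names `TABLE`, `classify` and `run` are made up for this illustration, and the sketch assumes, as the table does, that the pattern must match the whole string rather than be searched for inside it.

```python
# Transition table for /ab+/, copied from the table above.
# Rows are states (0 is the initial state); columns are input classes.
FAIL = "fail"
MATCH = "match"

TABLE = {
    0: {"a": 1,    "b": FAIL, "$": FAIL,  "*": FAIL},
    1: {"a": FAIL, "b": 2,    "$": FAIL,  "*": FAIL},
    2: {"a": FAIL, "b": 2,    "$": MATCH, "*": FAIL},
}


def classify(ch):
    """Map a character to a table column: 'a', 'b', or '*' for anything else."""
    return ch if ch in ("a", "b") else "*"


def run(text):
    """Return True if the whole of `text` matches /ab+/."""
    state = 0
    for ch in text:
        action = TABLE[state][classify(ch)]
        if action == FAIL:
            return False
        state = action  # a number in the table means "goto that state"
    # End of string: look up the '$' column of the current state.
    return TABLE[state]["$"] == MATCH


print(run("abbb"))  # the table accepts one 'a' followed by one or more 'b'
print(run("a"))     # rejected: state 1 has no match on end of string
```

Each input character costs one table lookup, so the running time is proportional to the length of the input no matter what the pattern looks like.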

While searching for information on regular expressions, I found an interesting and, as far as I can see, related question: Why can regular expressions have an exponential running time?

The accepted answer suggests, based on the linked articles, that common RegExp implementations (including the one used in Perl) are a "bit" slower, and that there is a faster and simpler algorithm, used by many good old Unix tools like grep.

This link leads directly to the mentioned article: Regular Expression Matching Can Be Simple And Fast (part 2, part 3)

If that is still accurate, you should consider using this algorithm rather than Perl's.
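
The core idea of that faster algorithm, as I understand it, is to simulate the nondeterministic automaton by tracking the set of all states it could be in, instead of trying one path at a time and backtracking. Below is a rough Python sketch of that idea only, not the article's code. It skips regex parsing entirely and builds the automaton by hand for one pattern family: n optional a's followed by n required a's, which is the kind of pattern that makes backtracking matchers blow up. The function names are invented for this example.

```python
def build_nfa(n):
    """Hand-built NFA for n optional 'a's followed by n required 'a's.

    States are numbered 0..2n; state 2n is the accepting state.
    """
    char_edges = {}  # (state, char) -> set of next states
    eps_edges = {}   # state -> set of states reachable without consuming input
    for i in range(2 * n):
        char_edges[(i, "a")] = {i + 1}
        if i < n:
            eps_edges[i] = {i + 1}  # an optional 'a' may be skipped
    return char_edges, eps_edges, 2 * n


def eps_closure(states, eps_edges):
    """Add every state reachable from `states` through epsilon edges."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in eps_edges.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_full_match(nfa, text):
    """Return True if the NFA accepts the whole of `text`."""
    char_edges, eps_edges, accept = nfa
    current = eps_closure({0}, eps_edges)
    for ch in text:
        moved = set()
        for s in current:
            moved |= char_edges.get((s, ch), set())
        current = eps_closure(moved, eps_edges)
        if not current:  # no state left alive, so no match is possible
            return False
    return accept in current


# 25 optional and 25 required 'a's against a string of 25 'a's.
print(nfa_full_match(build_nfa(25), "a" * 25))
```

The set of live states can never be larger than the number of states in the automaton, so the work per input character is bounded by the size of the pattern. That bound is what rules out the exponential running time.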
