EBNF / parboiled: how to translate regexp into PEG?

This question is specific to the parboiled parser framework, but it also applies to BNF/PEG in general.

Let's say I have the fairly simple regular expression

^\\s*([A-Za-z_][A-Za-z_0-9]*)\\s*=\\s*(\\S+)\\s*$

which corresponds to the pseudo-EBNF

<line>               ::= <ws>? <identifier> <ws>? '=' <ws>? <nonwhitespace> <ws>?
<ws>                 ::= (' ' | '\t' | {other whitespace characters})+
<identifier>         ::= <identifier-head> <identifier-tail>
<identifier-head>    ::= <letter> | '_'    
<identifier-tail>    ::= (<letter> | <digit> | '_')*
<letter>             ::= ('A'..'Z') | ('a'..'z')
<digit>              ::= '0'..'9'
<nonwhitespace>      ::= ___________
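
For reference, here is a minimal sketch of the regular expression the grammar is meant to mirror, exercised with `java.util.regex`. The class name `AssignmentRegexDemo` and the helper `parse` are made up for illustration; the doubled backslashes in the pattern are ordinary Java string escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AssignmentRegexDemo {
    // Same pattern as in the question, written as a Java string literal.
    static final Pattern LINE =
        Pattern.compile("^\\s*([A-Za-z_][A-Za-z_0-9]*)\\s*=\\s*(\\S+)\\s*$");

    /** Returns {identifier, value} if the whole line matches, or null otherwise. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

Calling `AssignmentRegexDemo.parse("  foo_1 = bar  ")` yields the pair `foo_1` / `bar`, while a line such as `a = b c` is rejected because `\S+` cannot span the inner space.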

How would you define nonwhitespace (one or more characters that aren't whitespace) in EBNF?

For those of you familiar with the Java parboiled library, how could you implement a rule that defines nonwhitespace?

You are stuck with the conventions of your lexer generator for specifying character ranges and operations on character ranges.

Many lexer generators accept hex values (something like 0x) to represent characters, so you might write:

 '0'..'9'
 0x30..0x39

for digits.

For nonwhitespace, you need to know which character set you are using. For 7-bit ASCII, nonwhitespace is conceptually all the printable characters:

0x21..0x7E

For ISO8859-1:

( 0x21..0x7E | 0x80..0xFF )
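
As a rough illustration, the two ranges above could be written as plain Java predicates. This is only a sketch: the class and method names are hypothetical, and it takes the ranges exactly as given, including treating everything from 0x80 to 0xFF as nonwhitespace.

```java
class CharRanges {
    /** 7-bit ASCII printable characters: 0x21..0x7E. */
    static boolean isAsciiNonWhitespace(char c) {
        return c >= 0x21 && c <= 0x7E;
    }

    /** ISO8859-1 variant: the ASCII range plus 0x80..0xFF. */
    static boolean isLatin1NonWhitespace(char c) {
        return isAsciiNonWhitespace(c) || (c >= 0x80 && c <= 0xFF);
    }
}
```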

You can decide for yourself whether the character codes from 0x80 upward are spaces or not (is non-breaking space a space?). You also get to decide about the status of the control characters 0x0..0x1F. Is tab (0x9) a whitespace character? How about CR 0xD and LF 0xA? How about the ETB control character?
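
To show that these really are choices, here is a small sketch comparing two predicates that ship with the JDK, `Character.isWhitespace` and `Character.isSpaceChar`, which disagree on exactly these borderline characters. The wrapper class and `classify` method are invented for the example.

```java
class WhitespaceChoices {
    /** Reports how the two JDK predicates classify a single character. */
    static String classify(char c) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                (int) c, Character.isWhitespace(c), Character.isSpaceChar(c));
    }
}
```

Tab counts as whitespace for `isWhitespace` but not for `isSpaceChar`; non-breaking space (0xA0) is the other way around; ETB (0x17) is whitespace for neither. Whichever definition you pick, the nonwhitespace rule has to be its exact complement.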

Unicode is harder, because it's a huge set, and your list gets big and messy. C'est la vie. Our DMS Software Reengineering Toolkit is used to build parsers for a wide variety of languages, and has to support lexers for ASCII, ISO8859-z for lots of z's, and Unicode. Rather than write complicated "additive" regular expression ranges, DMS allows subtractive regular expressions, and so we can write:

 <UniCodeLegalCharacters>-<UniCodeWhiteSpace>

which is much easier to understand and easier to get right on the first try.

In EBNF I would simply define nonwhitespace as any character that isn't whitespace:

nonwhitespace ::= anycharacter - whitespace

This requires that you have an 'anycharacter' literal that defines the entire range of possible symbols, and a clear definition of which characters are whitespace.
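
The subtraction can be sketched in Java with `java.util.BitSet`, where `andNot` plays the role of the `-` operator. The sketch assumes `anycharacter` is the 256 ISO8859-1 codes and picks one arbitrary whitespace set (space, tab, LF, CR, VT, FF); both are assumptions for illustration, not part of any grammar tool.

```java
import java.util.BitSet;

class SubtractiveCharSet {
    /** "anycharacter": here assumed to be all 256 ISO8859-1 codes. */
    static BitSet anyCharacter() {
        BitSet all = new BitSet(256);
        all.set(0, 256); // sets bits 0..255
        return all;
    }

    /** One possible whitespace set; which codes belong here is a design choice. */
    static BitSet whitespace() {
        BitSet ws = new BitSet(256);
        for (char c : new char[] { ' ', '\t', '\n', '\r', 0x0B, 0x0C }) {
            ws.set(c);
        }
        return ws;
    }

    /** nonwhitespace ::= anycharacter - whitespace */
    static BitSet nonWhitespace() {
        BitSet result = anyCharacter();
        result.andNot(whitespace()); // set difference
        return result;
    }
}
```

Changing the whitespace definition then changes nonwhitespace automatically, which is the main appeal of the subtractive form.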

In Parboiled you can do this using the TestNot and ANY rules. For example, a single nonwhitespace character would be defined as any character that doesn't match the WhiteSpace() rule (wrap the sequence in OneOrMore to match a run of them):

Sequence( TestNot(WhiteSpace()) , ANY )
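
To make the semantics concrete without pulling in the library, here is a hand-rolled sketch in plain Java of what that rule does: a negative lookahead that consumes nothing, followed by "any one character", repeated one or more times. `MiniPeg` and its methods are hypothetical stand-ins, not parboiled's API, and `Character.isWhitespace` is an assumed definition for the WhiteSpace() rule.

```java
class MiniPeg {
    private final String input;
    private int pos = 0;

    MiniPeg(String input) { this.input = input; }

    // Stand-in for WhiteSpace(): matches one whitespace char, advancing on success.
    private boolean whiteSpace() {
        if (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
            return true;
        }
        return false;
    }

    // Negative lookahead: succeeds only if whiteSpace() fails; never consumes input.
    private boolean testNotWhiteSpace() {
        int saved = pos;
        boolean matched = whiteSpace();
        pos = saved;
        return !matched;
    }

    // ANY: consumes one character unless at end of input.
    private boolean any() {
        if (pos < input.length()) {
            pos++;
            return true;
        }
        return false;
    }

    // Mirrors Sequence(TestNot(WhiteSpace()), ANY): exactly one nonwhitespace char.
    private boolean nonWhitespaceChar() {
        return testNotWhiteSpace() && any();
    }

    /** One or more nonwhitespace characters; returns the matched text or null. */
    String nonWhitespace() {
        int start = pos;
        while (nonWhitespaceChar()) {
            // keep consuming until the lookahead or ANY fails
        }
        return pos > start ? input.substring(start, pos) : null;
    }

    int position() { return pos; }
}
```

On the input `abc def`, `nonWhitespace()` consumes `abc` and stops at the space; on input that starts with whitespace, or on empty input, it fails without consuming anything.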
