Simple parser, but not a calculator

I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic-like" things.

I have a very simple DSL, for example:

ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}

Where the <TYPE> part is optional and valid only for some types.

Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:

  1. How do I look for tokens that are longer than 1 char?
  2. How do I break up the text into its different parts? For example, after a TYPE I can have whitespace, or a <, or whitespace followed by a <. How do I address that?

Short answer

All your questions boil down to the fact that you are not tokenizing your string before parsing it.

Long answer

The process of parsing is actually split into two distinct parts: lexing and parsing.

Lexing

What seems to be missing in the way you think about parsing is called tokenizing, or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking "How do I break up the text into its different parts?"

You can do it yourself by checking your string against a list of regular expressions using re, or you can use a well-known library such as PLY. That said, if you are using Python 3, I am biased toward a lexing/parsing library that I wrote, ComPyl.

So, proceeding with ComPyl, the syntax you are looking for seems to be the following:

from compyl.lexer import Lexer

rules = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'), # Will match your <TYPE> token with inner whitespaces
    (r'{', 'L_BRACKET'),
    (r'}', 'R_BRACKET'),
]

lexer = Lexer(rules=rules, line_rule='\n')
# See ComPyl doc to figure how to proceed from here

Notice that the first rule, (r'\s+', None), is actually what solves your issue about whitespace. It basically tells the lexer to match any whitespace characters and ignore them. Of course, if you do not want to use a lexing tool, you can simply add a similar rule in your own re implementation, as in the sketch below.
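
If you do roll your own with re, a minimal hand-rolled tokenizer might look like this sketch. The token names mirror the ComPyl rules above; the SEMICOLON rule and the helper names (TOKEN_SPEC, tokenize) are my own additions for illustration, since your DSL terminates members with a ;.

import re

# Token definitions mirroring the ComPyl rules above; SEMICOLON is an assumption
# added because the DSL in the question ends members with ';'.
TOKEN_SPEC = [
    ('SKIP',      r'\s+'),        # whitespace: matched, then discarded
    ('TYPE',      r'< *\w+ *>'),  # <TYPE>, allowing inner whitespace
    ('ID',        r'\w+'),
    ('L_BRACKET', r'\{'),
    ('R_BRACKET', r'\}'),
    ('SEMICOLON', r';'),
]
TOKEN_RE = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC))

def tokenize(text):
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f'Unexpected character {text[pos]!r} at position {pos}')
        pos = match.end()
        if match.lastgroup != 'SKIP':  # drop whitespace, keep every other token
            yield (match.lastgroup, match.group())

print(list(tokenize('ELEMENT TYPE<TYPE> elemName { TYPE<TYPE> memberName; }')))

Running it on your example prints a flat list of (name, value) pairs, which is the token stream the parsing step will consume.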

Parsing

You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there exist plenty of tools that can do it for you (the PLY and ComPyl libraries offer LR(1) parsers, which are more powerful but harder to hand-write; see the difference between LL(1) and LR(1) here).

Simply notice that, now that you know how to tokenize your string, the issue of "How do I look for tokens that are longer than 1 char?" has been solved. You are now parsing not a stream of characters, but a stream of tokens that encapsulate the matched words.
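
To make that concrete, here is a minimal recursive descent (LL(1)) sketch that consumes such a stream of (name, value) token tuples. The Parser class and the exact grammar it accepts are assumptions I am making from the DSL in your question, not part of any library:

# A minimal LL(1) recursive descent sketch over (name, value) token tuples.
# The grammar it accepts is an assumption based on the DSL in the question.
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        # Name of the next token, or None at end of input: the single token of lookahead
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f'Expected {kind}, got {self.peek()}')
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def parse_element(self):
        # ELEMENT TYPE [<TYPE>] elemName { member* }
        if self.expect('ID') != 'ELEMENT':
            raise SyntaxError('Expected the ELEMENT keyword')
        elem_type = self.expect('ID')
        type_arg = self.expect('TYPE') if self.peek() == 'TYPE' else None  # optional <TYPE>
        name = self.expect('ID')
        self.expect('L_BRACKET')
        members = []
        while self.peek() != 'R_BRACKET':
            members.append(self.parse_member())
        self.expect('R_BRACKET')
        return {'type': elem_type, 'type_arg': type_arg, 'name': name, 'members': members}

    def parse_member(self):
        # TYPE [<TYPE>] memberName ;
        member_type = self.expect('ID')
        type_arg = self.expect('TYPE') if self.peek() == 'TYPE' else None
        name = self.expect('ID')
        self.expect('SEMICOLON')
        return {'type': member_type, 'type_arg': type_arg, 'name': name}


tokens = [('ID', 'ELEMENT'), ('ID', 'TYPE'), ('TYPE', '<TYPE>'), ('ID', 'elemName'),
          ('L_BRACKET', '{'), ('ID', 'TYPE'), ('TYPE', '<TYPE>'), ('ID', 'memberName'),
          ('SEMICOLON', ';'), ('R_BRACKET', '}')]
print(Parser(tokens).parse_element())

The single token of lookahead in peek() is what makes this LL(1): the parser only ever inspects the next token's name before deciding which rule to apply.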

Olivier's answer regarding lexing/tokenizing and then parsing is helpful.

However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks, and there is good documentation to help.

An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser. It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
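
As a rough illustration of that style, here is a minimal parsy sketch for the DSL in your question. It assumes <TYPE> is simply optional after every type name and that whitespace may appear between any two tokens; treat it as a starting point rather than a complete grammar:

from parsy import generate, regex, string

whitespace = regex(r'\s*')
lexeme = lambda p: p << whitespace                # each token parser eats trailing whitespace

ident = lexeme(regex(r'\w+'))
type_arg = (lexeme(string('<')) >> ident << lexeme(string('>'))).optional()  # optional <TYPE>

@generate
def member():
    # TYPE [<TYPE>] memberName ;
    member_type = yield ident
    targ = yield type_arg
    name = yield ident
    yield lexeme(string(';'))
    return {'type': member_type, 'type_arg': targ, 'name': name}

@generate
def element():
    # ELEMENT TYPE [<TYPE>] elemName { member* }
    yield lexeme(string('ELEMENT'))
    elem_type = yield ident
    targ = yield type_arg
    name = yield ident
    yield lexeme(string('{'))
    members = yield member.many()
    yield lexeme(string('}'))
    return {'type': elem_type, 'type_arg': targ, 'name': name, 'members': members}

print(element.parse('ELEMENT TYPE<TYPE> elemName {\n    TYPE<TYPE> memberName;\n}\n'))

The lexeme helper plays the same role as in the linked proto-file example: every token parser consumes whatever whitespace follows it, which is what lets both TYPE<TYPE> and TYPE <TYPE> parse without a separate tokenizing step.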

You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.
