
How would you parse indentation (python style)?

Original post: 2008-12-10 16:17:25 · Tags: python, parsing, indentation, lexer

How would you define your parser and lexer rules to parse a language that uses indentation to define scope?

I have already googled and found a clever approach: generating INDENT and DEDENT tokens in the lexer.

I will go deeper into this problem and post an answer if I come up with something interesting, but I would like to see other approaches to the problem.

EDIT: As Charlie pointed out, there is already another thread that is very similar, if not the same. Should my post be deleted?

2 Answers

This is kind of hypothetical, as it would depend on what technology you have for your lexer and parser, but the easiest way would seem to be to have BEGINBLOCK and ENDBLOCK tokens analogous to braces in C. Using the "off-side rule", your lexer needs to keep track of a stack of indentation levels. When the indentation level increases, push it and emit a BEGINBLOCK for the parser; when the indentation level decreases, pop levels off the stack and emit one ENDBLOCK for each level popped.
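The stack idea above can be sketched in a few lines of Python. This is only a minimal illustration, not tied to any lexer generator: the function name `tokenize`, the `LINE` token kind, and the assumption that indentation consists of spaces only are all made up for the example.

```python
def tokenize(source):
    """Yield (kind, value) pairs; BEGINBLOCK/ENDBLOCK mark indentation changes.

    Assumption for this sketch: indentation uses spaces only, and each
    non-blank line becomes a single LINE token.
    """
    levels = [0]  # stack of indentation widths that are currently open
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines do not affect block structure
        indent = len(line) - len(stripped)
        if indent > levels[-1]:
            # deeper than the current block: open a new one
            levels.append(indent)
            yield ("BEGINBLOCK", None)
        else:
            # shallower: close one block per level popped
            while indent < levels[-1]:
                levels.pop()
                yield ("ENDBLOCK", None)
            if indent != levels[-1]:
                raise ValueError(f"line {lineno}: dedent matches no open block")
        yield ("LINE", stripped)
    # close any blocks still open at end of input
    while len(levels) > 1:
        levels.pop()
        yield ("ENDBLOCK", None)
```

Note that a single line can produce several ENDBLOCK tokens when it closes more than one level at once, which is why the levels are kept on a stack rather than in a single counter.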

Here's another discussion of this on SO, btw.

You can also track in the lexer how many indent characters precede the first token of each line and pass that count to the parser. The most interesting part is passing it to the parser correctly :) If your parser uses lookahead (by which I mean the parser may request a variable number of tokens before it actually matches even one), then passing the value through a single global variable seems to be a very bad idea, because the lexer can slip onto the next line and change the indent counter while the parser is still trying to parse the previous line. Globals are evil in many other cases too ;) Marking the first 'real' token of each line with the indent counter in some way is more reasonable.

I can't give you an exact example (I don't even know which parser and lexer generators you are going to use, if any...), but something like storing the data on the first token of each line (which could be uncomfortable if you can't easily get at that token from the parser) or saving custom data (a map that links tokens to indents, or an array with source line numbers as indices and indent values as elements) seems to be enough. One downside of this approach is the additional complexity in the parser, which will need to distinguish between indent values and change its behavior based on them. Something like LOOKAHEAD({ yourConditionInJava }) for JavaCC may work here, but it is NOT a very good idea. The many additional tokens in your approach seem to be a less evil thing to use :)
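To make the "store the indent on the first token of the line" idea concrete, here is a rough Python sketch. The `Token` class, the `tokenize_with_indent` function, and the whitespace-splitting "lexer" are all hypothetical and stand in for whatever your real lexer produces; spaces-only indentation is assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    line: int
    # Indentation width; set only on the first token of a line, None elsewhere.
    indent: Optional[int] = None


def tokenize_with_indent(source):
    """Split each line on whitespace and tag its first token with the indent.

    Assumption for this sketch: indentation uses spaces only.
    """
    tokens = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        words = line.split()
        if not words:
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip(" "))
        for i, word in enumerate(words):
            tokens.append(Token(word, lineno, indent if i == 0 else None))
    return tokens
```

Because the indent travels with the token itself, it does not matter how far ahead the lexer has read: the parser always sees the value that belongs to the line it is currently working on.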

As another alternative, I would suggest mixing these two approaches. You could generate additional tokens only when the indent counter changes its value on the next line, like artificial BEGIN and END tokens. This way you can lower the number of 'artificial' tokens in the stream fed from the lexer into the parser. Only your parser grammar needs to be adjusted to understand the additional tokens...
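On the parser side, the adjustment can be small. The following is a hand-written recursive-descent sketch in Python, under the assumption that the lexer delivers a list of `(kind, value)` pairs with the kinds `LINE`, `BEGIN` and `END`; the function name `parse_block` and the `[text, children]` tree shape are invented for illustration.

```python
def parse_block(tokens, pos=0, nested=False):
    """Parse statements until END (when nested) or until the input ends.

    Returns (tree, next_pos), where tree is a list of [text, children] nodes.
    """
    tree = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == "LINE":
            tree.append([value, []])
            pos += 1
        elif kind == "BEGIN":
            if not tree:
                raise ValueError("block without a header line")
            # the indented block becomes the children of the preceding line
            children, pos = parse_block(tokens, pos + 1, nested=True)
            tree[-1][1] = children
        elif kind == "END":
            if not nested:
                raise ValueError("unmatched END")
            return tree, pos + 1
        else:
            raise ValueError(f"unknown token kind {kind!r}")
    if nested:
        raise ValueError("missing END")
    return tree, pos
```

For example, the token list `LINE "if x:"`, `BEGIN`, `LINE "y = 1"`, `END`, `LINE "done"` yields a tree in which `y = 1` is a child of `if x:` and `done` is a sibling of it. The parser never looks at whitespace; it treats BEGIN and END exactly as it would treat braces.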

I haven't tried this (I have no real experience with parsing such languages); I'm just sharing my thoughts about possible solutions. Checking already-built parsers for this kind of language could be of great value to you. Open source is your friend ;)