
Writing a lexer for a context-sensitive markup language that has recursive structures such as nested lists

I'm working on a reStructuredText transpiler in Rust, and I need some advice on how lexing should be structured for languages that have recursive structures. For example, lists within lists are possible in rST:

* This is a list item

  * This is a sub list item

* And here we are at the preceding indentation level again.

The default docutils.parsers.rst took the approach of scanning the input one line at a time:

The reStructuredText parser is implemented as a state machine, examining its input one line at a time.

The state machine mentioned basically operates on a set of states of the form (regex, match_method, next_state). It tries to match the current line against the regex of the current state and, if the match succeeds, runs match_method while transitioning to next_state, doing this until it runs out of lines to scan.
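In Rust terms, that table-driven approach could be sketched roughly as follows. All names here are hypothetical, and where docutils uses real regexes this sketch only uses simple prefix checks to stay self-contained:

```rust
// A toy line-based state machine in the spirit of docutils:
// each state examines one line and decides the next state.

#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Body,
    BulletList,
}

// Consume one line in the current state; return the state to continue in.
fn step(state: State, line: &str) -> State {
    match state {
        State::Body => {
            if line.trim_start().starts_with("* ") {
                // "match_method" would open a bullet-list node here
                State::BulletList
            } else {
                State::Body
            }
        }
        State::BulletList => {
            if line.trim().is_empty() || line.trim_start().starts_with("* ") {
                // blank line or another item: the list may continue
                State::BulletList
            } else {
                // a dedented non-list line closes the list
                State::Body
            }
        }
    }
}

fn main() {
    let src = ["a paragraph", "* item", "  * sub item", "back to body"];
    let mut state = State::Body;
    for line in src {
        state = step(state, line);
        println!("{line:?} -> {state:?}");
    }
}
```

The real docutils tables also carry the match method alongside the pattern; this sketch inlines the action into each state's match arm instead.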

My question then is: is this the best approach to scanning a language such as rST? My approach thus far has been to create a Chars iterator over the source and consume it while trying to match structures at the current Unicode scalar. This works to some extent when all I'm doing is scanning inline content, but I've now come to the realization that handling recursive body-level structures like nested lists is going to be a pain. It feels like I'm going to need a whole bunch of states with duplicate regexes and related methods in many states, just for matching against indentation before newlines and the like.

Would it be better to simply have an iterator over the lines of the source and match on a per-line basis, so that if a line such as

    * this is an indented list item

is encountered in State::Body, I simply transition to a state such as State::BulletList and start lexing lines based on the rules specified there? The above line could then be lexed, for example, as the sequence

TokenType::Indent, TokenType::Bullet, TokenType::BodyText
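That per-line tokenization could be sketched like this (the token variants and their payloads are my own assumptions, extending the names from the question):

```rust
// Hypothetical token model for per-line lexing of a bullet item.
#[derive(Debug, PartialEq)]
enum TokenType {
    Indent(usize),    // width of the leading whitespace
    Bullet(char),     // the bullet character, e.g. '*'
    BodyText(String), // the rest of the line
}

fn lex_line(line: &str) -> Vec<TokenType> {
    let mut tokens = Vec::new();
    let rest = line.trim_start();
    let indent = line.len() - rest.len();
    if indent > 0 {
        tokens.push(TokenType::Indent(indent));
    }
    if let Some(body) = rest.strip_prefix("* ") {
        tokens.push(TokenType::Bullet('*'));
        tokens.push(TokenType::BodyText(body.to_string()));
    } else if !rest.is_empty() {
        tokens.push(TokenType::BodyText(rest.to_string()));
    }
    tokens
}

fn main() {
    println!("{:?}", lex_line("    * this is an indented list item"));
}
```

Recording the indent width in the token (rather than a bare Indent marker) is one way to let the parser later decide whether a line opens a sub-list or returns to an enclosing one.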

Any thoughts on this?

I don't know much about rST. But you say it has "recursive" structures. If that's the case, you can't fully lex it as a recursive structure using just state machines or regexes or even lexer generators.

But this is the wrong way to think about it. The lexer's job is to identify the atoms of the language. A parser's job is to recognize structure, especially if it is recursive (yes, parsers often build trees recording the recursive structures they found). So build the lexer ignoring context if you can, and use a parser to pick up the recursive structures if you need them. You can read more about the distinction in my SO answer about Parsers vs. Lexers: https://stackoverflow.com/a/2852716/120163

If you insist on doing all of this in the lexer, you'll need to augment it with a pushdown stack to track the recursive structures. Then what you are building is a sloppy parser disguised as a lexer. (You will probably still want a real parser to process the output of this "lexer".)

Having a pushdown stack is actually useful if the language has different atoms in different contexts, especially if the contexts nest; in this case what you want is a mode stack that you change as the lexer encounters tokens that indicate a switch from one mode to another. A really useful extension of this idea is to have mode changes select what amounts to different lexers, each of which produces lexemes unique to that mode.
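A minimal sketch of the mode-stack idea, using rST's "::" literal-block introducer as the mode switch (the Mode names and the dedent rule here are simplifying assumptions, not the full rST rules):

```rust
// A lexer mode stack: the mode on top of the stack decides which
// rules apply, and certain tokens push or pop modes.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    Body,
    LiteralBlock,
}

struct Lexer {
    modes: Vec<Mode>,
}

impl Lexer {
    fn new() -> Self {
        Lexer { modes: vec![Mode::Body] }
    }

    fn current(&self) -> Mode {
        *self.modes.last().expect("mode stack is never empty")
    }

    fn feed(&mut self, line: &str) {
        match self.current() {
            Mode::Body if line.ends_with("::") => {
                // "::" introduces a literal block in rST
                self.modes.push(Mode::LiteralBlock);
            }
            Mode::LiteralBlock if !line.is_empty() && !line.starts_with(' ') => {
                // a dedented line ends the literal block
                self.modes.pop();
            }
            _ => {} // in a full lexer, each mode would emit its own tokens here
        }
    }
}

fn main() {
    let mut lx = Lexer::new();
    for line in ["Example::", "  literal text", "back to body"] {
        lx.feed(line);
        println!("{line:?} -> {:?}", lx.current());
    }
}
```

Because modes live on a stack rather than in a single variable, nesting contexts (a list inside a block quote inside a list) pop back out in the right order.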

As an example, you might do this to lex a language that contains embedded SQL. We build parsers for JavaScript; our lexer uses a pushdown stack to process the content of regexp literals and track nesting of {...}, [...] and (...). (This arguably has a downside: it rejects versions of JQuery.js that contain malformed regexes [yes, they exist]. JavaScript doesn't care if you define a bad regex literal and never use it, but that seems pretty pointless.)

A special case of the stack occurs if you only have to track single "("... ")" pairs or the equivalent. In this case you can use a counter to record how many "pushes" or "pops" you would have done on a real stack. If you have two or more pairs of tokens like this, counters don't work.
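The counter special case can be sketched in a few lines; because there is only one kind of "push" ('(') and one kind of "pop" (')'), a single integer stands in for the whole stack:

```rust
// Depth counter replacing a stack for a single delimiter pair.
fn balanced_parens(input: &str) -> bool {
    let mut depth: i64 = 0;
    for c in input.chars() {
        match c {
            '(' => depth += 1, // "push"
            ')' => {
                depth -= 1; // "pop"
                if depth < 0 {
                    return false; // ')' with no matching '('
                }
            }
            _ => {}
        }
    }
    depth == 0 // every '(' was closed
}

fn main() {
    println!("{}", balanced_parens("(a (b) c)"));
    println!("{}", balanced_parens("(a))"));
}
```

With two or more pair kinds the counter fails exactly as the answer says: it would accept an interleaving like "([)]" that a real stack, remembering which opener is on top, correctly rejects.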

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

粤ICP备18138465号  © 2020-2024 STACKOOM.COM