
Making a lexical analyzer WITHOUT manually walking / checking

I'm making my own programming language and I'm working on the lexer right now. My current approach is to manually walk through the code, check for valid keywords, and then append a Token object to a tokens array. But that leaves me with a massive if/else statement that's not only ugly but slow too. I'm struggling to find any resources about this online, and I'm trying to find out if there's a better way to do it. Some regex pattern or something?

Here's the code:

class Token:
  def __init__(self, type, value):
    self.type = type
    self.value = value

  def __str__(self):
    return f'Token({self.type}, {self.value})'

  def __repr__(self):
    return self.__str__()


def lex(code):
  tokens = []

  for index in range(len(code)):
    pass # This is where the if/else statement goes

  return tokens

I don't want to use lex or anything. Thanks in advance for the help.

Parser generators can help you get started quickly by helping you define syntax trees and giving you a declarative syntax to describe the lexing & parsing steps.

that's not only ugly but slow too

This seems odd to me. Hand-rolled lexers are usually quite performant, as long as your syntax doesn't require too much lookahead or backtracking.

Parser generators typically work based on automata; they build state tables, so most of the work is just a loop that, at each step, looks up into those tables.
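As a toy illustration of that table-driven loop, here's a minimal sketch (the names `TABLE`, `char_class`, and `matches_number` are made up for this example) of a DFA with an explicit transition table that recognizes the single token "one or more digits":

```python
# Toy table-driven automaton: states are integers, and the transition
# table maps (state, character class) -> next state. Missing entries
# mean the error state. Illustrative only, not a full lexer.
START, IN_NUM, ERROR = 0, 1, 2

TABLE = {
    (START, 'digit'): IN_NUM,
    (IN_NUM, 'digit'): IN_NUM,
}

def char_class(ch):
    return 'digit' if ch.isdigit() else 'other'

def matches_number(s):
    state = START
    for ch in s:
        state = TABLE.get((state, char_class(ch)), ERROR)
        if state == ERROR:
            return False
    return state == IN_NUM  # accept only in the IN_NUM state
```

A real generated lexer has many states and runs all token patterns at once, but the shape of the inner loop is the same: look up the next state, advance.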

One trick that high-performance, hand-rolled lexers often use is a lookup table that classifies each ASCII character, so the lexing loop looks like

while position < limit:
  code_point = read_codepoint(position)
  if code_point <= MAX_ASCII:
    # switch on CLASSIFICATION[code_point]
  else:
    # Do something else probably identifier related

where CLASSIFICATION stores information that lets you recognize that a quote character inevitably leads to lexing a quoted string or character literal, a space character can be skipped over, and 0-9 inevitably leads to lexing a numeric token.
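Here's one way that classification table could be sketched in Python. The class names and table contents are illustrative; a real lexer would dispatch to per-class scanning routines from the loop:

```python
# A sketch of the ASCII classification-table idea. Each ASCII code point
# gets a small integer class; the lexing loop switches on that class.
MAX_ASCII = 127
OTHER, SPACE, DIGIT, QUOTE, LETTER = range(5)

CLASSIFICATION = [OTHER] * (MAX_ASCII + 1)
for c in ' \t\r\n':
    CLASSIFICATION[ord(c)] = SPACE
for c in '0123456789':
    CLASSIFICATION[ord(c)] = DIGIT
for c in '"\'':
    CLASSIFICATION[ord(c)] = QUOTE
for c in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_':
    CLASSIFICATION[ord(c)] = LETTER

def classify(code_point):
    if code_point <= MAX_ASCII:
        return CLASSIFICATION[code_point]
    # Non-ASCII: one common choice is to treat it as an identifier
    # character, as the pseudocode above suggests.
    return LETTER
```

Building the table once up front keeps the hot loop to an array index plus a branch on a small integer, instead of a chain of string comparisons.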

Some regex pattern or something?

This can work if your lexical grammar is regular.
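If it is regular, the whole lexer can be a single alternation of named groups, in the spirit of the tokenizer recipe in the Python `re` module documentation. The token names and patterns below are illustrative, and the `Token` namedtuple stands in for the `Token` class from the question:

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])  # stand-in for the question's class

# One named group per token kind; order matters (earlier wins on ties).
TOKEN_SPEC = [
    ('NUMBER',   r'\d+(?:\.\d+)?'),   # integer or decimal literal
    ('IDENT',    r'[A-Za-z_]\w*'),    # identifier or keyword
    ('OP',       r'[+\-*/=]'),        # single-character operators
    ('SKIP',     r'\s+'),             # whitespace: discarded
    ('MISMATCH', r'.'),               # anything else: an error
]
MASTER_RE = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def lex(code):
    tokens = []
    for match in MASTER_RE.finditer(code):
        kind = match.lastgroup      # name of the group that matched
        value = match.group()
        if kind == 'SKIP':
            continue
        if kind == 'MISMATCH':
            raise SyntaxError(f'Unexpected character {value!r}')
        tokens.append(Token(kind, value))
    return tokens
```

This replaces the big if/else with one compiled pattern; `re` internally does roughly the table-driven matching described above.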

That probably isn't true if your syntax requires nested tokens.

For example, JS is non-regular because template strings can embed expressions:

`string stuff ${ expressionStuff } more string stuff`

so a JS lexer needs to keep state so it knows whether a } transitions back into a string state or not.
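That state can be as simple as a stack of modes: push when `${` opens an expression hole, pop when the matching } closes it. A minimal, hypothetical sketch (the function `scan_modes` is made up for illustration; it ignores escapes and nested templates):

```python
def scan_modes(src):
    """Return the lexer mode ('code' or 'string') for each character of src.

    Simplified: handles backtick templates with ${ ... } holes and plain
    { } blocks inside a hole, but no escapes and no nested templates.
    """
    stack = ['code']   # mode stack; the top is the active lexing mode
    modes = []
    i = 0
    while i < len(src):
        ch = src[i]
        modes.append(stack[-1])
        if stack[-1] == 'code':
            if ch == '`':
                stack.append('string')   # entering a template literal
            elif ch == '{':
                stack.append('code')     # plain block: its } must not end the hole
            elif ch == '}':
                stack.pop()              # may transition back into 'string'
        else:  # 'string' mode
            if ch == '`':
                stack.pop()              # template literal ends
            elif ch == '$' and src[i + 1:i + 2] == '{':
                modes.append('string')   # the '{' of '${' is still string-side
                stack.append('code')     # now lexing the embedded expression
                i += 1
        i += 1
    return modes
```

Because the stack can nest arbitrarily deep, no single regular expression can track it; this is exactly the non-regularity the answer is pointing at.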
