
Ply Lex Yacc: treat \n as a token in some rules and ignore it otherwise

I'm trying to write a parser using ply in which \n is sometimes syntactically important and sometimes has to be ignored.

More precisely, in the language I would like to parse, some lines correspond to definitions, and each of them must end with a \n to mark the end of the definition. In all other cases, the \n has to be ignored and is only useful for counting lines in the input file.

For instance:

def1 
def2


def3

def4

would be valid, but:

def1 
def2 def3

def 4

wouldn't, since each def must end with a \n.

What I want is a bit similar to what we have in Python, in which we can write:

def a(b):
    if b==0:
        return (b+1)

or

def a(b):



    if b==0:

        return (b+1)

but

def a(b): if b==0: return (b+1)

is not allowed. The \n is necessary to indicate the end of a statement but otherwise has no effect on the code.

And I don't know how to reproduce such behaviour with ply. If I define a token like so:

def t_NEWLINE(self,t):
    r'\n+'
    t.lexer.lineno += len(t.value)
    return t

No \n would be allowed unless the grammar explicitly allows this token to be inserted almost everywhere.

I thought about a contextual grammar, but there's a single context in my case. I would just like to be able to use \n as a token in certain rules and have it ignored otherwise.

Is there any way of doing this?

Since Ply gives you the power of a Turing-complete programming language (Python), there certainly will be a way. However, it's impossible to provide much of a solution without knowing anything about the specifics of the problem.

Lexical analysis of Python itself does require a more sophisticated strategy, which includes a small state machine (basically to eliminate newlines inside brackets, where they are ignored). Note that even simple Python statements must be terminated with either a newline or a semicolon, so the terminator is definitely in the grammar. Typical Python lexical analysers ignore comments and blank lines; I could provide an example, but I don't know that it is relevant here since your language is apparently only "a bit similar to what we have in Python".
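
For what it's worth, the bracket-depth idea can be sketched in a few lines of plain Python. This is only an illustration, not how CPython's tokenizer is written: the token pattern, the token names (NEWLINE, OPEN, CLOSE, WORD) and the function significant_tokens are all made up for this sketch, and comments and indentation are not handled.

```python
import re

# Hypothetical token pattern for this sketch: newlines, brackets, and
# "everything else that is not whitespace or a bracket".
TOKEN_RE = re.compile(
    r'(?P<NEWLINE>\n)|(?P<OPEN>[(\[{])|(?P<CLOSE>[)\]}])|(?P<WORD>[^\s()\[\]{}]+)'
)

def significant_tokens(text):
    """Yield (type, value) pairs, dropping newlines inside brackets and on blank lines."""
    depth = 0               # current bracket nesting depth
    line_has_token = False  # has the current logical line produced a token yet?
    for m in TOKEN_RE.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == 'NEWLINE':
            # A newline only matters outside brackets and after a non-blank line.
            if depth == 0 and line_has_token:
                yield ('NEWLINE', '\n')
                line_has_token = False
            continue
        if kind == 'OPEN':
            depth += 1
        elif kind == 'CLOSE':
            depth = max(depth - 1, 0)
        line_has_token = True
        yield (kind, value)
```

Here the only state is a depth counter plus one flag, which is roughly what "a small state machine" amounts to in this case.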

So I've gone out on a limb here to try to think up a use case which fits the very broad description in your question and which is relatively easy to solve in Ply. I accept that it might have no relevance at all to your use case, but it might serve some future reader with a different but similar requirement.

It's actually pretty rare to encounter a language in which statements do not require any form of termination, although it is certainly not impossible. For example, a typical language which includes

  1. statements which end with an expression (or expression statements, like function calls),
  2. statements which begin with an expression, again including expression statements but also keywordless assignment (a = b rather than, for example, let a = b),
  3. dual-purpose parentheses representing both grouping and function call arguments

will be ambiguous unless statements have a definite terminator. (a(b) could be a function call (one statement) or two consecutive expression statements; similar examples can be constructed for most languages which have the above characteristics.)

Still, all that could be surmounted with language design. The easiest such design would be to require that all statements, even function calls and assignments, start with a keyword. Presumably in such a language, definition statements also start with a keyword, and the only reason to insist on newlines around definitions is aesthetic. (But aesthetics is fine as a reason. It's aesthetics rather than parsing limitations which makes the Python one-line definition in your question illegal.)

Suppose, then, that we have a language with definition statements starting with the keyword whose symbol is DEF and ending with the symbol END (otherwise, we won't know where the definition ends). We'll also assume assignment statements starting with the keyword LET, which require no explicit termination. (Of course there will be other statement types, but they will follow the same pattern as LET.) For whatever reason, we want to ensure that a DEF is always the first token on a line and an END is always the last token, which guarantees that a definition does not horizontally coexist with any other statement, although we're comfortable with LET a = b LET c = 3.

One way to do this would be to ignore newlines except for the ones which precede DEF or follow END. We'd then write a grammar which included:

lines       : #empty
            | lines line NEWLINE
line        : #empty
            | line simple_stmt
            | definition
definition  : DEF prototype lines END
simple_stmt : LET lhs '=' rhs

Note that the above grammar requires that the program either be empty or end with a NEWLINE.

Now, to filter out the unimportant NEWLINE tokens, we can use a wrapper class around the Ply-generated lexer. The constructor for the wrapper takes a lexer as an argument, and the wrapper filters the output stream from that lexer by removing NEWLINE tokens which are considered unimportant. We also ensure that the input ends with a NEWLINE unless it is empty, by fabricating a NEWLINE token if necessary. (That wasn't really part of the question, but it simplifies the grammar.)

# Used to fabricate a token object.
from types import SimpleNamespace

class LexerWrapper(object):

  def __init__(self, lexer):
    """Create a new wrapper given the lexer which is being wrapped"""
    self.lexer = lexer
    # None, or tuple containing queued token.
    # Using a tuple allows None (eof) to be queued.
    self.pending = None
    # Previous token or None
    self.previous = None

  def token(self):
    """Return the next token, or None if end of input has been reached"""
    # If there's a pending token, send it
    if self.pending is not None:
      t = self.pending[0]
      self.pending = None
      return t
    # Get the next (useful) token
    while True:
      t = self.lexer.token()
      # Make sure that we send a NEWLINE before EOF
      if t is None:
        t, self.previous = self.previous, None
        self.pending = (None,)
        if t is not None and t.type != 'NEWLINE':
          # Manufacture a NEWLINE token if necessary
          t = SimpleNamespace( type='NEWLINE'
                             , value='\n'
                             , lineno=self.lexer.lineno
                             )
        return t
      elif t.type == 'NEWLINE':
        if self.previous is None or self.previous.type == 'NEWLINE':
          # Get another token
          continue
        if self.previous.type == 'END':
          # Use this NEWLINE if it follows END
          self.previous = None
        else:
          # Get another token
          self.previous = t
          continue
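      elif t.type == 'DEF' and self.previous is not None and self.previous.type == 'NEWLINE':
        # Send the held NEWLINE which precedes DEF, and queue the DEF itself
        self.pending = (t,)
        t, self.previous = self.previous, t
      else:
        self.previous = t
      return t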
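
To make the filtering rule easier to try out without Ply installed, here is a rough generator-based sketch of the same idea: keep a NEWLINE only when it precedes DEF, follows END, or closes the last line of the input. The Tok namedtuple and the filter_newlines function are hypothetical stand-ins for this sketch; real Ply tokens also carry lineno and lexpos.

```python
from collections import namedtuple

# Hypothetical minimal token type for this sketch (not Ply's LexToken).
Tok = namedtuple('Tok', 'type value')

def filter_newlines(tokens):
    """Yield tokens, keeping a NEWLINE only before DEF, after END, or at end of input."""
    previous = None  # last token that may still matter, or None
    for t in tokens:
        if t.type == 'NEWLINE':
            if previous is None or previous.type == 'NEWLINE':
                continue          # leading or repeated NEWLINE: drop it
            if previous.type == 'END':
                previous = None
                yield t           # NEWLINE directly after END is significant
            else:
                previous = t      # hold it until we see what follows
        else:
            if t.type == 'DEF' and previous is not None and previous.type == 'NEWLINE':
                yield previous    # the held NEWLINE precedes DEF, so emit it
            previous = t
            yield t
    # End of input: close the last line unless it is already closed.
    if previous is not None:
        yield previous if previous.type == 'NEWLINE' else Tok('NEWLINE', '\n')

# Example: token types only, with the type reused as the value.
src = 'LET NEWLINE LET NEWLINE DEF NEWLINE LET NEWLINE END NEWLINE LET'.split()
print([t.type for t in filter_newlines(Tok(s, s) for s in src)])
```

With the class-based wrapper, I would expect the usual pattern to be feeding the text to the wrapped lexer yourself (lexer.input(text)) and then calling parser.parse(lexer=LexerWrapper(lexer)), since the wrapper only provides token(). Treat that as an assumption to check against your Ply version.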
