
Controlling Python PLY lexer states from parser

I am working on a simple SQL SELECT-like query parser, and I need to be able to capture subqueries that can occur at certain places literally. I found lexer states are the best solution and was able to do a POC using curly braces to mark the start and end. However, the subqueries will be delimited by parentheses, not curlys, and parentheses can occur at other places as well, so I can't begin the state at every open-paren. This information is readily available in the parser, so I was hoping to call begin and end at appropriate locations in the parser rules. This, however, didn't work, because the lexer seems to tokenize the stream all at once, so the tokens get generated in the INITIAL state. Is there a workaround for this problem? Here is an outline of what I tried to do:

def p_value_subquery(p):
    """
     value : start_sub end_sub
    """
    p[0] = "( " + p[1] + " )"

def p_start_sub(p):
    """
    start_sub : OPAR
    """
    start_subquery(p.lexer)
    p[0] = p[1]

def p_end_sub(p):
    """
    end_sub : CPAR
    """
    subquery = end_subquery(p.lexer)
    p[0] = subquery

The start_subquery() and end_subquery() functions are defined like this:

def start_subquery(lexer):
    lexer.code_start = lexer.lexpos        # Record the starting position
    lexer.level = 1
    lexer.begin('subquery') 

def end_subquery(lexer):
    value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
    lexer.lineno += value.count('\n')
    lexer.begin('INITIAL')
    return value

The lexer tokens are simply there to detect the close-paren:

@lex.TOKEN(r"\(")
def t_subquery_SUBQST(t):
    # Nested open-paren: go one level deeper.
    t.lexer.level += 1

@lex.TOKEN(r"\)")
def t_subquery_SUBQEN(t):
    # Close-paren: come back up one level.
    t.lexer.level -= 1

@lex.TOKEN(r".")
def t_subquery_anychar(t):
    # Swallow everything else; the text is captured from lexdata later.
    pass
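The snippets above assume a `subquery` lexer state has already been declared; in PLY that is done with a module-level `states` tuple, where the name must match the `t_subquery_*` prefix and `exclusive` means none of the INITIAL-state rules apply while the state is active. A minimal sketch of the assumed declaration (the error handler shown is illustrative, not from the question):

```python
# Declares the extra lexer state used by the t_subquery_* rules.
# 'exclusive' suspends all INITIAL-state rules while it is active,
# so every character must be matched by a t_subquery_* rule.
states = (
    ('subquery', 'exclusive'),
)

# Exclusive states need their own error handler; this one just skips
# past the offending character.
def t_subquery_error(t):
    t.lexer.skip(1)
```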

I would appreciate any help.

This answer may only be partially helpful, but I would also suggest looking at section "6.11 Embedded Actions" of the PLY documentation (http://www.dabeaz.com/ply/ply.html). In a nutshell, it is possible to write grammar rules in which actions occur mid-rule. It would look something similar to this:

def p_somerule(p):
    '''somerule : A B possible_sub_query LBRACE sub_query RBRACE'''

def p_possible_sub_query(p):
    '''possible_sub_query :'''
    ...
    # Check if the last token read was LBRACE.   If so, flip lexer state
    # Sadly, it doesn't seem that the token is easily accessible. Would have to hack it
    if last_token == 'LBRACE':
        p.lexer.begin('SUBQUERY')

Regarding the behavior of the lexer, there is only one token of lookahead being used. So, in any particular grammar rule, at most one extra token has already been read. If you're going to flip lexer states, you need to make sure that it happens before the parser asks to read the next incoming token, keeping in mind that the lookahead token may already have been consumed from the lexer.

Also, if possible, I would try to stay away from the yacc() error handling stack as a solution. There is way too much black magic going on in error handling -- the more you can avoid it, the better.

I'm a bit pressed for time at the moment, but this seems to be something that could be investigated for the next version of PLY. Will put it on my to-do list.

Based on the PLY author's response, I came up with this better solution. I have yet to figure out how to return the subquery as a token, but the rest looks much better and need no longer be considered a hack.

def start_subquery(lexer):
    lexer.code_start = lexer.lexpos        # Record the starting position
    lexer.level = 1
    lexer.begin("subquery")

def end_subquery(lexer):
    lexer.begin("INITIAL")

def get_subquery(lexer):
    value = lexer.lexdata[lexer.code_start:lexer.code_end-1]
    lexer.lineno += value.count('\n')
    return value

@lex.TOKEN(r"\(")
def t_subquery_OPAR(t):
    # Nested open-paren: go one level deeper.
    t.lexer.level += 1

@lex.TOKEN(r"\)")
def t_subquery_CPAR(t):
    t.lexer.level -= 1
    if t.lexer.level == 0:
        t.lexer.code_end = t.lexer.lexpos        # Record the ending position
        return t

@lex.TOKEN(r".")
def t_subquery_anychar(t):
    # Swallow everything else; the text is captured from lexdata later.
    pass

def p_value_subquery(p):
    """
    value : check_subquery_start OPAR check_subquery_end CPAR
    """
    p[0] = "( " + get_subquery(p.lexer) + " )"

def p_check_subquery_start(p):
    """
    check_subquery_start : 
    """
    # Here last_token would be yacc's lookahead.
    if last_token.type == "OPAR":
        start_subquery(p.lexer)

def p_check_subquery_end(p):
    """
    check_subquery_end : 
    """
    # Here last_token would be yacc's lookahead.
    if last_token.type == "CPAR":
        end_subquery(p.lexer)

last_token = None

def p_error(p):
    if p is None:
        print("ERROR: unexpected end of query", file=sys.stderr)
    else:
        print("ERROR: Skipping unrecognized token", p.type, "(" + p.value +
              ") at line:", p.lineno, "and column:",
              find_column(p.lexer.lexdata, p), file=sys.stderr)
        # Just discard the token and tell the parser it's okay.
        parser.errok()

def get_token():
    global last_token
    last_token = lexer.token()
    return last_token

def parse_query(input, debug=0):
    lexer.input(input)
    return parser.parse(input, tokenfunc=get_token, debug=debug)
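Both `p_error` handlers call a `find_column` helper that isn't shown. The PLY documentation suggests deriving the column from the last newline before the token's `lexpos`; a sketch of what it could look like under that assumption:

```python
def find_column(input_text, token):
    # 1-based column of `token` within `input_text`: the distance from
    # the last newline before token.lexpos (PLY sets lexpos on tokens).
    line_start = input_text.rfind('\n', 0, token.lexpos) + 1
    return (token.lexpos - line_start) + 1
```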

Since nobody had an answer, it bugged me enough to find a workaround, and here is an ugly hack using error recovery and restart().

def start_subquery(lexer, pos):
    lexer.code_start = lexer.lexpos        # Record the starting position
    lexer.level = 1
    lexer.begin("subquery") 
    lexer.lexpos = pos

def end_subquery(lexer):
    value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
    lexer.lineno += value.count('\n')
    lexer.begin('INITIAL')
    return value

@lex.TOKEN(r"\(")
def t_subquery_SUBQST(t):
    # Nested open-paren: go one level deeper.
    t.lexer.level += 1

@lex.TOKEN(r"\)")
def t_subquery_SUBQEN(t):
    t.lexer.level -= 1
    if t.lexer.level == 0:
        t.type = "SUBQUERY"
        t.value = end_subquery(t.lexer)
        return t

@lex.TOKEN(r".")
def t_subquery_anychar(t):
    # Swallow everything else; the text is captured from lexdata later.
    pass

# NOTE: Due to the nature of the ugly workaround, the CPAR gets dropped, which
# makes it look like there is an imbalance.
def p_value_subquery(p):
    """
     value : OPAR SUBQUERY
    """
    p[0] = "( " + p[2] + " )"

subquery_retry_pos = None

def p_error(p):
    global subquery_retry_pos
    if p is None:
        print("ERROR: unexpected end of query", file=sys.stderr)
    elif p.type == 'SELECT' and parser.symstack[-1].type == 'OPAR':
        lexer.input(lexer.lexdata)
        subquery_retry_pos = parser.symstack[-1].lexpos
        parser.restart()
    else:
        print("ERROR: Skipping unrecognized token", p.type, "(" + p.value +
              ") at line:", p.lineno, "and column:",
              find_column(p.lexer.lexdata, p), file=sys.stderr)
        # Just discard the token and tell the parser it's okay.
        parser.errok()

def get_token():
    global subquery_retry_pos
    token = lexer.token()
    if token and token.lexpos == subquery_retry_pos:
        start_subquery(lexer, lexer.lexpos)
        subquery_retry_pos = None
    return token

def parse_query(input, debug=0):
    lexer.input(input)
    return parser.parse(input, tokenfunc=get_token, debug=debug)
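The level-counting that the `t_subquery_*` rules perform can be sketched outside of PLY as a plain scanner: starting just past an open-paren, bump the level on `(`, drop it on `)`, and stop when the level returns to zero. This is only an illustration of the logic (the function name and interface are invented for the sketch), not part of the answer's code:

```python
def scan_subquery(text, start):
    """Return (body, end_pos) for the parenthesized group whose opening
    '(' sits just before `start`; end_pos is the index after the closing
    ')'. Raises ValueError on unbalanced parentheses."""
    level = 1
    pos = start
    while pos < len(text):
        ch = text[pos]
        if ch == '(':
            level += 1          # nested open-paren: deeper
        elif ch == ')':
            level -= 1          # close-paren: back up
            if level == 0:      # matched the original open-paren
                return text[start:pos], pos + 1
        pos += 1
    raise ValueError("unbalanced parentheses")
```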
