简体   繁体   English

如何编辑解析器以正确组合“ AND”和“ OR”谓词?

[英]How can I edit my parser to properly group “AND” and “OR” predicates?

I am currently trying to write a small parser able to parse very simple key = value queries. 我目前正在尝试编写一个小型解析器,能够解析非常简单的key = value查询。 But it should be smart enough to handle AND and OR groups, with AND having a higher precendence. 但是它应该足够聪明以处理ANDOR组,并且AND具有更高的优先级。 Example text-input: 文本输入示例:

a = 10 && b = 20
a = 10 || b = 20
a = 10 && b = 20 || c = 30

The first two are trivial. 前两个很简单。 The last should group the first two predicates as an "AND" group, and that group should then be grouped in an "OR" group. 最后一个应将前两个谓词归为“ AND”组,然后应将该组归为“ OR”组。

I have the basics down, but got stuck on the proper grouping. 我掌握了基础知识 ,但仍无法正确分组。 I am using ply which uses a flex/bison/lex/yacc syntax to define the grammar. 我正在使用使用flex / bison / lex / yacc语法定义语法的ply If I'm totally heading down a wrong track with my existing syntax please let me know... That would be a valuable learning experience concerning parsers. 如果我完全按照我现有的语法走错了路,请告诉我...这将是有关解析器的宝贵学习经验。

I've tried setting the precedence, but I don't think it's really caused by a reduce/reduce conflict. 我尝试设置优先级,但是我不认为这实际上是由减少/减少冲突引起的。 I think it's more of an issue of the way I've defined my grammar in general, but I can't figure out what I need to change. 我认为这更多是关于我总体上定义语法的方式的问题,但是我无法弄清楚需要更改什么。

Below is my current implementation and a unit-test file. 以下是我当前的实现和一个单元测试文件。 The test-file should help understanding the expected output. 测试文件应有助于理解预期的输出。 There's currently one failing test. 当前有一项测试失败。 That's the one that causes me headaches. 那是让我头疼的那件事。

The tests can be run using the builtin unittest module, but, as I execute some print statements in the tests, I suggest using pytest as it intercepts those and causes less of a mess. 可以使用内置的unittest模块运行测试,但是,由于我在测试中执行了一些print语句,因此我建议使用pytest因为它可以拦截这些print语句,并且pytest造成混乱。 For example (assuming both files are in the same folder): 例如(假设两个文件都在同一文件夹中):

python -m venv env
./env/bin/pip install pytest
./env/bin/pytest test_query_string.py

Filename: queryparser.py 文件名: queryparser.py

import logging
from collections import namedtuple

import ply.lex as lex
import ply.yacc as yacc


LOG = logging.getLogger(__name__)

Predicate = namedtuple('Predicate', 'key operator value')


class Production:

    def __repr__(self):
        preds = [repr(pred) for pred in self._predicates]
        return '%s(%s)' % (self.__class__.__name__, ', '.join(preds))

    def __eq__(self, other):
        return (
            self.__class__ == other.__class__ and
            self._predicates == other._predicates)

    def debug(self, indent=0, aslist=False):
        lines = []
        lines.append('    ' * indent + self.__class__.__name__)
        for predicate in self._predicates:
            if hasattr(predicate, 'debug'):
                lines.extend(predicate.debug(indent + 1, aslist=True))
            else:
                lines.append('    ' * (indent+1) + repr(predicate))
        if aslist:
            return lines
        else:
            return '\n'.join(lines)



class Conjunction(Production):

    def __init__(self, *predicates):
        self._predicates = predicates


class Disjunction(Production):

    def __init__(self, *predicates):
        self._predicates = predicates


def parse(query: str, debug=False) -> Predicate:
    lexer = QueryLexer().build()
    parser = QueryParser().build()
    if debug:
        output = parser.parse(query, debug=LOG)
    else:
        output = parser.parse(query)
    return output or []


class QueryLexer:

    tokens = (
        'WORD',
        'OPERATOR',
        'QUOTE',
        'AND',
        'OR'
    )

    t_ignore = ' \t'
    t_QUOTE = '"'

    def t_error(self, t):
        LOG.warning('Illegal character %r', t.value[0])
        t.lexer.skip(1)

    def t_WORD(self, t):
        r'\w+'
        return t

    def t_OPERATOR(self, t):
        r'(=|!=|>|<|<=|>=)'
        return t

    def t_AND(self, t):
        r'&&'
        return t

    def t_OR(self, t):
        r'\|\|'
        return t

    def build(self, **kwargs):
        self.lexer = lex.lex(module=self, **kwargs)


class QueryParser:

    precedence = (
        ('nonassoc', 'OR'),
        ('nonassoc', 'AND'),
    )

    def p_query(self, p):
        '''
        query : disjunction
              | conjunction
              | predicate
        '''
        p[0] = p[1]

    def p_disjunction(self, p):
        '''
        disjunction : predicate OR predicate
                    | predicate OR conjunction
                    | predicate OR disjunction
        '''
        output = [p[1]]
        if p.slice[3].type == 'disjunction':
            # We can merge multiple chanined disjunctions together
            output.extend(p[3]._predicates)
        else:
            output.append(p[3])
        p[0] = Disjunction(*output)

    def p_conjunction(self, p):
        '''
        conjunction : predicate AND predicate
                    | predicate AND conjunction
                    | predicate AND disjunction
        '''
        if len(p) == 4:
            output = [p[1]]
            if p.slice[3].type == 'conjunction':
                # We can merge multiple chanined disjunctions together
                output.extend(p[3]._predicates)
            else:
                output.append(p[3])
            p[0] = Conjunction(*output)
        else:
            p[0] = Conjunction(p[1])

    def p_predicate(self, p):
        '''
        predicate : maybequoted OPERATOR maybequoted
        '''
        p[0] = Predicate(p[1], p[2], p[3])

    def p_maybequoted(self, p):
        '''
        maybequoted : WORD
                    | QUOTE WORD QUOTE
        '''
        if len(p) == 4:
            p[0] = p[2]
        else:
            p[0] = p[1]

    def p_error(self, p):
        """
        Panic-mode rule for parser errors.
        """
        if not p:
            LOG.debug('Syntax error at EOF')
        else:
            self.parser.errok()
        LOG.error('Syntax Error at %r', p)

    def build(self):
        self.tokens = QueryLexer.tokens
        self.parser = yacc.yacc(module=self, outputdir='/tmp', debug=True)
        return self.parser

Filename: test_query_string.py 文件名: test_query_string.py

import unittest

from queryparser import parse, Conjunction, Disjunction, Predicate


class TestQueryString(unittest.TestCase):

    def test_single_equals(self):
        result = parse('hostname = foo')
        self.assertEqual(result, Predicate('hostname', '=', 'foo'))

    def test_single_equals_quoted(self):
        result = parse('hostname = "foo"')
        self.assertEqual(result, Predicate('hostname', '=', 'foo'))

    def test_anded_equals(self):
        result = parse('hostname = foo && role=cpe')
        self.assertEqual(result, Conjunction(
            Predicate('hostname', '=', 'foo'),
            Predicate('role', '=', 'cpe'),
        ))

    def test_ored_equals(self):
        result = parse('hostname = foo || role=cpe')
        self.assertEqual(result, Disjunction(
            Predicate('hostname', '=', 'foo'),
            Predicate('role', '=', 'cpe'),
        ))

    def test_chained_conjunction(self):
        result = parse('hostname = foo && role=cpe && bla=blub')
        print(result.debug())  # XXX debug statement
        self.assertEqual(result, Conjunction(
            Predicate('hostname', '=', 'foo'),
            Predicate('role', '=', 'cpe'),
            Predicate('bla', '=', 'blub'),
        ))

    def test_chained_disjunction(self):
        result = parse('hostname = foo || role=cpe || bla=blub')
        print(result.debug())  # XXX debug statement
        self.assertEqual(result, Disjunction(
            Predicate('hostname', '=', 'foo'),
            Predicate('role', '=', 'cpe'),
            Predicate('bla', '=', 'blub'),
        ))

    def test_mixed_predicates(self):
        result = parse('hostname = foo || role=cpe && bla=blub')
        print(result.debug())  # XXX debug statement
        self.assertEqual(result, Disjunction(
            Predicate('hostname', '=', 'foo'),
            Conjunction(
                Predicate('role', '=', 'cpe'),
                Predicate('bla', '=', 'blub'),
            )
        ))

    def test_mixed_predicate_and_first(self):
        result = parse('hostname = foo && role=cpe || bla=blub')
        print(result.debug())  # XXX debug statement
        self.assertEqual(result, Conjunction(
            Predicate('hostname', '=', 'foo'),
            Disjunction(
                Predicate('role', '=', 'cpe'),
                Predicate('bla', '=', 'blub'),
            )
        ))

    def test_complex(self):
        result = parse(
            'a=1 && b=2 || c=3 && d=4 || e=5 || f=6 && g=7 && h=8',
            debug=True
        )
        print(result.debug())  # XXX debug statement

        expected = Disjunction(
            Conjunction(
                Predicate('a', '=', '1'),
                Predicate('b', '=', '2'),
            ),
            Conjunction(
                Predicate('c', '=', '3'),
                Predicate('d', '=', '4'),
            ),
            Predicate('e', '=', '5'),
            Conjunction(
                Predicate('f', '=', '6'),
                Predicate('g', '=', '7'),
                Predicate('h', '=', '8'),
            ),
        )

        self.assertEqual(result, expected)

If you are using precedence declarations, both AND and OR should be declared as left , not nonassoc . 如果使用优先级声明,则ANDOR都应声明为left ,而不是nonassoc nonassoc means that a OR b OR c is illegal; nonassoc表示a OR b OR c非法; left means that it is to be interpreted as (a OR b) OR c) and right means a OR (b OR c) . left表示将被解释为(a OR b) OR c)right表示将被解释为a OR (b OR c) (Given the semantics of AND and OR , it makes no difference whether left or right is chosen, but left is generally preferable in such cases.) (鉴于ANDOR的语义,选择left还是right并没有什么区别,但是在这种情况下通常最好选择left 。)

With precedence relationships, it is possible to write an extremely simple grammar: 使用优先级关系,可以编写一个非常简单的语法:

query: predicate
     | query AND query
     | query OR query

(Usually, there would also be an entry for parenthesized expressions.) (通常,还会有一个带括号的表达式的条目。)

The above does not do the chaining you are looking for. 上面没有做您要寻找的链接。 You could do that post-parse by walking the tree, which would generally be my preference. 您可以通过走树来进行后期解析,这通常是我的偏爱。 But it also is possible to chain on the fly, using a grammar with explicit precedence. 但是,也可以使用具有明确优先级的语法动态地进行链接。

Explicit precedence means that the grammar itself defines what it possible; 显式优先意味着语法本身定义了可能的形式。 in particular, since AND binds more tightly than OR , it is not possible to have conjunction: predicate AND disjunction precisely because that production implies that the second operand to AND could be a disjunction, which is not the desired outcome. 特别是,由于AND绑定比OR绑定更紧密,因此不可能有conjunction: predicate AND disjunction恰恰是因为该产生意味着AND的第二个操作数可能是析取,这不是期望的结果。 For this case, you want the common cascading sequence: 对于这种情况,您需要通用的级联序列:

query       : disjunction  # Redundant, but possibly useful for didactic purposes
disjunction : conjunction
            | disjunction OR conjunction   # Left associative
conjunction : predicate
            | conjunction AND predicate

With that grammar, chaining is straight-forward, but it requires an explicit test as in your actions (eg., if p.slice(1).type == 'conjunction: ) which is arguably a bit ugly. 有了这种语法,链接是简单明了的,但是它需要像您的操作中那样进行显式测试(例如, if p.slice(1).type == 'conjunction: ,这可能有点难看。

Ideally, we would want to trigger the correct action directly from the grammar, which would imply something like this (which is very similar to your grammar): 理想情况下,我们希望直接从语法中触发正确的操作,这暗示着这样的事情(与您的语法非常相似):

conjunction: predicate
                # p[0] = p[1]
           | predicate AND predicate
                # p[0] = Conjunction(p[1], p[3])
           | conjunction AND predicate
                # p[0] = Conjunction(*(p[1]._predicates + [p[3]])

The problem with the above rules is that the second and the third both apply to a AND b , since after reducing a to predicate we have both the option to reduce it to conjunction or to shift the AND immediately. 上述规则的问题是,第二和第三都适用于a AND b ,因为减少后apredicate ,我们有两个将其降低到选项conjunction或转移AND马上。 In this case, we want the parser to resolve the shift-reduce conflict by unconditionally shifting, which it will do, but only after producing a warning. 在这种情况下,我们希望解析器通过无条件移位来解决移位减少冲突,这将这样做,但仅在产生警告之后才能进行。 For an explicit solution, we need the conjunction in the third rule to be a real conjunction, with at least one AND operator. 对于一个明确的解决方案,我们需要conjunction在第三个规则是一个真正的结合,与至少一个AND运营商。

With that in mind, we can shift the unit productions to the top of the cascade, resulting in the following: 考虑到这一点,我们可以将单位生产转移到级联的顶部,从而得到以下结果:

query      : disjunction
           | conjunction
           | predicate
disjunction: predicate OR predicate
           | conjunction OR predicate
           | disjunction OR predicate
conjunction: predicate AND predicate
           | conjunction AND predicate

Now we have no need for conditionals in the actions, because we know exactly what we have in every case. 现在,我们不需要在操作中使用条件,因为我们确切地知道每种情况下的条件。

def p_query(self, p):
    '''
    query : disjunction
          | conjunction
          | predicate
    '''
    p[0] = p[1]

def p_disjunction1(self, p):
    '''
    disjunction: predicate OR predicate
               | conjunction OR predicate
    '''
    p[0] = Disjunction(p[1], p[3])

def p_disjunction2(self, p):
    '''
    disjunction: disjunction OR predicate
    '''
    p[0] = Disjunction(*(p[1]._predicate + [p[3]])

def p_conjunction1(self, p):
    '''
    conjunction: predicate AND predicate
    '''
    p[0] = Conjunction(p[1], p[3])

def p_conjunction2(self, p):
    '''
    conjunction: conjunction AND predicate
    '''
    p[0] = Disjunction(*(p[1]._predicate + [p[3]])

Notes 笔记

  1. The grammar provided is fine for the case of two precedence levels, but the number of productions ends up being quadratic in the number of levels. 提供的语法适用于两个优先级别的情况,但是产生的数量最终在级别数量上是二次的。 If that is annoying, an alternative model with more unit productions: 如果这很烦人,那就是具有更多单元产量的替代模型:

     query : disjunction disjunction : conjunction | disjunction_2 disjunction_2 : conjunction OR predicate | disjunction_2 OR predicate conjunction : predicate | conjunction_2 conjunction_2 : predicate AND predicate | conjunction_2 AND predicate 
  2. If you don't insist on parser objects being immutable, you could combine both of the chaining functions ( p_conjunction2 and p_disjunction2 ) into a single function: 如果您不坚持认为解析器对象是不可变的,则可以将两个链接函数( p_conjunction2p_disjunction2 )组合为一个函数:

     def p_chain(self, p): ''' conjunction: conjunction AND predicate disjunction: disjunction OR predicate ''' p[0] = p[1] p[0]._predicate.append(p[3]) 

    Additional simplification could be achieved by making the value of the operator tokens AND and OR the constructor instead of the matched string. 附加的简化可以通过将运营商令牌的价值实现ANDOR 构造函数 ,而不是匹配的字符串。 (The matched string is really redundant, anyway.) This would allow the constructor functions ( p_disjunction1 and p_conjunction1 to also be replaced with a single function: (无论如何,匹配的字符串实际上都是多余的。)这将允许构造函数( p_disjunction1p_conjunction1也可以用单个函数替换:

     def t_AND(self, t): r'&&' t.value = Conjunction return t def t_OR(self, t): r'\\|\\|' t.value = Disjunction return t # ... def p_construct(self, p): ''' disjunction: predicate OR predicate | conjunction OR predicate conjunction: predicate AND predicate ''' p[0] = p[2](p[1], p[3]) 

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM