
Parsing logical expressions

I have a task where I have to filter a Pandas DataFrame based on a user-specified logical expression. I've seen modules such as PyParser or LARK that I would like to use, but I cannot figure out how to set them up.

I have several operators like CONTAINS, EQUAL, FUZZY_MATCH, etc. I'd also like to combine some expressions into more complex ones.

Example expression:

ColumnA CONTAINS [1, 2, 3] AND (ColumnB FUZZY_MATCH 'bla' OR ColumnC EQUAL 45)

As a result, I'd like to get a structured dict or list that nests the operations in the order in which they should be executed. The desired result for this example expression would be something like:

[['ColumnA', 'CONTAINS', '[1, 2, 3]'], 'AND', [['ColumnB', 'FUZZY_MATCH', 'bla'], 'OR', ['ColumnC', 'EQUAL', '45']]]

or in the form of a dict:

{
  'EXPR1': {
    'col': 'ColumnA', 
    'oper': 'CONTAINS', 
    'value': '[1, 2, 3]'
  },
  'OPERATOR': 'AND', 
  'EXPR2': {
    'EXPR21': {
      'col': 'ColumnB', 
      'oper': 'FUZZY_MATCH', 
      'value': 'bla'
    }, 
    'OPERATOR': 'OR',
    'EXPR22': {
      'col': 'ColumnC', 
      'oper': 'EQUAL', 
      'value': '45'
    }
  }
}

Or something like that. If you have a better way of structuring the result, I'm open to suggestions. I'm pretty new to this, so I'm fairly certain this can be improved.

Interesting problem :)

This seems like a relatively straightforward application of the shunting yard algorithm.
I had previously written code to parse expressions like "((20 - 10 ) * (30 - 20) / 10 + 10 ) * 2":

import re


def tokenize(expression):
    # Split the input into operators, parentheses and integer literals.
    return re.findall(r"[+/*()-]|\d+", expression)

def is_number(token):
    try:
        int(token)
        return True
    except ValueError:
        return False


def peek(stack):
    return stack[-1] if stack else None


def apply_operator(operators, values):
    operator = operators.pop()
    right = values.pop()
    left = values.pop()
    values.append(eval("{0}{1}{2}".format(left, operator, right)))


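def greater_precedence(op1, op2):
    # Use >= so that operators of equal precedence are applied left to right
    # (e.g. "20 - 10 - 5" evaluates to 5, not 15).
    precedences = {"+": 0, "-": 0, "*": 1, "/": 1}
    return precedences[op1] >= precedences[op2]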


def evaluate(expression):
    tokens = tokenize(expression)
    values = []
    operators = []
    for token in tokens:
        if is_number(token):
            values.append(int(token))
        elif token == "(":
            operators.append(token)
        elif token == ")":
            top = peek(operators)
            while top is not None and top != "(":
                apply_operator(operators, values)
                top = peek(operators)
            operators.pop()  # Discard the '('
        else:
            # Operator
            top = peek(operators)
            while top is not None and top != "(" and greater_precedence(top, token):
                apply_operator(operators, values)
                top = peek(operators)
            operators.append(token)
    while peek(operators) is not None:
        apply_operator(operators, values)

    return values[0]


def main():
    expression = "((20 - 10 ) * (30 - 20) / 10 + 10 ) * 2"
    print(evaluate(expression))


if __name__ == "__main__":
    main()

I reckon we can modify the code slightly to make it work for your case:

  1. We need to modify the way in which we tokenize the input string in tokenize().
    Basically, given the string ColumnA CONTAINS [1, 2, 3] AND (ColumnB FUZZY_MATCH 'bla' OR ColumnC EQUAL 45), we want a list of tokens:
    ['ColumnA', 'CONTAINS', '[1, 2, 3]', 'AND', '(', 'ColumnB', 'FUZZY_MATCH', "'bla'", 'OR', 'ColumnC', 'EQUAL', '45', ')']
    This depends heavily on how complex the input string can be and requires some string processing, but it's fairly simple and I'll leave this to you.
  2. Modify the is_number() function to instead detect operands like ColumnA, [1, 2, 3], etc.
    Basically, everything apart from the predicates CONTAINS / FUZZY_MATCH / EQUAL, the operators AND / OR and the parentheses ( / ).
  3. Modify greater_precedence(op1, op2) to return true when op1 is one of ['CONTAINS', 'EQUAL', ..] and op2 is one of ['AND', 'OR'].
    This is because we want predicates such as CONTAINS and EQUAL to always be evaluated before AND / OR.
  4. Modify apply_operator(operators, values) to implement the logic for evaluating a boolean expression such as ColumnA CONTAINS [1, 2, 3] or true AND false.
    Remember that CONTAINS / FUZZY_MATCH / EQUAL / AND / OR etc. are all operators here.
    You'll probably need to write a lot of if-else cases here, as there can be many different operators.
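
As a rough illustration of the four steps above, here is a minimal sketch that builds the nested list from the question instead of evaluating anything. The names TOKEN_RE, PRECEDENCE and parse() are made up for this example, the regex only handles flat [...] lists and single-quoted strings, and ranking AND above OR is my assumption (the question doesn't say how they should bind):

```python
import re

# Assumed precedence levels: predicates bind tightest, then AND, then OR.
PRECEDENCE = {"OR": 0, "AND": 1, "CONTAINS": 2, "EQUAL": 2, "FUZZY_MATCH": 2}

# Order matters: bracketed lists and quoted strings are tried before bare words.
TOKEN_RE = re.compile(r"\[[^\]]*\]|'[^']*'|[()]|[^\s()]+")


def tokenize(expression):
    return TOKEN_RE.findall(expression)


def apply_operator(operators, values):
    # Instead of evaluating, combine the two operands into a nested node.
    operator = operators.pop()
    right = values.pop()
    left = values.pop()
    values.append([left, operator, right])


def parse(expression):
    values, operators = [], []
    for token in tokenize(expression):
        if token == "(":
            operators.append(token)
        elif token == ")":
            while operators and operators[-1] != "(":
                apply_operator(operators, values)
            operators.pop()  # Discard the '('
        elif token in PRECEDENCE:
            # Reduce everything on the stack that binds at least as tightly.
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]):
                apply_operator(operators, values)
            operators.append(token)
        else:
            # Operand: column name, list literal, number or quoted string.
            values.append(token.strip("'"))
    while operators:
        apply_operator(operators, values)
    return values[0]


if __name__ == "__main__":
    print(parse("ColumnA CONTAINS [1, 2, 3] AND "
                "(ColumnB FUZZY_MATCH 'bla' OR ColumnC EQUAL 45)"))
```

From there, a recursive function over the nested list could apply each predicate to the DataFrame and combine the resulting boolean masks, which is where the if-else cases from step 4 would go.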
