简体   繁体   中英

Parsing logical expressions

I have a task where I have to filter a Pandas DataFrame based on user specified logical expression. Now, I've seen a module called PyParser or LARK which I would like to use but I cannot seem to figure out how to set them up.

I have several operators like CONTAINS , EQUAL , FUZZY_MATCH etc. Also, I'd like to combine some expressions into more complex ones.

Example expression:

ColumnA CONTAINS [1, 2, 3] AND (ColumnB FUZZY_MATCH 'bla' OR ColumnC EQUAL 45)

As a result, I'd like to have some structured Dict or List with levels of operations in order of how to execute them. So, the desired result for this example expression would be something like:

[['ColumnA', 'CONTAINS', '[1, 2, 3]'], 'AND', [['ColumnB', 'FUZZY_MATCH', 'bla'], OR, ['ColumnC', 'EQUAL', '45']]]

or in form of dict:

{
  'EXPR1': {
    'col': 'ColumnA', 
    'oper': 'CONTAINS', 
    'value': '[1, 2, 3]']
  },
  'OPERATOR': 'AND', 
  'EXPR2': {
    'EXPR21': {
      'col': 'ColumnB', 
      'oper': 'FUZZY_MATCH', 
      'value': 'bla'
    }, 
    'OPERATOR': OR, 
    'EXPR22': {
      'col': 'ColumnC', 
      'oper': 'EQUAL', 
      'value': '45'
    }
  }
}

Or something like that. If you have some better way of structuring the result, I'm open for suggestions. I'm pretty new to this so I'm fairly certain this can be improved.

Interesting problem:)

Seems like a relatively straightforward application of the shunting yard algorithm.
I had written code to parse expressions like "((20 - 10 ) * (30 - 20) / 10 + 10 ) * 2" over here .

import re


def tokenize(str):
   return re.findall("[+/*()-]|\d+", expression)

def is_number(str):
    try:
        int(str)
        return True
    except ValueError:
        return False


def peek(stack):
    return stack[-1] if stack else None


def apply_operator(operators, values):
    operator = operators.pop()
    right = values.pop()
    left = values.pop()
    values.append(eval("{0}{1}{2}".format(left, operator, right)))


def greater_precedence(op1, op2):
    precedences = {"+": 0, "-": 0, "*": 1, "/": 1}
    return precedences[op1] > precedences[op2]


def evaluate(expression):
    tokens = tokenize(expression)
    values = []
    operators = []
    for token in tokens:
        if is_number(token):
            values.append(int(token))
        elif token == "(":
            operators.append(token)
        elif token == ")":
            top = peek(operators)
            while top is not None and top != "(":
                apply_operator(operators, values)
                top = peek(operators)
            operators.pop()  # Discard the '('
        else:
            # Operator
            top = peek(operators)
            while top is not None and top != "(" and greater_precedence(top, token):
                apply_operator(operators, values)
                top = peek(operators)
            operators.append(token)
    while peek(operators) is not None:
        apply_operator(operators, values)

    return values[0]


def main():
    expression = "((20 - 10 ) * (30 - 20) / 10 + 10 ) * 2"
    print(evaluate(expression))


if __name__ == "__main__":
    main()

I reckon we can modify the code slightly to make it work for your case:

  1. We need to modify the way in which we tokenize the input string in tokenize() .
    Basically, given the string ColumnA CONTAINS [1, 2, 3] AND (ColumnB FUZZY_MATCH 'bla' OR ColumnC EQUAL 45) , we want a list of tokens:
    ['ColumnA', 'CONTAINS', '[1, 2, 3]', 'AND', '(', 'ColumnB', 'FUZZY_MATCH', "'bla'", 'OR', 'ColumnC', 'EQUAL', '45', ')'] .
    This would highly depend on how complex the input string can be and would require some string processing, but its fairly simple and I'll leave this to you.
  2. Modify the is_number() function to rather detect things like ColumnA , [1, 2, 3] etc.
    Basically, everything apart from predicates CONTAINS / FUZZY_MATCH / EQUAL , operators AND / OR and parantheses ( / ) .
  3. Modify greater_precedence(op1, op2) to return true in case op1 is among ['CONTAINS', 'EQUAL', ..] and op2 is ['AND', 'OR'] .
    This is because we want the contains and equals to be always evaluated before AND / OR .
  4. Modify apply_operator(operators, values) to implement logic of how to evaluate the boolean expression ColumnA CONTAINS [1, 2, 3] or the expression true AND false .
    Remember that CONTAINS / FUZZY_MATCH / EQUAL / AND / OR etc all are operators here.
    Probably you'll need to write a lot of if-else cases here as there can be a lot of different operators.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM