Why does my regular expression return tuples for every character in a string?

I am making a simple project for my math class in which I want to verify if a given function body (string) only contains the allowed expressions (digits, basic trigonometry, +, -, *, /). I am using regular expressions with the re.findall method. My current code:

import re

def valid_expression(exp) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")

    # characters to search for
    chars = r"(cos)|(sin)|(tan)|[\d+/*x)(-]"

    z = re.findall(chars, exp)
    
    return "".join(z) == exp

However, when I test this on any expression, re.findall(chars, exp) returns a list of tuples of three empty strings, ('', '', ''), one for every character in the string, unless there is a trig function, in which case it returns a tuple with the trig function and two empty strings.

Ex: cos(x) -> [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

I don't understand why it does this. I tested the regular expression on regexr.com and it works fine. I know that site uses JavaScript, but normally there should be no difference, right?

Thank you for any explanation and/or fix.

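Short answer: If the result you want is ['cos', '(', 'x', ')'], you need something like '(cos|sin|tan|[-+/*x()]|\d+)' (the - comes first in the bracket expression so that it is a literal rather than a range operator):

>>> re.findall(r'(cos|sin|tan|[-+/*x()]|\d+)', "cos(x)")
['cos', '(', 'x', ')']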

From the documentation for findall :

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

For 'cos(x)', you start with ('cos', '', '') because cos matched, but neither sin nor tan matched. For each of (, x, and ), none of the three capture groups matched, although the bracket expression did. Since the bracket expression isn't inside a capture group, anything it matches isn't included in your output.
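The three cases from the documentation can be compared side by side. This is a minimal sketch; the shortened patterns are illustrative and not the ones from the question.

```python
import re

text = "cos(x)"

# No groups: each item is the text of the whole match.
no_groups = re.findall(r"cos|[x()]", text)

# Exactly one group: each item is the text captured by that group.
one_group = re.findall(r"(cos|[x()])", text)

# Several groups: each item is a tuple with one slot per group;
# a group that took no part in a match contributes an empty string.
many_groups = re.findall(r"(cos)|(sin)|(tan)|[x()]", text)

print(no_groups)    # ['cos', '(', 'x', ')']
print(one_group)    # ['cos', '(', 'x', ')']
print(many_groups)  # [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
```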

As an aside, [\d+/*x)(-] doesn't match a multidigit integer as a single token. A bracket expression matches exactly one character, and inside [...] the + is a literal plus sign, not a quantifier. (In Python, \d keeps its meaning as the digit class inside [...].) As a result, the bracket expression matches exactly one of the following:

  1. any single digit (\d)
  2. +
  3. /
  4. *
  5. x
  6. )
  7. (
  8. -
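What the bracket expression matches can be checked directly with Python's re module. The test string "12+d" below is made up for illustration.

```python
import re

# The bracket expression from the question: one character per match.
char_class = r"[\d+/*x)(-]"
per_char = re.findall(char_class, "12+d")

# Moving \d+ outside the brackets makes + a quantifier again,
# so a run of digits comes back as a single token.
per_token = re.findall(r"\d+|[+/*x)(-]", "12+d")

print(per_char)   # ['1', '2', '+']
print(per_token)  # ['12', '+']
```

The letter d is not matched in either case, because \d stands for a digit rather than for the character d.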

You have three groups (parenthesized subexpressions) in your regex, so you get tuples with three items. You also get four results, one for each substring that matches your regex: the first for 'cos', the second for '(', the third for 'x', and the last for ')'. But the last part of your regex isn't marked as a group, so those matches don't appear in your tuples. If you change your regex to r"(cos)|(sin)|(tan)|([\d+/*x)(-])", you will get tuples with four items, and every tuple will have exactly one non-empty item.
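For illustration, here is what a four-group version of the pattern returns for 'cos(x)'. The variable names are arbitrary.

```python
import re

# Same pattern as in the question, with the bracket expression
# wrapped in a fourth capturing group.
four_groups = r"(cos)|(sin)|(tan)|([\d+/*x)(-])"
result = re.findall(four_groups, "cos(x)")

print(result)
# [('cos', '', '', ''), ('', '', '', '('), ('', '', '', 'x'), ('', '', '', ')')]
```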

Unfortunately, this fix doesn't help you verify that there are no prohibited lexemes. It only shows what's going on.

I would suggest converting your check to a negative form: instead of checking that the string contains some allowed lexemes, check that it contains nothing except allowed lexemes. I guess this should work for simple cases. But I am afraid that for more sophisticated expressions you will have to use something other than regex.
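One possible reading of the negative form is to delete every allowed lexeme and see whether anything is left over. This is only a sketch: the helper name has_only_allowed_tokens and the TOKEN pattern are hypothetical, not from the question. It checks lexemes only, so syntactically invalid input such as "cos(" still passes.

```python
import re

# Hypothetical pattern: every lexeme the question allows.
# The - is first in the brackets so it is a literal, not a range.
TOKEN = r"cos|sin|tan|\d+|[-+/*x()]"

def has_only_allowed_tokens(exp: str) -> bool:
    """Return True if nothing remains after removing allowed lexemes."""
    exp = exp.replace(" ", "")
    # re.sub makes a single left-to-right pass, so removed text is
    # never re-scanned and leftovers cannot merge into a new token.
    leftover = re.sub(TOKEN, "", exp)
    return leftover == ""

print(has_only_allowed_tokens("cos(x) + 2*x"))  # True
print(has_only_allowed_tokens("cos(y)"))        # False
```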

findall returns tuples because your regular expression has capturing groups. To make a group non-capturing, add ?: after the opening parenthesis:

r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"
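As a sketch, dropping this pattern into the function from the question makes findall return plain strings, so the "".join comparison works as intended. Only the pattern changes; the test inputs are made up.

```python
import re

def valid_expression(exp: str) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")

    # non-capturing groups, so findall returns a list of strings
    chars = r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"

    z = re.findall(chars, exp)

    # characters that match nothing are skipped by findall,
    # so the joined result differs from exp when one is present
    return "".join(z) == exp

print(valid_expression("cos(x) + 12"))  # True
print(valid_expression("cos(y)"))       # False
```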
