Why does my regular expression return tuples for every character in a string?

I am making a simple project for my math class in which I want to verify that a given function body (a string) contains only the allowed expressions (digits, basic trigonometry, +, -, *, /). I am using regular expressions with the re.findall method. My current code:

import re

def valid_expression(exp) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")

    # characters to search for
    chars = r"(cos)|(sin)|(tan)|[\d+/*x)(-]"

    z = re.findall(chars, exp)
    
    return "".join(z) == exp

However, when I test this on any expression, re.findall(chars, exp) returns a list of tuples. It contains a tuple of three empty strings, ('', '', ''), for every character in the string, unless there is a trig function, in which case it contains a tuple with the trig function and two empty strings.

Ex: cos(x) -> [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

I don't understand why it does this. I have tested the regular expression on regexr.com and it works fine. I get that it uses JavaScript, but normally there should be no difference, right?

Thank you for any explanation and/or fix.

Short answer: If the result you want is ['cos', '(', 'x', ')'], you need something like '(cos|sin|tan|[+/*x)(-]|\d+)':

>>> re.findall(r'(cos|sin|tan|[+/*x)(-]|\d+)', "cos(x)")
['cos', '(', 'x', ')']

From the documentation for findall:

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
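The three cases the documentation describes can be seen side by side in a small sketch. The patterns below are simplified for illustration and are not the ones from the question:

```python
import re

text = "cos(x)"

# No groups: a list of strings matching the whole pattern.
no_groups = re.findall(r"cos|[x()]", text)

# Exactly one group: a list of strings matching that group.
one_group = re.findall(r"(cos|[x()])", text)

# Several groups: a list of tuples with one slot per group.
# A group that took no part in a given match contributes ''.
many_groups = re.findall(r"(cos)|([x()])", text)

print(no_groups)    # ['cos', '(', 'x', ')']
print(one_group)    # ['cos', '(', 'x', ')']
print(many_groups)  # [('cos', ''), ('', '('), ('', 'x'), ('', ')')]
```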

For 'cos(x)', you start with ('cos', '', '') because cos matched, but neither sin nor tan matched. For each of (, x, and ), none of the three capture groups matched, although the bracket expression did. Since it isn't inside a capture group, anything it matches isn't included in your output.

As an aside, [\d+/*x)(-] doesn't match a multidigit integer as a single match. Inside [...], \d+ does not mean "one or more digits"; it is the digit class \d followed by a literal +. (Quantifiers such as + have no special meaning inside [...], while \d keeps its meaning.) As a result, the bracket expression matches exactly one character, which must be one of the following eight:

  1. any digit (\d)
  2. +
  3. /
  4. *
  5. x
  6. )
  7. (
  8. -
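
Running the bracket expression on its own shows what it actually matches. This is a minimal sketch; the sample inputs are made up for illustration:

```python
import re

char_class = r"[\d+/*x)(-]"

# Each digit is a separate match; the + inside [...] is a literal plus sign.
digits = re.findall(char_class, "12+3")

# The letter d is not part of the class, so it is skipped.
letters = re.findall(char_class, "d-x")

print(digits)   # ['1', '2', '+', '3']
print(letters)  # ['-', 'x']
```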

You have three groups (parenthesized subexpressions) in your regex, so you get tuples with three items. You also get four results, one for each substring that matches your regex: the first for 'cos', the second for '(', the third for 'x', and the last for ')'. But the last part of your regex isn't marked as a group, so those matches don't appear in your tuples. If you change your regex to r"(cos)|(sin)|(tan)|([\d+/*x)(-])", you will get tuples with four items, and every tuple will have exactly one non-empty item.
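
As a rough illustration of the four-group variant (the variable names here are only for this sketch), the tuples can be inspected and then flattened by joining each one:

```python
import re

four_groups = r"(cos)|(sin)|(tan)|([\d+/*x)(-])"

result = re.findall(four_groups, "cos(x)")
print(result)
# [('cos', '', '', ''), ('', '', '', '('), ('', '', '', 'x'), ('', '', '', ')')]

# Each tuple has exactly one non-empty item, so joining a tuple
# recovers the matched substring.
tokens = ["".join(t) for t in result]
print(tokens)  # ['cos', '(', 'x', ')']
```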

Unfortunately, this fix doesn't help you verify that the string contains no prohibited lexemes. It only helps to understand what's going on.

I would suggest converting your regex to a negative form: check that the string contains nothing except allowed lexemes, instead of checking that it contains some allowed ones. I guess this should work for simple cases. But I am afraid that for more sophisticated expressions you will have to use something other than a regex.
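
One possible reading of the "negative form" is to delete every allowed lexeme and see whether anything is left over. The sketch below assumes the allowed lexemes are exactly those listed in the question; the names ALLOWED and has_only_allowed_lexemes are hypothetical. It checks lexemes only, not whether the expression is well formed:

```python
import re

# Assumption: the allowed lexemes are the ones listed in the question.
ALLOWED = re.compile(r"cos|sin|tan|[\d+/*x)(-]")

def has_only_allowed_lexemes(exp: str) -> bool:
    """Return True if nothing remains after removing every allowed lexeme."""
    exp = exp.replace(" ", "")
    leftover = ALLOWED.sub("", exp)
    return leftover == ""

print(has_only_allowed_lexemes("cos(x) + 2*x"))  # True
print(has_only_allowed_lexemes("cos(y)"))        # False
```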

findall returns tuples because your regular expression has capturing groups. To make a group non-capturing, add ?: after the opening parenthesis:

r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"
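
Dropped into the function from the question, the non-capturing pattern makes findall return plain strings, so the join comparison works. A minimal sketch; the test inputs are made up for illustration:

```python
import re

def valid_expression(exp: str) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")

    # non-capturing groups, so findall returns a list of strings
    chars = r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"

    z = re.findall(chars, exp)

    # if every character belongs to some match, joining the matches
    # reproduces the input
    return "".join(z) == exp

print(valid_expression("cos(x)"))        # True
print(valid_expression("sin(2*x) - 1"))  # True
print(valid_expression("log(x)"))        # False
```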
