
re.findall failing for regex with grouping in Python

I'm writing a Python program that uses a regex to find email addresses. The re.findall function gives unexpected output whenever I use parentheses for grouping. Can anyone point out the mistake or suggest an alternative solution?

Here are two snippets of code to explain:

import re

# Raw string, so the backslashes reach the regex engine unchanged
pat = r"[\w]+[ ]*@[ ]*[\w]+.[\w]+"
re.findall(pat, 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com ')

gives the output

['abc@cs.stansoft', 'myacc@gmail.com']

However, if I use grouping in this regex and modify the code to

pat = r"[\w]+[ ]*@[ ]*[\w]+(.[\w]+)*"
re.findall(pat, 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com ')

the output is

['.com', '.com']

To confirm the correctness of the regex, I tried the second pattern at http://regexpal.com/ with the same input string, and both email addresses matched successfully.

In Python, re.findall returns the whole match only if the pattern contains no capturing groups. If the pattern has one group, it returns a list of that group's text; with several groups, it returns a list of tuples. When a group is repeated, as in (.[\w]+)*, only its last repetition is kept, which is why you get '.com'. To get around this, use a non-capturing group (?:...). In this case:

pat = r"[\w.]+ *@ *\w+(?:\.\w+)*"
re.findall(pat, 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com ')
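
As a rough side-by-side sketch of the behaviour described above, the snippet below runs three simplified variants of the pattern over the question's input string. The variable names and the simplified patterns are only illustrative and are not taken from the original posts.

```python
import re

text = 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com '

# No groups: findall returns each whole match.
no_group = re.findall(r'\w+@\w+\.\w+', text)

# One capturing group: findall returns only that group's text,
# and a repeated group keeps just its last repetition.
one_group = re.findall(r'\w+@\w+(\.\w+)*', text)

# Non-capturing group: findall returns whole matches again.
non_capturing = re.findall(r'\w+@\w+(?:\.\w+)*', text)

print(no_group)       # ['abc@cs.stansoft', 'myacc@gmail.com']
print(one_group)      # ['.com', '.com']
print(non_capturing)  # ['abc@cs.stansoft.edu.com', 'myacc@gmail.com']
```

The only difference between the second and third patterns is the ?: inside the parentheses, which keeps the grouping for the * quantifier but stops findall from reporting the group.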

You would use groups if you wanted to do something like separate the user from the host.
(The hyphens in the character classes are optional; some email addresses contain them.)

pat = r'([\w\.-]+)@([\w\.-]+)'
re.findall(pat, 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com ')

Output:

[('abc', 'cs.stansoft.edu.com'), ('myacc', 'gmail.com')]
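
If both the full address and its parts are needed at once, one possible approach is re.finditer, which yields match objects instead of strings or tuples. This is a minimal sketch; the group names user and host are hypothetical labels chosen for this example.

```python
import re

text = 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com '

# Named groups are an illustrative choice; numbered groups work the same way.
pat = re.compile(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)')

# group(0) is always the whole match, regardless of how many groups exist.
results = [
    (m.group(0), m.group('user'), m.group('host'))
    for m in pat.finditer(text)
]

for whole, user, host in results:
    print(whole, user, host)
```

Because each match object still exposes group(0), capturing groups no longer hide the whole match the way they do with re.findall.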

To further illustrate, we could replace the host and keep the user from group 1 (\1):

emails = 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com '
pat = r'([\w\.-]+)@([\w\.-]+)'
re.sub(pat, r'\1@live.com', emails)

Output:

'abc@live.com .rtrt.. myacc@live.com '
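
re.sub also accepts a function as the replacement, which may help when the substitution depends on what a group captured. The sketch below rewrites only gmail.com hosts; the function name swap_gmail_host and that rule are assumptions made for illustration.

```python
import re

emails = 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com '
pat = re.compile(r'([\w.-]+)@([\w.-]+)')

def swap_gmail_host(match):
    """Hypothetical rule: move gmail.com users to live.com, keep the rest."""
    user, host = match.group(1), match.group(2)
    if host == 'gmail.com':
        return user + '@live.com'
    # Returning group(0) leaves the address unchanged.
    return match.group(0)

result = pat.sub(swap_gmail_host, emails)
print(result)  # abc@cs.stansoft.edu.com .rtrt.. myacc@live.com 
```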

Simply remove the parentheses from the pattern to match the whole email:

pat = r'[\w\.-]+@[\w\.-]+'
re.findall(pat, 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com ')

Output:

['abc@cs.stansoft.edu.com', 'myacc@gmail.com']
