Regular Expression in Python: search() vs findall() for (\w)+

I have created a regular expression:

agentRegex = re.compile(r'Agent (\w)+')

Then I called search() on it:

agentRegex.search('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.').group()

I got 'Agent Alice' as output.

But when I called findall():

agentRegex.findall('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.') 

The output was ['e', 'l', 'e', 'b'].

Shouldn't the output be ['Agent Alice', 'Agent Carol', 'Agent Eve', 'Agent Bob']?

When the pattern contains a capturing group, re.findall() returns a list of what the group captured rather than the whole matches. In your case the group is (\w).

Get rid of the capturing group:

Agent \w+

Example:

>>> import re
>>> agentRegex = re.compile(r'Agent \w+')
>>> agentRegex.findall('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
['Agent Alice', 'Agent Carol', 'Agent Eve', 'Agent Bob']
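To make the effect of groups on findall() concrete, here is a small sketch (the variable names are only for illustration). With no capturing group findall() returns the whole matches, with one group it returns that group's text, and with several groups it returns one tuple per match. A non-capturing group (?:...) keeps the grouping without changing what findall() returns.

```python
import re

text = 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'

# No capturing group: findall() returns the whole matches.
no_group = re.findall(r'Agent \w+', text)

# One capturing group: findall() returns only what the group captured.
one_group = re.findall(r'Agent (\w+)', text)

# Two capturing groups: findall() returns one tuple per match.
two_groups = re.findall(r'(Agent) (\w+)', text)

# Non-capturing group (?:...): grouping without capturing, so whole matches again.
non_capturing = re.findall(r'Agent (?:\w)+', text)

print(no_group)
print(one_group)
print(two_groups)
print(non_capturing)
```

If you want only the names, the one-group form 'Agent (\w+)' is the one to use. Note that the + sits inside the parentheses there.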

Your regex:

'Agent (\w)+'

After matching 'Agent ', it repeatedly matches and captures a single \w character, and each repetition overwrites the group's previous capture, so the group ends up holding only the last character matched. That's how you get ['e', 'l', 'e', 'b'], which are the last characters of ['Alice', 'Carol', 'Eve', 'Bob'].

You got the correct answer from .search().group() because group() defaults to group(0), which contains the entire match. If you call .search().group(1) instead, you will get 'e'.
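A short sketch of that difference, using the pattern from the question (the names agent_regex, whole and last_char are just for illustration):

```python
import re

agent_regex = re.compile(r'Agent (\w)+')
match = agent_regex.search(
    'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'
)

whole = match.group()        # group() is the same as group(0): the entire match
last_char = match.group(1)   # the repeated group keeps only its final capture
last_span = match.span(1)    # position of that final capture within the string

print(whole)      # Agent Alice
print(last_char)  # e
print(last_span)  # (10, 11)
```

findall() reports group 1 for every match, which is why it shows only those final characters.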

What you are looking for is to capture 'Agent' as well as the next word, so you can try what heemayl and Dietrich suggested.

You could do this too:

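import re
# \s+ allows any run of whitespace after 'Agent'; [^\s]+ (same as \S+) matches the non-whitespace run that follows
agentRegex = re.compile(r'Agent\s+[^\s]+')
print(agentRegex.findall('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'))
# ['Agent Alice', 'Agent Carol', 'Agent Eve', 'Agent Bob']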
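One thing to keep in mind when choosing between \w+ and [^\s]+: they differ when a name is followed directly by punctuation. The sketch below uses a made-up sentence (not from the question) to show it.

```python
import re

# Hypothetical input where punctuation follows the names directly.
text = 'I met Agent Bob, then Agent Eve.'

# \w+ stops at the first non-word character, so punctuation is excluded.
word_chars = re.findall(r'Agent \w+', text)

# [^\s]+ runs up to the next whitespace, so trailing punctuation is included.
non_space = re.findall(r'Agent\s+[^\s]+', text)

print(word_chars)  # ['Agent Bob', 'Agent Eve']
print(non_space)   # ['Agent Bob,', 'Agent Eve.']
```

For the sentence in the question both patterns give the same result, because every name there is followed by a space.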
