
python - regex search and findall

I need to find all matches of a given regex in a string. I had been using findall() for that until I came across a case where it didn't do what I expected. For example:

import re

regex = re.compile(r'(\d+,?)+')
s = 'There are 9,000,000 bicycles in Beijing.'

print(regex.search(s).group(0))
> 9,000,000

print(regex.findall(s))
> ['000']

In this case search() returns what I need (the longest match), but findall() behaves differently, although the docs imply it should be the same:

findall() matches all occurrences of a pattern, not just the first one as search() does.

  • Why is the behaviour different?

  • How can I achieve the result of search() with findall() (or something else)?

OK, I see what's going on. From the docs:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

As it turns out, you do have a group, (\d+,?), and it is repeated. A repeated group keeps only its last repetition, so what findall() returns is the last occurrence of this group: '000'.
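A small sketch of that rule, using the string from the question (the variable names are only illustrative). It compares the whole match with what the repeated group holds, and with what findall() reports:

```python
import re

s = 'There are 9,000,000 bicycles in Beijing.'
m = re.search(r'(\d+,?)+', s)

whole_match = m.group(0)  # the entire matched text
last_group = m.group(1)   # only the final repetition of the group
found = re.findall(r'(\d+,?)+', s)  # one entry per match: group 1, not group 0

print(whole_match, last_group, found)
```

So search() and findall() locate the same match here; they differ only in which part of it they hand back.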

One solution is to wrap the entire regex in a group, like this:

regex = re.compile(r'((\d+,?)+)')

Then findall() will return [('9,000,000', '000')], a list with one tuple per match, each tuple containing both groups. Of course, you only care about the first element.
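As a rough illustration of picking out just the outer group from those tuples (the names pairs and numbers are made up for this example):

```python
import re

s = 'There are 9,000,000 bicycles in Beijing.'

# Each match yields a tuple: (outer group, last repetition of inner group).
pairs = re.findall(r'((\d+,?)+)', s)

# Keep only the outer group, which spans the whole number.
numbers = [outer for outer, _inner in pairs]

print(pairs)
print(numbers)
```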

Personally, I would use the following regex

regex = re.compile(r'((\d+,)*\d+)')

so that a match must end in a digit and never picks up a trailing comma, as in "this is a bad number 9,123,".
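To make the difference concrete, here is a sketch that runs both patterns over the "bad number" string; loose and strict are just labels chosen for this example:

```python
import re

s = 'this is a bad number 9,123,'

# The original pattern allows a comma at the very end of the match.
loose = re.search(r'(\d+,?)+', s).group(0)

# The stricter pattern requires the match to end with digits.
strict = re.search(r'((\d+,)*\d+)', s).group(0)

print(repr(loose), repr(strict))
```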

Edit.

Here's a way to avoid having to wrap the expression in parentheses or deal with tuples:

import re

s = "..."
regex = re.compile(r'(\d+,?)+')

# finditer yields one match object per match
for match in regex.finditer(s):
    print(match.group(0))

finditer returns an iterator that you can use to access all the matches found. These match objects are the same kind that re.search returns, so group(0) gives the result you expect.
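If a list is more convenient than a loop, the same idea fits in a comprehension. This is only a sketch; the sample sentence is invented for the example:

```python
import re

s = 'Order 1,250 units now and 30,000 later.'  # invented sample text
regex = re.compile(r'(\d+,?)+')

# group(0) is the whole match, regardless of any capturing groups.
matches = [m.group(0) for m in regex.finditer(s)]

print(matches)
```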

@aleph_null's answer correctly explains what's causing your problem, but I think I have a better solution. Use this regex:

regex = re.compile(r'\d+(?:,\d+)*')

Some reasons why it's better:

  1. (?:...) is a non-capturing group, so findall() returns just one result, the whole match, for each match.

  2. \d+(?:,\d+)* is a better regex: more efficient, and less likely to return false positives because every comma must be followed by digits.

  3. You should always use Python's raw strings for regexes if possible; you're less likely to be surprised by regex escape sequences (like \b for word boundary) being interpreted as string-literal escape sequences (like \b for backspace).
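A short sketch of points 1 and 3, assuming the string from the question plus a made-up 'cat concat cat' sample for the \b case:

```python
import re

s = 'There are 9,000,000 bicycles in Beijing.'

# No capturing group in the pattern, so findall() returns whole matches.
numbers = re.findall(r'\d+(?:,\d+)*', s)

# In a normal string literal '\b' is one backspace character;
# in a raw string r'\b' is a backslash followed by 'b'.
plain_len = len('\b')
raw_len = len(r'\b')

# Raw string: \b reaches the regex engine as a word boundary.
word_hit = re.findall(r'\bcat\b', 'cat concat cat')
# Non-raw string: the pattern looks for literal backspace characters.
word_miss = re.findall('\bcat\b', 'cat concat cat')

print(numbers, plain_len, raw_len, word_hit, word_miss)
```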
