
Findall vs search for overwriting groups in Python

I found the topic "Capturing group with findall?", but unfortunately it is more basic and covers only groups that do not overwrite themselves.

Please take a look at the following example:

import re
S = "abcabc"  # string used for all the cases below

1. Findall - no groups

print(re.findall(r"abc", S))  # ['abc', 'abc']

General idea: No groups here, so I expect findall to return a list of all matches - please confirm.

In this case: findall is looking for abc, finds it, returns it, then goes on and finds the second one.

2. Findall - one explicit group

print(re.findall(r"(abc)", S))  # ['abc', 'abc']

General idea: There is a group here, so I expect findall to return a list of all groups - please confirm.

In this case: Why two results while there is only one group? I understand it this way:

  • findall is looking for abc,

  • finds it,

  • places it in the group memory buffer,

  • returns it,

  • findall starts to look for abc again, and so on...

Is this reasoning correct?

3. Findall - overwriting groups

print(re.findall(r"(abc)+", S))  # ['abc']

This looks similar to the above, yet it returns only one abc. I understand it this way:

  • findall is looking for abc,

  • finds it,

  • places it in the group memory buffer,

  • does not return it because the RE itself demands to go on,

  • finds another abc,

  • places it in the group memory buffer (overwrites the previous abc),

  • the string ends, so searching ends as well.

Is this reasoning correct? I am being very specific here, so if anything is wrong (even a tiny detail), please let me know.

4. Search - overwriting groups

Search scans through a string looking for a single match, so re.search(r"abc", S) and re.search(r"(abc)", S) rather obviously return only one abc. Let me get right to:

m = re.search(r"(abc)+", S)  # keep the match object so it can be inspected
print(m.group())   # abcabc
print(m.groups())  # ('abc',)

a) Of course the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite the name) for m.group()? And is that why nothing gets overwritten for this method?

In fact, the grouping feature of parentheses is completely unnecessary here. In such cases I just want to use parentheses to mark what needs to be taken together when repeating things, without creating any regex groups.

b) Can anyone explain the mechanism behind returning abcabc (in terms of buffers and so on), similar to what I did in case 3?

First, let me state some facts:

  • A match value (match.group()) is the (sub)text that meets the whole pattern defined in a regular expression. Matches can contain zero or more capture groups.
  • A capture value (match.group(1..n)) is the part of the match that was matched by a parenthesized part of the pattern (a part enclosed in a pair of unescaped parentheses). It can equal the whole match if the whole pattern is enclosed in a capture group.
  • Some languages provide access to the capture collection, i.e. all the values that were captured with a quantified capture group like (\w{3})+. In Python, this is possible with the PyPI regex module; in .NET, with a CaptureCollection; etc.
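
To make the difference between a match value and a capture value concrete, here is a small sketch. The pattern (a)(bc) is only my illustration and does not come from the question. The last part assumes the third-party regex module from PyPI is installed; if it is not, that part is simply skipped.

```python
import re

S = "abcabc"

# Illustrative pattern with two capture groups (not from the question).
m = re.search(r"(a)(bc)", S)
whole = m.group()    # match value: text matched by the whole pattern
first = m.group(1)   # capture value of group 1
second = m.group(2)  # capture value of group 2
print(whole, first, second, m.groups())

# Capture collection: assumes the third-party "regex" module is available.
try:
    import regex
    all_captures = regex.search(r"(\w{3})+", "abcedf").captures(1)
except ImportError:
    all_captures = None  # module not installed, nothing to show
print(all_captures)
```

With the standard re module only the last capture of a repeated group is available; the captures() call is what exposes every repetition.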

1: No groups here so I expect findall to return a list of all matches - please confirm.

  • True. Only when capturing groups are defined in the pattern does re.findall return a list of captured submatches. In the case of abc, re.findall returns a list of matches.
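
As a quick check of that rule, the sketch below compares what re.findall returns for zero, one, and two capturing groups. The two-group pattern (a)(bc) is my own addition for illustration.

```python
import re

S = "abcabc"

# No capturing groups: a list of whole matches.
no_group = re.findall(r"abc", S)

# One capturing group: a list of that group's captured strings.
one_group = re.findall(r"(abc)", S)

# Two capturing groups (illustrative pattern): a list of tuples,
# one tuple per match, one item per group.
two_groups = re.findall(r"(a)(bc)", S)

print(no_group)    # ['abc', 'abc']
print(one_group)   # ['abc', 'abc']
print(two_groups)  # [('a', 'bc'), ('a', 'bc')]
```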

2: Why two results while there is only one group?

  • There are two matches: re.findall(r"(abc)", S) finds two matches in abcabc, and each match has one submatch, or captured substring, so the resulting list has 2 elements (abc and abc).
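
One way to see the two separate matches is to iterate with re.finditer, which yields a match object per match instead of only the captured strings. This is just a sketch for inspection:

```python
import re

S = "abcabc"

# Collect, for every match: the whole match, its groups, and its position.
matches = [(m.group(0), m.groups(), m.span())
           for m in re.finditer(r"(abc)", S)]

for whole, groups, span in matches:
    print(whole, groups, span)
# abc ('abc',) (0, 3)
# abc ('abc',) (3, 6)
```

Each match carries its own group 1, so nothing is overwritten between matches here.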

3: Is this reasoning correct?

  • re.findall(r"(abc)+", S) is looking for a match of the form abc, abcabc, abcabcabc and so on. It matches the run as a whole and keeps only the last abc in the capture group 1 buffer. So, I think your reasoning is correct. "The RE itself demands to go on" can be stated more precisely as "the matching is not yet complete" (there are still characters for the regex engine to test for a match).
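
The spans make this visible: there is a single match covering the whole string, and group 1 ends up pointing at the last repetition. The sketch also shows two variants that are not in the question: a non-capturing group (?:...), and an outer capturing group wrapped around it, either of which makes findall return the whole run.

```python
import re

S = "abcabc"

m = re.search(r"(abc)+", S)
whole_span = m.span(0)   # span of the whole match
group_span = m.span(1)   # span of the last repetition kept in group 1
print(whole_span, group_span)        # (0, 6) (3, 6)

last_only = re.findall(r"(abc)+", S)
print(last_only)                     # ['abc']

# Non-capturing group: no groups at all, so findall returns whole matches.
non_capturing = re.findall(r"(?:abc)+", S)
print(non_capturing)                 # ['abcabc']

# Outer capturing group around the non-capturing repetition.
outer_group = re.findall(r"((?:abc)+)", S)
print(outer_group)                   # ['abcabc']
```

The non-capturing form is also the usual way to group for repetition when no capture is wanted.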

4: the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite the name) for m.group()?

  • No, the last group value is kept in this case. If you change your regex to (\w{3})+ and the string to abcedf, you will see the difference: the captured value (m.groups()) for that case will be edf, while m.group() still returns the whole match. As for "And that is why nothing gets overwritten for this method?": that is not right, because each earlier capture group value is overwritten by the one that follows.
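
The abcedf case can be checked directly. In this sketch m.group() is the whole match, while m.group(1) holds only the last three-character chunk:

```python
import re

m = re.search(r"(\w{3})+", "abcedf")

whole = m.group()        # the whole match
last_capture = m.group(1)  # last value written to group 1
capture_span = m.span(1)   # where that last capture sits in the string

print(whole)         # abcedf
print(last_capture)  # edf
print(m.groups())    # ('edf',)
print(capture_span)  # (3, 6)
```

The first chunk, abc, was captured and then replaced when the second repetition matched.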

5: Can anyone explain a mechanism behind returning abcabc (in terms of buffers and so on) similarly like I did in bullet 3?

re.search(r"(abc)+", S) will match abcabc (match, not capture) because:

  1. abcabc is searched for abc from left to right. The RE finds abc at the start, puts it into capture group buffer 1, and tries to find another abc right after the first c.
  2. The RE finds the 2nd abc and overwrites the capture group #1 buffer with it. It then tries to find another abc.
  3. No more abc is found, so the matched value found so far is returned: abcabc.
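
The three steps above can be imitated with a toy loop in plain Python. This is only a simplified model of the description, not how the re engine is actually implemented (a real engine also handles backtracking and arbitrary patterns); the function name simulate_repeated_group is hypothetical.

```python
import re


def simulate_repeated_group(text, unit="abc"):
    """Toy model of matching (unit)+ : returns (whole, group1, history)."""
    start = text.find(unit)          # leftmost place where unit occurs
    if start == -1:
        return None                  # no match at all
    pos = start
    group1 = None                    # the "capture group 1 buffer"
    history = []                     # every value the buffer ever held
    while text.startswith(unit, pos):
        group1 = text[pos:pos + len(unit)]  # overwrite the buffer
        history.append(group1)
        pos += len(unit)             # try for another repetition
    return text[start:pos], group1, history


result = simulate_repeated_group("abcabc")
print(result)  # ('abcabc', 'abc', ['abc', 'abc'])

# Compare with the real engine on the same input.
m = re.search(r"(abc)+", "abcabc")
print(m.group(), m.group(1))  # abcabc abc
```

The history list shows that the buffer held abc twice, while only the final value and the overall matched text are reported, which is what m.group(1) and m.group() return.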
