
Get all possible matches for regex (in Python)?

I have a regex that can match a string in multiple overlapping ways. However, it seems to capture only one possible match in the string. How can I get all possible matches? I've tried finditer with no success, but maybe I'm using it wrong.

The string I'm trying to parse is:

foo-foobar-foobaz

The regex I'm using is:

(.*)-(.*)

>>> import re
>>> s = "foo-foobar-foobaz"
>>> matches = re.finditer(r'(.*)-(.*)', s)
>>> [match.group(1) for match in matches]
['foo-foobar']

I want the match (foo and foobar-foobaz), but it seems to only get (foo-foobar and foobaz).

No problem:

>>> import re
>>> regex = "([^-]*-)(?=([^-]*))"
>>> for result in re.finditer(regex, "foo-foobar-foobaz"):
...     print("".join(result.groups()))
...
foo-foobar
foobar-foobaz

By putting the second capturing parenthesis in a lookahead assertion, you can capture its contents without consuming it in the overall match.

I've also used [^-]* instead of .* because the dot also matches the separator -, which you probably don't want.
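To see the effect of the lookahead in isolation, here is a minimal sketch on a smaller, made-up input (the string "ababa" and the variable names are illustrative assumptions, not part of the question). A plain pattern consumes what it matches, so the next search resumes after it; a capturing group inside a lookahead consumes nothing, so the next search resumes one character later and can find an overlapping occurrence.

```python
import re

text = "ababa"  # illustrative input: "aba" occurs at index 0 and again at index 2

# Ordinary pattern: the first match consumes indices 0-2, so the
# overlapping occurrence starting at index 2 is never tried.
consuming = re.findall(r"aba", text)

# Capturing group inside a lookahead: each match is zero-width, so the
# scan only advances by one character and finds both occurrences.
lookahead = re.findall(r"(?=(aba))", text)

print(consuming)   # ['aba']
print(lookahead)   # ['aba', 'aba']
```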

It's not something regex engines tend to be able to do. I don't know if Python can. Perl can, using the following:

local our @matches;
"foo-foobar-foobaz" =~ /
    ^(.*)-(.*)\z
    (?{ push @matches, [ $1, $2 ] })
    (*FAIL)
/xs;

This specific problem can probably be solved with the regex engine in many languages, using the following technique:

my @matches;
while ("foo-foobar-foobaz" =~ /(?=-(.*)\z)/gsp) {
   push @matches, [ ${^PREMATCH}, $1 ];
}

(${^PREMATCH} refers to what comes before the point where the regex matched, and $1 refers to what the first () matched.)
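A possible Python rendering of the same technique, offered as a sketch rather than a verified equivalent of the Perl. Python's re module has no ${^PREMATCH}, so this assumes that slicing the input up to match.start() is an acceptable stand-in; re.DOTALL plays the role of Perl's /s flag, and \Z anchors at the end of the string.

```python
import re

s = "foo-foobar-foobaz"

# Zero-width lookahead: succeeds at every "-" and captures everything
# after it (up to the end of the string) without consuming anything.
pattern = re.compile(r"(?=-(.*)\Z)", re.DOTALL)

# s[:m.start()] stands in for Perl's ${^PREMATCH}.
matches = [(s[:m.start()], m.group(1)) for m in pattern.finditer(s)]

print(matches)
# [('foo', 'foobar-foobaz'), ('foo-foobar', 'foobaz')]
```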

But you can easily solve this specific problem outside the regex engine:

my @parts = split(/-/, "foo-foobar-foobaz");
my @matches;
for (1..$#parts) {
   push @matches, [
      join('-', @parts[0..$_-1]),
      join('-', @parts[$_..$#parts]),
   ];
}
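The split-and-rejoin loop above might translate to Python roughly as follows. This is a sketch, and the variable names are just carried over from the Perl.

```python
s = "foo-foobar-foobaz"
parts = s.split("-")

# For every split point i, rejoin the pieces on each side of it.
matches = [
    ("-".join(parts[:i]), "-".join(parts[i:]))
    for i in range(1, len(parts))
]

print(matches)
# [('foo', 'foobar-foobaz'), ('foo-foobar', 'foobaz')]
```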

Sorry for using Perl syntax, but you should be able to get the idea. Translations to Python welcome.

If you want to detect overlapping matches, you'll have to implement it yourself. Essentially, for a string foo:

  1. Find the first match that starts at string index i
  2. Run the matching function again against foo[i+1:]
  3. Repeat steps 1 and 2 on the incrementally shorter remaining portion of the string.

It gets trickier if you're using arbitrary-length capture groups (e.g. (.*)), because you probably don't want both foo-foobar and oo-foobar as matches. You'd have to do some extra analysis to move i farther than just +1 after each match: you'd need to move it by the entire length of the first captured group's value, plus one.
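The basic three-step loop could be sketched in Python like this. The function name overlapping_matches is hypothetical, and the sketch passes a start position to Pattern.search instead of slicing foo[i+1:], so that match offsets stay relative to the original string (an assumption about what is wanted; note that ^ then still refers to the real start of the string). It always advances by exactly one character and does not attempt the extra analysis for arbitrary-length groups described above.

```python
import re

def overlapping_matches(pattern, text):
    """Collect matches by restarting the search one character past
    the start of each previous match (hypothetical helper)."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)   # step 1: first match at or after pos
        if m is None:
            break
        results.append(m)
        pos = m.start() + 1           # steps 2-3: retry from i + 1
    return results

spans = [m.span() for m in overlapping_matches(r"aba", "ababa")]
print(spans)   # [(0, 3), (2, 5)]
```

Run with (.*)-(.*) on foo-foobar-foobaz, this loop produces exactly the unwanted near-duplicates mentioned above (a match starting at foo-foobar, then one starting at oo-foobar, and so on), which is why the larger step is needed for such patterns.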
