How to match regex with multiple overlapping patterns?

The context

I have a string of mixed MP3 information that I need to match against a pattern made of arbitrary strings and tokens. It works like this:

  1. The program shows the user a given string:

the Beatles_Abbey_Road-SomeWord-1969

  2. The user enters a pattern to help the program parse the string:

the %Artist_%Album-SomeWord-%Year

  3. Then I'd like to show the results of the matches (this is the part I need help with):

2 possible matches found:
[1] {'Artist': 'Beatles', 'Album':'Abbey_Road', 'Year':1969}
[2] {'Artist': 'Beatles_Abbey', 'Album':'Road', 'Year':1969}

The problem

As an example, let's say the pattern is an artist name followed by a title (delimiter: '-').

Example 1:

>>> artist = 'Bob Marley'
>>> title = 'Concrete Jungle'
>>> re.findall(r'(.+)-(.+)', '%s-%s' % (artist,title))
[('Bob Marley', 'Concrete Jungle')]

So far, so good. But...
I have no control over the delimiter used and no guarantee that it does not appear inside the tags, so trickier cases exist:

Example 2:

>>> artist = 'Bob-Marley'
>>> title = 'Roots-Rock-Reggae'
>>> re.findall(r'(.+)-(.+)', '%s-%s' % (artist,title))
[('Bob-Marley-Roots-Rock', 'Reggae')]

As expected, it doesn't work in that case: the greedy first group swallows everything up to the last delimiter, so only one of the possible splits is returned.
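For reference, here is a minimal sketch of why tuning the quantifier alone does not help. The variable names `s`, `greedy` and `lazy` are only for illustration. Making the first group non-greedy just moves the split from the last delimiter to the first one; either way `re.findall` reports a single non-overlapping match.

```python
import re

s = 'Bob-Marley-Roots-Rock-Reggae'

# Greedy first group: backtracks only as far as the last '-'.
greedy = re.findall(r'(.+)-(.+)', s)

# Non-greedy first group: stops at the first '-'.
lazy = re.findall(r'(.+?)-(.+)', s)

print(greedy)  # [('Bob-Marley-Roots-Rock', 'Reggae')]
print(lazy)    # [('Bob', 'Marley-Roots-Rock-Reggae')]
```

Both calls consume the whole string in one match, which is why the intermediate splits never show up.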

How can I generate all possible combinations of artist/title?

[('Bob', 'Marley-Roots-Rock-Reggae'),
 ('Bob-Marley', 'Roots-Rock-Reggae'),
 ('Bob-Marley-Roots', 'Rock-Reggae'),
 ('Bob-Marley-Roots-Rock', 'Reggae')]

Is regex the right tool for this job?

Please keep in mind that the number of tags to match and the delimiters between them are not fixed but user-defined (so the regex has to be buildable dynamically).
I experimented with greedy vs. minimal matching and with lookahead assertions, with no success.

Thanks for your help.

This solution seems to work. In addition to the regex, you will need a list of tuples describing the pattern, where each element corresponds to one capturing group of the regex.

For your Beatles example, it would look like this:

pattern = r"the (.+_.+)-SomeWord-(.+)"
groups = [(("Artist", "Album"), "_"), ("Year", None)]

Because the Artist and Album are separated only by a single separator, they are captured together in one group. The first item in the list indicates that the first capture group will be split into an Artist and an Album, using _ as the separator. The second item indicates that the second capture group will be used as the Year directly, since the second element of its tuple is None. You could then call the function like this:

>>> get_mp3_info(groups, pattern, "the Beatles_Abbey_Road-SomeWord-1969")
[{'Album': 'Abbey_Road', 'Year': '1969', 'Artist': 'Beatles'}, {'Album': 'Road', 'Year': '1969', 'Artist': 'Beatles_Abbey'}]
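To see where the two results come from, it may help to look at what the regex captures before any splitting happens. The following is only an illustrative sketch of that intermediate step, not part of the function itself; the names `captured`, `parts`, `cuts` and `candidates` are made up for the example.

```python
import re
from itertools import combinations

pattern = r"the (.+_.+)-SomeWord-(.+)"
title = "the Beatles_Abbey_Road-SomeWord-1969"

# The regex only isolates the ambiguous "Artist_Album" blob and the year.
captured = re.match(pattern, title).groups()
print(captured)  # ('Beatles_Abbey_Road', '1969')

# Splitting the blob gives three chunks; two tags need one cut point,
# and a cut may fall after chunk 1 or after chunk 2.
parts = captured[0].split("_")
cuts = list(combinations(range(1, len(parts)), 1))
print(cuts)  # [(1,), (2,)]

# Each cut point yields one (Artist, Album) candidate.
candidates = [("_".join(parts[:c]), "_".join(parts[c:])) for (c,) in cuts]
print(candidates)  # [('Beatles', 'Abbey_Road'), ('Beatles_Abbey', 'Road')]
```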

Here is the code:

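import re
from itertools import combinations

def get_mp3_info(groups, pattern, title):
    """Return every tag assignment of `title` that is consistent with `pattern`."""
    match = re.match(pattern, title)
    if not match:
        return []
    result = [{}]
    for i, v in enumerate(groups):
        if v[1] is None:
            # No separator: the whole capture group is a single tag.
            for r in result:
                r[v[0]] = match.group(i+1)
        else:
            # Split the capture group on its separator, then try every way
            # of cutting the chunks into len(v[0]) consecutive runs.
            splits = match.group(i+1).split(v[1])
            before = [d.copy() for d in result]
            for comb in combinations(range(1, len(splits)), len(v[0])-1):
                temp = [d.copy() for d in before]
                # None at both ends makes the slices below open-ended.
                comb = (None,) + comb + (None,)
                for j, split in enumerate(zip(comb, comb[1:])):
                    for t in temp:
                        t[v[0][j]] = v[1].join(splits[split[0]:split[1]])

                # The first combination replaces the partial results;
                # later ones are appended as additional candidates.
                if v[0][0] in result[0]:
                    result.extend(temp)
                else:
                    result = temp
    return result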

And another example with Bob Marley:

>>> import pprint
>>> pprint.pprint(get_mp3_info([(("Artist", "Title"), "-")],
...               r"(.+-.+)", "Bob-Marley-Roots-Rock-Reggae"))
[{'Artist': 'Bob', 'Title': 'Marley-Roots-Rock-Reggae'},
 {'Artist': 'Bob-Marley', 'Title': 'Roots-Rock-Reggae'},
 {'Artist': 'Bob-Marley-Roots', 'Title': 'Rock-Reggae'},
 {'Artist': 'Bob-Marley-Roots-Rock', 'Title': 'Reggae'}]
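The cut-point idea is not limited to two tags per group. As a standalone sketch (the helper name `split_into` and its parameters are hypothetical, not part of the code above), choosing `n - 1` cut points among the chunk boundaries enumerates every way to divide a string into `n` fields:

```python
from itertools import combinations

def split_into(text, sep, n):
    """Return every way to divide `text` into `n` fields at occurrences of `sep`."""
    chunks = text.split(sep)
    results = []
    # Pick n-1 distinct cut positions strictly inside the chunk list.
    for cuts in combinations(range(1, len(chunks)), n - 1):
        bounds = (0,) + cuts + (len(chunks),)
        fields = tuple(sep.join(chunks[a:b]) for a, b in zip(bounds, bounds[1:]))
        results.append(fields)
    return results

print(split_into("a-b-c-d", "-", 3))
# [('a', 'b', 'c-d'), ('a', 'b-c', 'd'), ('a-b', 'c', 'd')]
```

If the string has fewer than `n` chunks, no combination exists and the list comes back empty.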

What about something like this instead of using a regular expression?

string = "Bob-Marley-Roots-Rock-Reggae"

def allSplits(string, sep):
    results = []
    # Split on the separator that was passed in, not a hard-coded '-'.
    chunks = string.split(sep)
    # Cut after chunk i, for every position except the last.
    for i in range(len(chunks)-1):
        results.append((
            sep.join(chunks[:i+1]),
            sep.join(chunks[i+1:])
        ))

    return results

print(allSplits(string, '-'))
[('Bob', 'Marley-Roots-Rock-Reggae'),
 ('Bob-Marley', 'Roots-Rock-Reggae'),
 ('Bob-Marley-Roots', 'Rock-Reggae'),
 ('Bob-Marley-Roots-Rock', 'Reggae')]
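Since the question asks for user-defined patterns such as `the %Artist_%Album-SomeWord-%Year`, the same no-regex idea can be pushed one step further with a small backtracking matcher. This is only a sketch under assumptions: the helper names `parse_pattern` and `all_matches` are hypothetical, a tag is assumed to be `%` followed by ASCII letters only, and every tag must match at least one character. The worst case is exponential in the number of tags, which should be fine for filename-sized strings.

```python
import re

def parse_pattern(pattern):
    """Turn 'the %Artist_%Album' into [('lit', 'the '), ('tag', 'Artist'), ...]."""
    pieces = []
    # With one capture group, re.split alternates literal, tag name, literal, ...
    for i, part in enumerate(re.split(r'%([A-Za-z]+)', pattern)):
        if i % 2 == 1:
            pieces.append(('tag', part))
        elif part:
            pieces.append(('lit', part))
    return pieces

def all_matches(pieces, text):
    """Yield every {tag: value} dict for which `pieces` consume `text` exactly."""
    if not pieces:
        if not text:
            yield {}
        return
    kind, value = pieces[0]
    if kind == 'lit':
        # A literal must appear verbatim at the current position.
        if text.startswith(value):
            yield from all_matches(pieces[1:], text[len(value):])
    else:
        # A tag may take any non-empty prefix; try them shortest first.
        for end in range(1, len(text) + 1):
            for rest in all_matches(pieces[1:], text[end:]):
                yield {value: text[:end], **rest}

pieces = parse_pattern('the %Artist_%Album-SomeWord-%Year')
for found in all_matches(pieces, 'the Beatles_Abbey_Road-SomeWord-1969'):
    print(found)
# {'Artist': 'Beatles', 'Album': 'Abbey_Road', 'Year': '1969'}
# {'Artist': 'Beatles_Abbey', 'Album': 'Road', 'Year': '1969'}
```

With the pattern `%Artist-%Title` and the string `Bob-Marley-Roots-Rock-Reggae`, this yields the same four splits as `allSplits`.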
