
Is there a way to match a set of groups in any order in a regex?

I looked through the related questions; there were quite a few, but I don't think any of them answered this one. I am very new to regex but I'm trying to get better, so please bear with me. I am trying to match several groups in a string, but in any order. Is this something I should be using regex for? If so, how? If it matters, I plan to use these in IronPython.

EDIT: Someone asked me to be more specific, so here:

I want to use re.match with a regex like:

\\[image\\s*(?(@alt:(?<alt>.*?);).*(@title:(?<title>.*?);))*.*\\](?<arg>.*?)\\[\\/image\\]

But it will only match the named groups when they are in the right order and separated by a space. I would like to be able to match the named groups in any order, as long as they appear where they do now in the regex.

A typical string that this will be applied to might look like:

[image @alt:alien; @title:reddit alien;]http://www.reddit.com/alien.png[/image]

But I should have no problem matching:

[image @title:reddit alien; @alt:alien;]http://www.reddit.com/alien.png[/image]

So the 'attributes' (things that come between '@' and ';' in the first 'tag') should be matched in any order, as long as they both appear.

The answer to the question in your title is "no": to match N groups "in any order", the regex would need an "or" (the | feature in the regex pattern) among the N! (N factorial) possible permutations of the groups, where N! is the product of all integers from 1 to N. That number grows extremely fast: for N equal to 6 it's already 720, for 7 it's 5040, and so on at a dizzying pace. So this approach is totally impractical for any N which isn't really tiny.
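
To make the growth concrete, here is a minimal sketch that builds the alternation mechanically with itertools.permutations. The helper name any_order_pattern, the ".*?" separator, and the two sub-patterns are assumptions chosen for illustration. The sub-patterns are unnamed because a named group cannot be repeated in every branch.

```python
import itertools
import math
import re


def any_order_pattern(parts, sep=r".*?"):
    """Join every ordering of `parts` into one alternation (N! branches)."""
    branches = (sep.join(order) for order in itertools.permutations(parts))
    return "|".join("(?:%s)" % branch for branch in branches)


# Hypothetical sub-patterns for the two attributes in the question.
parts = [r"@alt:[^;]*;", r"@title:[^;]*;"]
pattern = any_order_pattern(parts)

# Number of branches the alternation needs for N = 1..7 groups.
branch_counts = {n: math.factorial(n) for n in range(1, 8)}
```

With two parts the pattern has only two branches, but branch_counts shows how quickly the count climbs as N grows.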

There may be many solutions, depending on what you want the groups to be separated by. Let's say, for example, that you don't care (if you DO care, edit your question with better specs).

In this case, if overlapping matches are impossible or are OK with you, make N separate regular expressions, one per group. Say these N compiled RE objects are in a list named grps; then

mos = [g.search(thestring) for g in grps]

is the list of match objects for the groups (None for a group which doesn't match). With the mos list you can do all sorts of checks and/or further manipulations: for example, all(mos) is True if and only if all the groups matched, in which case [m.group() for m in mos] is the list of substrings that have been matched, and so on.
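
Applied to the string from the question, this might look like the sketch below. The two sub-patterns are assumptions (each attribute value is taken to run up to the next ';'), and collecting the results into a dict named found is just one possible follow-up.

```python
import re

# One compiled pattern per attribute. The attribute names come from the
# question; the exact sub-patterns are assumptions for illustration.
grps = [
    re.compile(r"@alt:(?P<alt>[^;]*);"),
    re.compile(r"@title:(?P<title>[^;]*);"),
]

thestring = ("[image @title:reddit alien; @alt:alien;]"
             "http://www.reddit.com/alien.png[/image]")

# One match object (or None) per pattern, independent of attribute order.
mos = [g.search(thestring) for g in grps]

found = {}
if all(mos):
    for m in mos:
        found.update(m.groupdict())
```

Because each pattern searches the whole string on its own, the order of the attributes in the tag no longer matters.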

If you need non-overlapping matches, it's a bit more complicated: you may extract the boundaries of all possible matches for each group, then see if there's a way to extract from these N lists a set of N intervals, one per list, so that no two of them intersect. This is a somewhat subtle algorithm (if you want reasonable speed for a large N, of course), so I think it's worth a separate question. In any case it's not worth discussing right here, when the very issue of whether it's needed or not depends on so many factors that you have not specified.
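
For a small N, the interval selection can be sketched as a plain backtracking search. This is a brute-force illustration, not the fast algorithm alluded to above: it is exponential in the worst case, and all_spans only collects the left-to-right matches that re.finditer reports rather than every possible match. All function names here are hypothetical.

```python
import re


def all_spans(pattern, text):
    """Return the (start, end) span of each match reported by re.finditer."""
    return [m.span() for m in re.finditer(pattern, text)]


def overlaps(a, b):
    """True if the half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def pick_disjoint(span_lists, chosen=()):
    """Pick one span per list so that no two picks overlap, or return None."""
    if not span_lists:
        return chosen
    for span in span_lists[0]:
        if not any(overlaps(span, c) for c in chosen):
            result = pick_disjoint(span_lists[1:], chosen + (span,))
            if result is not None:
                return result
    return None  # no candidate for this group fits; the caller backtracks
```

For example, pick_disjoint([all_spans(r"a|x", "xay"), all_spans(r"x", "xay")]) first tries the span of "x" for the first group, finds that the second group then has nothing left, and backtracks to the span of "a".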

So, please edit your question with more precise specifications first, and then things can perhaps be clarified to provide you with the code and/or algorithms you need.

Edit: I see the OP has now clarified the issue, at least to the extent of providing an example. Confusingly, though, he offers a RE pattern example and a string example that should not match, regardless of ordering (the RE specifies the presence of a substring @title which the example string does not have; puzzling!).

Anyway, if the number of groups in the example (two which appear to be interchangeable, one which appears to have to occur in a specific spot) is representative of the OP's actual problems, then the total number of permutations of interest is just two, so joining the "just two" permutations with a vertical bar | would of course be quite feasible. Is that the case in the OP's real problems, though...?

Edit: if the number of permutations of interest is tiny, here's an example of one way to avoid the problem of repeated group names in the pattern (the syntax requires Python 2.7 or later, but only for the final "dict comprehension"; the same functionality is available in many earlier versions of Python, just with the less elegant dict(('a', ... syntax ;-)):

>>> import re
>>> r = re.compile(r'(?P<a1>a.*?a).*?(?P<b1>b.*?b)|(?P<b2>b.*?b).*?(?P<a2>a.*?a)')
>>> m = r.search('zzzakkkavvvbxxxbnnn')
>>> g = m.groupdict()
>>> d = {'a':(g.get('a1') or g.get('a2')), 'b':(g.get('b1') or g.get('b2'))}
>>> d
{'a': 'akkka', 'b': 'bxxxb'}
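
The same trick could be carried over to the image tag from the question roughly as follows. The pattern is a sketch under stated assumptions: attribute values contain no ';', both attributes are present, and only whitespace separates them. The names image_re and parse_image are hypothetical.

```python
import re

# Both orderings are spelled out. Group names carry a numeric suffix because
# Python's re module rejects a repeated group name within one pattern.
image_re = re.compile(
    r"\[image\s+(?:"
    r"@alt:(?P<alt1>[^;]*);\s*@title:(?P<title1>[^;]*);"
    r"|"
    r"@title:(?P<title2>[^;]*);\s*@alt:(?P<alt2>[^;]*);"
    r")\s*\](?P<arg>.*?)\[/image\]"
)


def parse_image(text):
    """Return a dict with alt, title and arg, or None if no tag is found."""
    m = image_re.search(text)
    if m is None:
        return None
    g = m.groupdict()
    # Only one branch of the alternation matched; take whichever is not None.
    return {
        "alt": g["alt1"] if g["alt1"] is not None else g["alt2"],
        "title": g["title1"] if g["title1"] is not None else g["title2"],
        "arg": g["arg"],
    }
```

Testing against None rather than using `or` keeps an empty attribute value (such as "@alt:;") from being mistaken for a missing one.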

This is very similar to one of the key problems with using regular expressions to parse HTML: there is no requirement that attributes always be specified in the same order, and many tags have surprising attributes (like <br clear="all">). So it seems you are working with a very similar markup syntax.

Pyparsing addresses this problem in an indirect way: instead of trying to parse all the different permutations, parse the general "@attrname:attribute value;" syntax, and keep track of the attribute keys and values in an attribute mapping data structure. The mapping makes it easy to get the "title" attribute, regardless of whether it came first or last in the image tag. This behavior is built into the pyparsing API methods makeHTMLTags and makeXMLTags.

Of course, this markup is not XML, but a similar approach gives results that are pretty easy to work with:

text = """[image @alt:alien; @title:reddit alien;]http://www.reddit.com/alien1.png[/image]

But I should have no problem matching:

[image @title:reddit alien; @alt:alien;]http://www.reddit.com/alien2.png[/image]
"""

from pyparsing import Suppress, Group, Word, alphas, SkipTo, Dict, ZeroOrMore

LBRACK,RBRACK,COLON,SEMI,AT = map(Suppress,"[]:;@")
tagAttribute = Group(AT + Word(alphas) + COLON + SkipTo(SEMI) + SEMI)
imageTag = LBRACK + "image" + Dict(ZeroOrMore(tagAttribute)) + RBRACK
imageLink = imageTag + SkipTo("[/image]")("text")

for taginfo in imageLink.searchString(text):
    # print(...) with a single argument behaves the same on Python 2 and 3
    print(taginfo.alt)
    print(taginfo.title)
    print(taginfo.text)
    print("")

Prints:

alien
reddit alien
http://www.reddit.com/alien1.png

alien
reddit alien
http://www.reddit.com/alien2.png
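
If pulling in pyparsing is not an option, the same attribute-mapping idea can be approximated with the standard re module alone. This is a minimal sketch, not an equivalent of the pyparsing grammar: it assumes attribute names are purely alphabetic and that neither names nor values contain ']' or ';'. The names TAG_RE, ATTR_RE and parse_image_tags are hypothetical.

```python
import re

# Outer pattern: the whole tag, with the raw attribute text captured as one blob.
TAG_RE = re.compile(
    r"\[image\b(?P<attrs>[^\]]*)\](?P<text>.*?)\[/image\]", re.DOTALL
)
# Inner pattern: one generic "@name:value;" attribute, whatever its name.
ATTR_RE = re.compile(r"@(?P<name>[A-Za-z]+):(?P<value>[^;]*);")


def parse_image_tags(text):
    """Return a list of (attribute dict, link text) pairs, one per image tag."""
    results = []
    for tag in TAG_RE.finditer(text):
        attrs = {
            m.group("name"): m.group("value")
            for m in ATTR_RE.finditer(tag.group("attrs"))
        }
        results.append((attrs, tag.group("text")))
    return results
```

Each attribute ends up in a dict keyed by its name, so attrs["title"] works no matter where the attribute appeared in the tag.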
