
Matching 2 regular expressions in Python

Is it possible to match 2 regular expressions in Python?

For instance, I have a use-case wherein I need to compare 2 expressions like this:

re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)

I would expect an RE object to be returned.

But obviously, Python expects a string as the second parameter. Is there a way to achieve this, or is it a limitation of the way regex matching works?


Background: I have a list of regular expressions [r1, r2, r3, ...] that match a string, and I need to find out which expression is the most specific match for the given string. The way I assumed I could make it work was by:
(1) matching r1 with r2.
(2) then matching r2 with r1.
If both match, we have a 'tie'. If only (1) worked, r1 is a 'better' match than r2, and vice-versa.
I'd loop (1) and (2) over the entire list.

I admit it's a bit hard to wrap one's head around (mostly because my description is probably incoherent), but I'd really appreciate it if somebody could give me some insight into how I can achieve this. Thanks!

Outside of the syntax clarification on re.match, I think I understand that you are struggling with taking two or more unknown (user-input) regular expressions and classifying which is a more 'specific' match against a string.

Recall for a moment that a Python regex really is a type of computer program. Most modern forms, including Python's regex, are based on Perl. Perl's regexes have recursion, backtracking, and other features that defy trivial inspection. Indeed, a rogue regex can be used as a form of denial-of-service attack.

To see this on your own computer, try:

>>> re.match(r'^(a+)+$','a'*24+'!')

That takes about 1 second on my computer. Now increase the 24 in 'a'*24 to a slightly larger number, say 28. That takes a lot longer. Try 48... You will probably need to press CTRL+C now. The time increase as the number of a's increases is, in fact, exponential.
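To make the exponential growth visible, here is a small timing sketch (my addition, not part of the original answer) that runs the same pathological pattern on a few input sizes; expect the time to roughly double with each extra 'a':

```python
import re
import time

# Time the pathological pattern for growing input sizes; the trailing '!'
# forces the overall match to fail, triggering exponential backtracking.
for n in (18, 20, 22, 24):
    start = time.perf_counter()
    re.match(r'^(a+)+$', 'a' * n + '!')
    print(n, round(time.perf_counter() - start, 3))
```

On a typical machine the last iteration already takes on the order of a second; adding a handful more a's makes it effectively hang.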

You can read more about this issue in Russ Cox's wonderful paper 'Regular Expression Matching Can Be Simple And Fast'. Russ Cox is the Google engineer who built Google Code Search in 2006. As Cox observes, consider matching the regex 'a?'*33 + 'a'*33 against the string 'a'*99 with awk and Perl (or Python or PCRE or Java or PHP or ...). Awk matches in 200 microseconds, but Perl would require 10^15 years because of exponential backtracking.

So the conclusion is: it depends! What do you mean by a more specific match? Look at some of Cox's regex simplification techniques in RE2. If your project is big enough to write your own libraries (or use RE2) and you are willing to restrict the regex grammar used (i.e., no backtracking or recursive forms), I think the answer is that you would classify 'a better match' in a variety of ways.

If you are looking for a simple way to state that (regex_3 < regex_1 < regex_2) when matched against some string using Python or Perl's regex language, I think the answer is that it is very, very hard (i.e., this problem is NP-complete).

Edit

Everything I said above is true! However, here is a stab at sorting matching regular expressions based on one form of 'specific': how many edits it takes to get from the regex to the string. The greater the number of edits (or the higher the Levenshtein distance), the less 'specific' the regex is.

You be the judge of whether this works (I don't know what 'specific' means to you for your application):

import re

def ld(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a, b = b, a
        n, m = m, n

    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

s = 'Mary had a little lamb'
d = {}
regs = [r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb', r'\b\w+mb',
        r'Mary.*little lamb', r'.*[lL]ittle [Ll]amb', r'\blittle\b', s, r'little']

for reg in regs:
    m = re.search(reg, s)
    if m:
        print("'%s' matches '%s' with sub group '%s'" % (reg, s, m.group(0)))
        ld1 = ld(reg, m.group(0))
        ld2 = ld(m.group(0), s)
        score = max(ld1, ld2)
        print("  %i edits regex->match(0), %i edits match(0)->s" % (ld1, ld2))
        print("  score: ", score)
        d[reg] = score
        print()
    else:
        print("'%s' does not match '%s'" % (reg, s))

print("   ===== %s =====    === %s ===" % ('RegEx'.center(10), 'Score'.center(10)))

for key, value in sorted(d.items(), key=lambda kv: (kv[1], kv[0])):
    print("   %22s        %5s" % (key, value))

The program takes a list of regexes and matches each against the string Mary had a little lamb.

Here is the sorted ranking from "most specific" to "least specific":

   =====   RegEx    =====    ===   Score    ===
   Mary had a little lamb            0
        Mary.*little lamb            7
            .*little lamb           11
              little lamb           11
      .*[lL]ittle [Ll]amb           15
               \blittle\b           16
                   little           16
                     Mary           18
                  \b\w+mb           18
                     lamb           18
                       .*           22

This is based on the (perhaps simplistic) assumption that: a) the number of edits (the Levenshtein distance) to get from the regex itself to the matching substring is the result of wildcard expansions or replacements; b) the number of edits to get from the matching substring to the initial string. The score takes the larger of the two.

As two simple examples:

  1. .* (or .*.* or .*?.* etc.) against any string requires a large number of edits to get to the string, in fact equal to the string length. This is the maximum possible number of edits, the highest score, and the least 'specific' regex.
  2. The regex of the string itself matched against the string is as specific as possible. No edits are needed to change one into the other, resulting in a 0, the lowest score.

As stated, this is simplistic. Anchors should increase specificity, but they do not in this case. Very short strings don't work well because the wildcard may be longer than the string.

Edit 2

I got anchor parsing to work pretty darn well using the undocumented sre_parse module in Python. Type >>> help(sre_parse) if you want to read more...

This is the go-to worker module underlying the re module. It has been in every Python distribution since 2001, including all the P3k versions. (In recent Python 3 versions it is deprecated, with the implementation living in re._parser, but importing it still works.) It may go away, but I don't think it is likely...
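To get a feel for what sre_parse exposes, here is a quick peek (my own sketch) at the parse tree of a small pattern; anchors show up as AT nodes:

```python
import sre_parse  # undocumented; a deprecated alias of re._parser in newer Pythons

# Print the top-level (opcode, argument) pairs of a small anchored pattern.
# '^' parses as (AT, AT_BEGINNING), r'\b' as (AT, AT_BOUNDARY), '$' as (AT, AT_END).
for op, av in sre_parse.parse(r'^\blamb$'):
    print(op, av)
```

This is what the scoring loop in the revised listing below iterates over.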

Here is the revised listing:

import re
import sre_parse  # undocumented; a deprecated alias of re._parser in newer Pythons
from sre_constants import AT, AT_BEGINNING, AT_END, AT_BOUNDARY

def ld(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a, b = b, a
        n, m = m, n

    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

s = 'Mary had a little lamb'
d = {}
regs = [r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb', r'\b\w+mb',
        r'Mary.*little lamb', r'.*[lL]ittle [Ll]amb', r'\blittle\b', s, r'little',
        r'^.*lamb', r'.*.*.*b', r'.*?.*', r'.*\b[lL]ittle\b \b[Ll]amb',
        r'.*\blittle\b \blamb$', '^' + s + '$']

for reg in regs:
    m = re.search(reg, s)
    if m:
        ld1 = ld(reg, m.group(0))
        ld2 = ld(m.group(0), s)
        score = max(ld1, ld2)
        for t, v in sre_parse.parse(reg):
            if t == AT:                        # anchor...
                if v in (AT_BEGINNING, AT_END):
                    score -= 1                 # ^ or $ is 1 char, adjust 1 edit
                elif v == AT_BOUNDARY:         # all other anchors are 2 chars
                    score -= 2

        d[reg] = score
    else:
        print("'%s' does not match '%s'" % (reg, s))

print()
print("   ===== %s =====    === %s ===" % ('RegEx'.center(15), 'Score'.center(10)))

for key, value in sorted(d.items(), key=lambda kv: (kv[1], kv[0])):
    print("   %27s        %5s" % (key, value))

And the sorted regexes:

   =====      RegEx      =====    ===   Score    ===
        Mary had a little lamb            0
      ^Mary had a little lamb$            0
             Mary.*little lamb            7
          .*\blittle\b \blamb$            9
                 .*little lamb           11
                   little lamb           11
                    \blittle\b           12
     .*\b[lL]ittle\b \b[Ll]amb           13
           .*[lL]ittle [Ll]amb           15
                       \b\w+mb           16
                        little           16
                       ^.*lamb           17
                          Mary           18
                          lamb           18
                       .*.*.*b           21
                            .*           22
                         .*?.*           22

It depends on what kind of regular expressions you have; as @carrot-top suggests, if you actually aren't dealing with "regular expressions" in the CS sense, and instead have crazy extensions, then you are definitely out of luck.

However, if you do have traditional regular expressions, you might make a bit more progress. First, we could define what "more specific" means. Say R is a regular expression, and L(R) is the language generated by R. Then we might say R1 is more specific than R2 if L(R1) is a (strict) subset of L(R2) (L(R1) < L(R2)). That only gets us so far: in many cases, L(R1) is neither a subset nor a superset of L(R2), and so we might imagine that the two are somehow incomparable. As an example, when trying to match "mary had a little lamb", we might find two matching expressions: .*mary and lamb.*.

One non-ambiguous solution is to define specificity via implementation. For instance, convert your regular expression in a deterministic (implementation-defined) way to a DFA and simply count states. Unfortunately, this might be relatively opaque to a user.

Indeed, you seem to have an intuitive notion of how you want two regular expressions to compare, specificity-wise. Why not simply write down a definition of specificity, based on the syntax of regular expressions, that matches your intuition reasonably well?

Totally arbitrary rules follow:

  1. Characters = 1.
  2. Character ranges of n characters = n (and let's say \b = 5, because I'm not sure how you might choose to write it out long-hand).
  3. Anchors are 5 each.
  4. * divides its argument by 2.
  5. + divides its argument by 2, then adds 1.
  6. . = -10.
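A toy implementation of these rules (mine, not the answer author's) might look like the following; it only understands a tiny subset of regex syntax, so treat it strictly as a sketch:

```python
import re

def specificity(pattern):
    """Score a pattern by the arbitrary rules above (higher = more specific)."""
    # Tokenize into \b, [...] classes, other escapes, and single characters.
    tokens = re.findall(r'\\b|\[[^\]]*\]|\\.|.', pattern)
    score = 0.0
    last = 0.0  # contribution of the most recent non-quantifier token
    for tok in tokens:
        if tok == '*':
            score += last / 2 - last       # '*' halves its argument's contribution
            last = last / 2
        elif tok == '+':
            score += last / 2 + 1 - last   # '+' halves, then adds 1
            last = last / 2 + 1
        else:
            if tok in ('^', '$', r'\b'):
                last = 5.0                 # anchors are 5 each
            elif tok.startswith('['):
                last = float(len(tok) - 2) # class of n characters = n
            elif tok == '.':
                last = -10.0
            else:
                last = 1.0                 # plain character = 1
            score += last
    return score

print(specificity(r'Mary'))    # 4.0
print(specificity(r'.*'))      # -5.0
print(specificity(r'^Mary$'))  # 14.0
```

Nested groups and alternation are deliberately ignored; the point is only that a purely syntactic score is easy to compute and easy to explain to users.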

Anyway, this is just food for thought, as the other answers do a good job of outlining some of the issues you're facing; hope it helps.

I don't think it's possible.

An alternative would be to try to calculate the number of strings of length n that each regular expression matches. A regular expression that matches 1,000,000,000 strings of length 15 characters is less specific than one that matches only 10 strings of length 15 characters.

Of course, calculating the number of possible matches is not trivial unless the regular expressions are simple.
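For tiny alphabets and short lengths you can simply brute-force the count; this sketch (my addition) enumerates every string of length n over a toy alphabet:

```python
import re
from itertools import product

def count_matches(pattern, n, alphabet='ab'):
    """Count the strings of length n over `alphabet` the pattern fully matches."""
    rx = re.compile(pattern)
    return sum(1 for chars in product(alphabet, repeat=n)
               if rx.fullmatch(''.join(chars)))

print(count_matches(r'a.*', 3))  # 4: of the 8 length-3 strings, those starting with 'a'
print(count_matches(r'ab.', 3))  # 2: 'aba' and 'abb'
```

The enumeration blows up as len(alphabet)**n, which is exactly why the general computation is non-trivial; doing it efficiently requires converting the regex to an automaton and counting paths.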

Option 1:

Since users are supplying the regexes, perhaps ask them to also submit some test strings which they think are illustrative of their regex's specificity (i.e., ones that show their regex is more specific than a competitor's regex). Collect all the users' submitted test strings, and then test all the regexes against the complete set of test strings.

To design a good regex, the author must have put thought into what strings match and don't match their regex, so it should be easy for them to supply good test strings.


Option 2:

You might try a Monte Carlo approach: starting with a string that both regexes match, write a generator which produces mutations of that string (permute characters, add/remove characters, etc.). If both regexes match or fail to match in the same way for each mutation, then the regexes "probably tie". If one matches a mutation that the other doesn't, and vice versa, then they "absolutely don't tie".

But if one matches a strict superset of the mutations, then it is "probably less specific" than the other.

The verdict after a large number of mutations may not always be correct, but may be reasonable.
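Here is one way that mutation idea could be sketched; the function name, mutation operators, and alphabet are my own choices, not part of the answer:

```python
import random
import re

ALPHABET = 'abcdefghijklmnopqrstuvwxyz ./0123456789'

def compare_by_mutation(p1, p2, seed, trials=2000, rng=None):
    """Return the 'more specific' pattern, or None for a tie/incomparable pair."""
    rng = rng or random.Random(42)  # fixed seed keeps the sketch repeatable
    r1, r2 = re.compile(p1), re.compile(p2)
    only1 = only2 = 0
    for _ in range(trials):
        s = list(seed)
        op = rng.choice('rid')      # replace, insert, or delete one character
        if op == 'i':
            s.insert(rng.randrange(len(s) + 1), rng.choice(ALPHABET))
        elif op == 'r':
            s[rng.randrange(len(s))] = rng.choice(ALPHABET)
        else:
            del s[rng.randrange(len(s))]
        mutant = ''.join(s)
        m1, m2 = bool(r1.search(mutant)), bool(r2.search(mutant))
        only1 += m1 and not m2
        only2 += m2 and not m1
    if only1 and not only2:
        return p2                   # p2 accepted a strict subset of the mutants
    if only2 and not only1:
        return p1
    return None

print(compare_by_mutation(r'google\.com/maps', r'.*', 'google.com/maps'))
# -> google\.com/maps (the literal pattern survives fewer mutations, so it wins)
```

Since .* matches every mutant, the literal pattern can only ever match a subset, and the sampler detects that; for genuinely incomparable pairs it returns None.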


Option 3:

Use ipermute or pyParsing's invert to generate strings which match each regex. This will only work on regexes that use a limited subset of regex syntax.

I think you could do it by comparing the results of matching and taking the one with the longest match:

>>> m = re.match(r'google\.com\/maps','google.com/maps/hello')
>>> len(m.group(0))
15

>>> m = re.match(r'google\.com\/maps2','google.com/maps/hello')
>>> print (m)
None

>>> m = re.match(r'google\.com\/maps','google.com/maps2/hello')
>>> len(m.group(0))
15

>>> m = re.match(r'google\.com\/maps2','google.com/maps2/hello')
>>> len(m.group(0))
16
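Wrapping that heuristic into a helper (my sketch; the answer itself just compares the lengths by hand):

```python
import re

def most_specific(patterns, s):
    """Among patterns that match s, return the one with the longest match(0)."""
    best, best_len = None, -1
    for p in patterns:
        m = re.match(p, s)
        if m and len(m.group(0)) > best_len:
            best, best_len = p, len(m.group(0))
    return best

print(most_specific([r'google\.com\/maps', r'google\.com\/maps2', r'goo'],
                    'google.com/maps2/hello'))
# -> google\.com\/maps2 (16 matched characters beats 15 and 3)
```

Note this only ranks patterns by how much of the input they consume, which is a reasonable proxy for specificity for prefix-style patterns like these, but not a general definition.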
re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)

The second argument to re.match() above is a string; that's why it's not working. The regex says to match a period after google, but instead it finds a backslash. What you need to do is double up the backslashes in the regex that's being used as a regex:

def compare_regexes(regex1, regex2):
    """returns regex2 if regex1 is 'smaller' than regex2
    returns regex1 if they are the same
    returns regex1 if regex1 is 'bigger' than regex2
    otherwise returns None"""
    regex1_mod = regex1.replace('\\', '\\\\')
    regex2_mod = regex2.replace('\\', '\\\\')
    if regex1 == regex2:
        return regex1
    if re.match(regex1_mod, regex2):
        return regex2
    if re.match(regex2_mod, regex1):
        return regex1

You can change the return values to whatever suits your needs best. Oh, and make sure you are using raw strings with re: r'like this, for example'.

Is it possible to match 2 regular expressions in Python?

That certainly is possible. Use parenthesized match groups joined by | for alternation. If you arrange the parenthesized groups from the most specific regex to the least specific, the rank in the tuple returned from m.groups() will show how specific your match is. You can also use named groups to name how specific your match is, such as s10 for a very specific match and s0 for a not-so-specific match.

>>> s1='google.com/maps2text'
>>> s2='I forgot my goggles at the house'
>>> s3='blah blah blah'
>>> m1=re.match(r'(^google\.com\/maps\dtext$)|(.*go[a-z]+)',s1)
>>> m2=re.match(r'(^google\.com\/maps\dtext$)|(.*go[a-z]+)',s2)
>>> m1.groups()
('google.com/maps2text', None)
>>> m2.groups()
(None, 'I forgot my goggles')
>>> patt=re.compile(r'(?P<s10>^google\.com\/maps\dtext$)|(?P<s5>.*go[a-z]+)|(?P<s0>[a-z]+)')
>>> m3=patt.match(s3)
>>> m3.groups()
(None, None, 'blah')
>>> m3.groupdict()
{'s10': None, 's0': 'blah', 's5': None}
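Building on that, m.lastgroup gives you the name of the branch that matched, which makes reading off the specificity rank easy (a small sketch of mine):

```python
import re

# Alternation ordered most specific first; lastgroup names the branch that won.
patt = re.compile(r'(?P<s10>^google\.com\/maps\dtext$)|(?P<s5>.*go[a-z]+)|(?P<s0>[a-z]+)')

for s in ('google.com/maps2text', 'I forgot my goggles at the house', 'blah blah blah'):
    m = patt.search(s)
    print(m.lastgroup, '->', m.group(0))
```

Because re tries alternatives left to right, the most specific branch wins whenever it can match, and the group name doubles as the score.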

If you do not know ahead of time which regex is more specific, this is a much harder problem to solve. You want to have a look at this paper covering security of regex matches against file system names.

I realize that this is a non-solution, but as there is no unambiguous way to tell which is the "most specific match" (certainly when it depends on what your users "meant"), the easiest thing to do would be to ask them to provide their own priority, for example just by putting the regexes in the right order. Then you can simply take the first one that matches. If you expect the users to be comfortable with regular expressions anyway, this is maybe not too much to ask?

Note: the posts on this site follow the CC BY-SA 4.0 license; if you reproduce them, please credit the original source.

 