
Check logical concatenation of regular expressions

I have the following problem in Python, which I hope you can assist with.

The input is two regular expressions, and I have to check whether their concatenation can match any value. For example, if one accepts only strings longer than 10 characters and the other accepts strings of at most 5, then no value can ever pass both expressions.

Is there something in Python to solve this issue?

Thanks, Max.


There is nothing in Python that solves this directly.

That said, you can simulate a logical-and operation for two regexes by using lookahead assertions. There is a good explanation with examples at "Regular Expressions: Is there an AND operator?"

This will combine the regexes, but it won't show directly whether some string exists that satisfies the combined regex.
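A minimal sketch of the lookahead idea, assuming both patterns must match the whole string and that neither contains numbered backreferences or inline global flags. The helper name `both` is hypothetical and not part of the `re` module.

```python
import re

def both(pattern_a, pattern_b):
    """Compile a regex that accepts a string only if both patterns match all of it."""
    # (?=(?:A)\Z) checks, without consuming input, that A can match up to the end;
    # (?:B)\Z then requires B to match the whole string as well.
    return re.compile(r"(?=(?:%s)\Z)(?:%s)\Z" % (pattern_a, pattern_b))

combined = both(r"[a-z]+", r".{3,5}")
print(bool(combined.match("abcd")))  # lowercase letters, 3 to 5 characters long
print(bool(combined.match("ab")))    # too short for the second pattern
```

The combined pattern only tests one candidate string at a time; it does not say whether any accepted string exists.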

The brute-force algorithm below comes from here: Generating a list of values a regex COULD match in Python

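def all_matching_strings(alphabet, max_length, regex1, regex2):
    """Yield every string over 'alphabet' of length 1 to 'max_length' that matches both 'regex1' and 'regex2'."""
    # Note: match() anchors only at the start of the string; use fullmatch()
    # if the whole string has to match.
    if max_length == 0:
        return

    L = len(alphabet)
    for N in range(1, max_length + 1):
        # indices is a base-L counter that selects one symbol per position
        indices = [0] * N
        for z in range(L ** N):
            r = ''.join(alphabet[i] for i in indices)
            if regex1.match(r) and regex2.match(r):
                yield r

            # increment the counter, carrying into the next position on overflow
            i = 0
            indices[i] += 1
            while (i < N) and (indices[i] == L):
                indices[i] = 0
                i += 1
                if i < N:
                    indices[i] += 1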

Example usage for your situation (two regexes). You would also need to add every other possible symbol, whitespace character, etc. to the alphabet:

import re

alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'
# regex1_str and regex2_str are your two pattern strings
regex1 = re.compile(regex1_str)
regex2 = re.compile(regex2_str)
for r in all_matching_strings(alphabet, 5, regex1, regex2):
    print(r)
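If all you need is a yes/no answer, the search can stop at the first hit. The following is a sketch only, not the algorithm from the linked answer: it swaps the hand-rolled counter for `itertools.product`, uses `fullmatch()` so the whole candidate has to match, and also tries the empty string. The function name `first_common_match` is made up for illustration.

```python
import itertools
import re

def first_common_match(alphabet, max_length, pattern_a, pattern_b):
    """Return the first string (shortest first) fully matched by both patterns, or None."""
    regex_a = re.compile(pattern_a)
    regex_b = re.compile(pattern_b)
    for length in range(max_length + 1):
        # product() enumerates every combination of 'length' symbols
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if regex_a.fullmatch(candidate) and regex_b.fullmatch(candidate):
                return candidate
    return None

print(first_common_match("ab", 3, r"a+", r".{2,}"))
print(first_common_match("ab", 4, r".{3,}", r".{0,2}"))
```

A result of `None` only means no witness exists within the chosen alphabet and length bound; it is not a proof that the two patterns can never agree.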

That said, the runtime grows exponentially with the maximum length (L**N candidates for each length N), so you'll want to do whatever you can to speed it up. One suggestion on the answer I took the algorithm from was to filter the alphabet so that it only contains characters that are "possible" for the regex. So if you scan your regexes and only see [1-3] and [a-eA-E], with no ".", "\w", "\s", etc., then you can reduce the alphabet to 13 characters. There are lots of other little tricks you could implement as well.
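To get a feel for how much the alphabet size matters, here is a rough sketch that counts the candidates the brute-force loop would test. It assumes the two regexes were scanned by hand and only the classes [1-3] and [a-eA-E] were found; `candidate_count` is a hypothetical helper.

```python
import re

full_alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"

# Assumption: a manual scan of the regexes showed only these character classes.
allowed = re.compile(r"[1-3a-eA-E]")
reduced_alphabet = "".join(c for c in full_alphabet if allowed.fullmatch(c))

def candidate_count(alphabet_size, max_length):
    """Number of strings of length 1..max_length that the brute-force loop tests."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))

print(len(full_alphabet), candidate_count(len(full_alphabet), 5))
print(len(reduced_alphabet), candidate_count(len(reduced_alphabet), 5))
```

For a maximum length of 5, the count drops from roughly 931 million candidates to about 402 thousand.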

I highly doubt that something like this is implemented, or even that there is a way to compute it efficiently.

One approximate approach that comes to mind, which detects only the most obvious conflicts, would be to generate a random string conforming to each of the regexes, and then check whether the concatenation of the regexes matches the concatenation of the generated strings.

Something like:

import re
import rstr  # third-party package

# r1 and r2 are your two pattern strings
s1 = rstr.xeger(r1)
s2 = rstr.xeger(r2)
print(re.match(r1 + r2, s1 + s2))

That said, I can't really think of a way for this to fail. In my opinion, for your example, where r1 matches strings with more than 10 chars and r2 matches strings shorter than 5 chars, the concatenation of the two would match strings whose first part is longer than 10 chars, followed by a tail of fewer than 5.
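The point can be checked with the standard library alone. In this sketch the patterns and sample strings are hand-picked stand-ins for the example (assumptions, not the asker's actual regexes), and the samples replace the randomly generated ones.

```python
import re

r1 = r".{11,}"   # assumed stand-in for "more than 10 characters"
r2 = r".{0,4}"   # assumed stand-in for "shorter than 5 characters"
s1 = "a" * 11    # hand-picked string that matches r1
s2 = "bbb"       # hand-picked string that matches r2

# Wrap each pattern in a non-capturing group so that a top-level "|" in one
# pattern cannot swallow the other.
concatenated = re.compile("(?:%s)(?:%s)" % (r1, r2))
result = concatenated.fullmatch(s1 + s2)
print(result is not None)
```

Under this reading, concatenating the two patterns accepts the joined string, which is different from requiring a single string to satisfy both patterns at once.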
