简体   繁体   中英

Check logical concatenation of regular expressions

I have the following problem in python, which I hope you can assist with.

The input is 2 regular expressions, and I have to check if their concatenation can have values. For example if one says take strings with length greater than 10 and the other says at most 5, than no value can ever pass both expressions.

Is there something in python to solve this issue?

Thanks, Max.

Is there something in python to solve this issue?

There is nothing in Python that solves this directly.

That said, you can simulate a logical-and operation for two regexes by using lookahead assertions. There is a good explanation with examples at Regular Expressions: Is there an AND operator?

This will combine the regexes but won't show directly whether some string exists that satisfies the combined regex.

Getting this brute force algorithm from here: Generating a list of values a regex COULD match in Python

def all_matching_strings(alphabet, max_length, regex1, regex2):
"""Find the list of all strings over 'alphabet' of length up to 'max_length' that match 'regex'"""

if max_length == 0: return 

L = len(alphabet)
for N in range(1, max_length+1):
    indices = [0]*N
    for z in xrange(L**N):
        r = ''.join(alphabet[i] for i in indices)
        if regex1.match(r) and regex2.match(r):                
           yield(r)

        i = 0
        indices[i] += 1
        while (i<N) and (indices[i]==L):
            indices[i] = 0
            i += 1
            if i<N: indices[i] += 1

return

example usage, for your situation (two regexes)... you'd need to add all possible symbols/whitespaces/etc to that alphabet also...:

alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'
import re
regex1 = re.compile(regex1_str)
regex2 = re.compile(regex1_str)
for r in all_matching_strings(alphabet, 5, regex1, regex2): 
    print r

That said, the runtime on this is super-crazy and you'll want to do whatever you can to speed it up. One suggestion on the answer I swiped the algorithm from was to filter the alphabet to only have characters that are "possible" for the regex. So if you scan your regex and you only see [1-3] and [a-eA-E], with no ".", "\\w", "\\s", etc, then you can reduce the size of the alphabet to 13 length. Lots of other little tricks you could implement as well.

I highly doubt that something like this is implemented and even that there is a way to efficiently compute it.

One approximative way that comes to my mind now, that detects the most obvious conflicts would be to generate a random string conforming to each of the regexes, and then check if the concatenation of the regexes matches the concatenation of the generated strings.

Something like:

import re, rstr
s1 = rstr.xeger(r1)
s2 = rstr.xeger(r2)
print re.match(r1 + r2, s1 + s2)

Although I can't really think of a way for this to fail. In my opinion, for your example, where r1 matches strings with more than 10 chars, r2 matches strings shorter than 5 chars, then the sum of the two would yield strings with the first part longer than 10 and a tail of less than 5.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM