Determining whether a string is between two other strings alphabetically

I have two lists. The first is a list of strings; the second is a list of tuples of strings. Given a string s from the first list, I want to find all the pairs in the second list that s falls between alphabetically. A concrete example:

s = "QZ123DEF"

("QZ123ABC", "QZ125ZEQ") # would return as a positive match
("QF12", "QY22") # would not return as a positive match

I thought of a brute-force approach: check whether s is greater than the first string and less than the second for every tuple in the second list. I would like to know whether there is a better way. I'm using Python.

Here's one way using the bisect module. It requires S to be sorted first:

import bisect
import pprint
S = ['b', 'd', 'j', 'n', 's']
pairs = [('a', 'c'), ('a', 'e'), ('a', 'z')]

output = {}

for a, b in pairs:

    # `a_ind` and `b_ind` are the positions where `a` and `b` would be
    # inserted into the sorted list `S`. The slice between them holds the
    # items of `S` that lie between `a` and `b` (inclusive).

    a_ind = bisect.bisect_left(S, a)
    b_ind = bisect.bisect_right(S, b)

    for x in S[a_ind : b_ind]:
        output.setdefault(x, []).append((a, b))

pprint.pprint(output)

Output:

{'b': [('a', 'c'), ('a', 'e'), ('a', 'z')],
 'd': [('a', 'e'), ('a', 'z')],
 'j': [('a', 'z')],
 'n': [('a', 'z')],
 's': [('a', 'z')]}

Compared with the brute-force method on random data, this is about 2-3 times faster:

import bisect
import random
import string

def solve(S, pairs):
    # Sort once so each pair's range can be located by binary search.
    S.sort()
    output = {}
    for a, b in pairs:
        a_ind = bisect.bisect_left(S, a)
        b_ind = bisect.bisect_right(S, b)
        for x in S[a_ind:b_ind]:
            output.setdefault(x, []).append((a, b))
    return output

def brute_force(S, pairs):
    # Compare every string against every pair.
    output = {}
    for s in S:
        for a, b in pairs:
            if a <= s <= b:
                output.setdefault(s, []).append((a, b))
    return output

def get_word():
    # Returns a single random ASCII letter.
    return random.choice(string.ascii_letters)

S = [get_word() for _ in range(10000)]
pairs = [sorted((get_word(), get_word())) for _ in range(1000)]

Timing comparison:

In [1]: %timeit brute_force(S, pairs)
1 loops, best of 3: 10.2 s per loop

In [2]: %timeit solve(S, pairs)
1 loops, best of 3: 3.94 s per loop
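To rerun the comparison outside IPython, something like the following could be used. It is only a sketch: it targets Python 3, the list sizes are smaller than those above, and the get_word(length) helper that builds multi-letter words is an assumption rather than part of the original benchmark. No timings are quoted here because they depend on the machine and the data.

```python
import bisect
import random
import string
import timeit

def solve(S, pairs):
    S = sorted(S)  # sort a copy so the caller's list is left untouched
    output = {}
    for a, b in pairs:
        a_ind = bisect.bisect_left(S, a)
        b_ind = bisect.bisect_right(S, b)
        for x in S[a_ind:b_ind]:
            output.setdefault(x, []).append((a, b))
    return output

def brute_force(S, pairs):
    output = {}
    for s in S:
        for a, b in pairs:
            if a <= s <= b:
                output.setdefault(s, []).append((a, b))
    return output

def get_word(length=4):
    # Hypothetical helper: a random word of `length` ASCII letters.
    return ''.join(random.choice(string.ascii_letters) for _ in range(length))

def normalized(output):
    # Order-insensitive form, so the two results can be compared.
    return {key: sorted(matches) for key, matches in output.items()}

random.seed(0)  # fixed seed, only to make runs repeatable
S = [get_word() for _ in range(2000)]
pairs = [tuple(sorted((get_word(), get_word()))) for _ in range(200)]

# Both approaches should agree on which pairs enclose which strings.
assert normalized(solve(S, pairs)) == normalized(brute_force(S, pairs))

for func in (brute_force, solve):
    seconds = timeit.timeit(lambda: func(S, pairs), number=3)
    print(func.__name__, round(seconds, 3))
```

The equality check before timing is there so that a speed-up is not mistaken for a difference in results.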
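
def between(pair, val):
    # True if val lies between the two strings of the pair (inclusive).
    low, high = pair
    return low <= val <= high

s = "QZ123DEF"
# The example pairs from the question.
my_list_tuples = [("QZ123ABC", "QZ125ZEQ"), ("QF12", "QY22")]
print([tup for tup in my_list_tuples if between(tup, s)])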

Maybe, but it's still brute force.

Assuming there are only two entries in each tuple, you can use a short list comprehension:

>>> s = "QZ123DEF"
>>> testList = [("QZ123ABC", "QZ125ZEQ"), ("QF12", "QY22")]
>>> [test[0] <= s <= test[1] for test in testList]
[True, False]

This can be extended to a list of strings, with the results stored in a dict:

>>> S = ["QZ123DEF", "QG42"]
>>> {s: [test[0] <= s <= test[1] for test in testList] for s in S}
{'QZ123DEF': [True, False], 'QG42': [False, True]}
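
The question asks for the matching pairs rather than a list of booleans. A small variation of the comprehension, sketched here for Python 3 with the same testList and S, keeps the pairs themselves:

```python
testList = [("QZ123ABC", "QZ125ZEQ"), ("QF12", "QY22")]
S = ["QZ123DEF", "QG42"]

# For each string, keep only the pairs (low, high) with low <= s <= high.
matches = {s: [(low, high) for low, high in testList if low <= s <= high]
           for s in S}
print(matches)
# {'QZ123DEF': [('QZ123ABC', 'QZ125ZEQ')], 'QG42': [('QF12', 'QY22')]}
```

This is still a comparison of every string against every pair, so it has the same cost as the brute-force approach.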

I don't know whether this counts as brute force or not, but the following code works:

def foo(s,a,b):
    if s<=a and s>=b:
        return True
    if s>=a and s<=b:
        return True
    return False


print(foo("QZ123DEF", "QZ123ABC", "QZ125ZEQ"))  # True
print(foo("QZ123DEF", "QF12", "QY22"))  # False

If the number of pairs is large and the number of searches is also considerable, the following algorithm may be advantageous. (I regret not having had the time for any comparisons yet.)

This algorithm copies all strings from the second list into a table, strpos, where each entry holds: a) a string, and b) the index of its pair in the original list, made negative ("flagged") for the second string of each pair. The table is then sorted by the string component.

Then, for a string s from the first list, find the smallest entry in strpos whose string is greater than or equal to s.

Finally, collect all indices from that entry onward to the end of the table, remembering positive indices and skipping their negative counterparts. The negative indices that remain, those whose positive counterpart was not seen, identify all pairs enclosing string s.

Dump of a strpos table:

AAA at 1
BBB at 2
CCC at -1
FFF at -2
HHH at 3
LLL at -3
NNN at 4
ZZZ at -4

Results for four strings:

for ABC found AAA - CCC
for XYZ found NNN - ZZZ
for IJK found HHH - LLL
for HHH found HHH - LLL
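
The steps described above might be sketched in Python 3 as follows. The function names build_strpos and enclosing_pairs are hypothetical. The tie-breaking rule is also an assumption, added so that boundaries are inclusive as in the HHH result: a first string sorts before a second string with the same text, and the scan starts at the first second-string equal to s.

```python
import bisect

def build_strpos(pairs):
    # One entry per string: (text, signed index). Pair numbers start at 1;
    # the second string of each pair is flagged with a negative index.
    table = []
    for i, (low, high) in enumerate(pairs, start=1):
        table.append((low, i))
        table.append((high, -i))
    # Sort by text; on ties, first strings come before second strings.
    table.sort(key=lambda entry: (entry[0], entry[1] < 0))
    return table

def enclosing_pairs(strpos, pairs, s):
    # Keys mirror the sort order used in build_strpos.
    keys = [(text, idx < 0) for text, idx in strpos]
    # Start of the scan: skips every entry with text < s and every
    # first string equal to s, but keeps second strings equal to s.
    cut = bisect.bisect_left(keys, (s, True))
    started = set()   # pairs whose first string lies after s
    found = []
    for text, idx in strpos[cut:]:
        if idx > 0:
            started.add(idx)
        elif -idx not in started:
            # Second string is >= s and first string is <= s.
            found.append(pairs[-idx - 1])
    return found

pairs = [("AAA", "CCC"), ("BBB", "FFF"), ("HHH", "LLL"), ("NNN", "ZZZ")]
strpos = build_strpos(pairs)
for s in ("ABC", "XYZ", "IJK", "HHH"):
    print(s, enclosing_pairs(strpos, pairs, s))
```

With these four pairs the table matches the dump above. The scan after the binary search still walks the rest of the table, so each lookup remains linear in the worst case.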
