
Find substrings in a set of strings

I have a large (50k-100k) set of strings mystrings. Some of the strings in mystrings may be exact substrings of others, and I would like to collapse these (discard the substring and only keep the longest). Right now I'm using a naive method, which has O(N^2) complexity.

unique_strings = set()
for s in sorted(mystrings, key=len, reverse=True):
    keep = True
    for us in unique_strings:
        if s in us:
            keep = False
            break
    if keep:
        unique_strings.add(s)

Which data structures or algorithms would make this task easier and not require O(N^2) operations? Libraries are OK, but I need to stay pure Python.

Finding a substring in a set():

name = set()
name.add('Victoria Stuart')                         ## add single element
name.update(('Carmine Wilson', 'Jazz', 'Georgio'))  ## add multiple elements
name
{'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}

me = 'Victoria'
if str(name).find(me):
    print('{} in {}'.format(me, name))
# Victoria in {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}

That's pretty easy -- but somewhat problematic if you want to return the matching string:

for item in name:
    if item.find(me):
        print(item)
'''
Jazz
Georgio
Carmine Wilson
'''

print(str(name).find(me))
# 39    ## character offset for match (i.e., not a string)

As you can see, the loop above prints every item except the one we want (the matching string): str.find() returns -1 (which is truthy) when there is no match, but 0 (which is falsy) when the match starts at the beginning of the string, so the condition is inverted for exactly the item we're after.
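
For reference, the test behaves as intended if you check membership directly (or compare find() against -1):

for item in name:
    if me in item:            # or: item.find(me) != -1
        print(item)
# Victoria Stuart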

It's probably better and easier to use regex (regular expressions):

import re

for item in name:
    if re.match(me, item):
        full_name = item
        print(item)
# Victoria Stuart
print(full_name)
# Victoria Stuart

for item in name:
    if re.search(me, item):
        print(item)
# Victoria Stuart

From the Python docs:

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string ...
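
For example, matching 'Stuart' against 'Victoria Stuart' illustrates the difference:

import re

s = 'Victoria Stuart'
print(re.match('Stuart', s))   # None -- 'Stuart' is not at the start of s
print(re.search('Stuart', s))  # a match object, span=(9, 15) -- found anywhere in s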

A naive approach:

1. sort strings by length, longest first  # `O(N*log_N)`
2. foreach string:  # O(N)
    3. insert each suffix into tree structure: first letter -> root, and so on.  
       # O(L) or O(L^2) depending on string slice implementation, L: string length
    4. if inserting the entire string (the longest suffix) creates a new 
       leaf node, keep it!

O[N*(log_N + L)]  or  O[N*(log_N + L^2)]

This is probably far from optimal, but should be significantly better than O(N^2) for large N (number of strings) and small L (average string length).
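
A rough pure-Python sketch of this idea, using nested dicts as the tree and checking whether the whole string already spells a path before inserting its suffixes (the helper names are mine, not part of the answer):

def collapse_substrings(mystrings):
    root = {}                      # nested dicts: char -> child node

    def is_path(s):
        # s spells a path from the root  <=>  s is a prefix of some stored
        # suffix  <=>  s is a substring of an already-kept string
        node = root
        for ch in s:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def insert_suffix(suffix):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})

    kept = set()
    for s in sorted(mystrings, key=len, reverse=True):
        if not is_path(s):
            kept.add(s)
            for i in range(len(s)):
                insert_suffix(s[i:])
    return kept

print(collapse_substrings({'abcde', 'bcd', 'xyz', 'z'}))
# {'abcde', 'xyz'}   (set order may vary)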

You could also iterate through the strings in descending order by length, add all substrings of each string to a set, and only keep those strings that are not already in the set. The algorithmic big O should be the same as for the worst case above (O[N*(log_N + L^2)]), but the implementation is much simpler:

seen_strings, keep_strings = set(), set()
for s in sorted(mystrings, key=len, reverse=True):
    if s not in seen_strings:
        keep_strings.add(s)
        l = len(s)
        # record every substring of s (including its suffixes), so that any
        # shorter string contained in s is recognised later
        for start in range(l):
            for end in range(start + 1, l + 1):
                seen_strings.add(s[start:end])

In the meantime, I came up with this approach.

from Bio.trie import trie

unique_strings = set()
suffix_tree = trie()
for s in sorted(mystrings, key=len, reverse=True):
    # s is a substring of a kept string iff it is a prefix of a stored suffix
    if suffix_tree.with_prefix(s) == []:
        unique_strings.add(s)
        for i in range(len(s)):
            suffix_tree[s[i:]] = 1

The good: ≈15 minutes --> ≈20 seconds for the data set I was working with. The bad: it introduces Biopython as a dependency, which is neither lightweight nor pure Python (as I originally asked).

You can presort the strings and create a dictionary that maps each string to its position in the sorted list. Then you can loop over the list of strings (O(N)) and over the suffixes of each string (O(L)), and set to None those entries whose suffix exists in the position dict (O(1) dict lookup and O(1) list update). In total this has O(N*L) complexity, where L is the average string length. Note that an exact lookup of s[k:] only catches strings that are suffixes of a longer string; to also discard strings that occur somewhere in the middle, you would have to look up every slice s[j:k], which brings the cost back up to O(N*L^2).

strings = sorted(mystrings, key=len, reverse=True)
index_map = {s: i for i, s in enumerate(strings)}
unique = set()
for i, s in enumerate(strings):
    if s is None:  # already eliminated as the suffix of a longer string
        continue
    unique.add(s)
    for k in range(1, len(s)):
        try:
            index = index_map[s[k:]]
        except KeyError:
            pass
        else:
            if strings[index] is None:
                # this suffix was already eliminated, and so were all of its
                # own suffixes, so there is nothing left to mark
                break
            strings[index] = None

Testing on the following sample data gives a speedup factor of about 21:

import random
from string import ascii_lowercase

mystrings = [''.join(random.choices(ascii_lowercase, k=random.randint(1, 10)))
             for __ in range(1000)]
mystrings = set(mystrings)
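
For reference, a rough timing harness along these lines; collapse_naive and collapse_indexed are just the question's snippet and the snippet above packaged into functions (the names are mine), and it reuses the mystrings set generated above:

import time

def collapse_naive(mystrings):
    # the O(N^2) approach from the question
    unique_strings = set()
    for s in sorted(mystrings, key=len, reverse=True):
        if not any(s in us for us in unique_strings):
            unique_strings.add(s)
    return unique_strings

def collapse_indexed(mystrings):
    # the position-dict approach from above
    strings = sorted(mystrings, key=len, reverse=True)
    index_map = {s: i for i, s in enumerate(strings)}
    unique = set()
    for s in strings:
        if s is None:
            continue
        unique.add(s)
        for k in range(1, len(s)):
            index = index_map.get(s[k:])
            if index is not None:
                if strings[index] is None:
                    break
                strings[index] = None
    return unique

for collapse in (collapse_naive, collapse_indexed):
    start = time.perf_counter()
    collapse(mystrings)   # mystrings as generated above
    print(collapse.__name__, time.perf_counter() - start)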
