Find substrings in a set of strings
I have a large (50k-100k) set of strings mystrings. Some of the strings in mystrings may be exact substrings of others, and I would like to collapse these (discard the substring and only keep the longest). Right now I'm using a naive method, which has O(N^2) complexity.
unique_strings = set()
for s in sorted(mystrings, key=len, reverse=True):
    keep = True
    for us in unique_strings:
        if s in us:
            keep = False
            break
    if keep:
        unique_strings.add(s)
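For example, on a toy input (the sample data here is mine, just to show the intended behaviour):

```python
mystrings = {'foobar', 'foo', 'bar', 'baz'}

unique_strings = set()
for s in sorted(mystrings, key=len, reverse=True):
    # keep s only if it is not a substring of an already-kept (longer) string
    if not any(s in us for us in unique_strings):
        unique_strings.add(s)

# 'foo' and 'bar' are substrings of 'foobar' and get collapsed
assert unique_strings == {'foobar', 'baz'}
```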
Which data structures or algorithms would make this task easier and not require O(N^2) operations? Libraries are ok, but I need to stay pure Python.
Finding a substring in a set():
name = set()
name.add('Victoria Stuart') ## add single element
name.update(('Carmine Wilson', 'Jazz', 'Georgio')) ## add multiple elements
name
{'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
me = 'Victoria'
if str(name).find(me):
    print('{} in {}'.format(me, name))
# Victoria in {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
That's pretty easy -- but somewhat problematic if you want to return the matching string:
for item in name:
    if item.find(me):
        print(item)
'''
Jazz
Georgio
Carmine Wilson
'''
print(str(name).find(me))
# 39 ## character offset for match (i.e., not a string)
As you can see, the loop above prints everything except the item we want: str.find() returns -1 (which is truthy) for the strings that do not contain the match, and 0 (which is falsy) for 'Victoria Stuart', where the match begins at the first character -- so the matching string is exactly the one item that never gets printed.
It's probably better, and easier, to use regular expressions (regex):
import re
for item in name:
    if re.match(me, item):
        full_name = item
        print(item)
# Victoria Stuart
print(full_name)
# Victoria Stuart
for item in name:
    if re.search(me, item):
        print(item)
# Victoria Stuart
From the Python docs:
search() vs. match()
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string ...
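A minimal demonstration of that difference (the example strings are mine):

```python
import re

# re.match() anchors at the beginning of the string ...
assert re.match('Victoria', 'Victoria Stuart') is not None
assert re.match('Stuart', 'Victoria Stuart') is None

# ... while re.search() finds the pattern anywhere in the string
assert re.search('Stuart', 'Victoria Stuart') is not None
```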
A naive approach:
1. sort strings by length, longest first        # O(N*log_N)
2. foreach string:                              # O(N)
3.     insert each suffix into a tree structure: first letter -> root, and so on
       # O(L) or O(L^2) depending on string slice implementation, L: string length
4.     if inserting the entire string (the longest suffix) creates a new
       leaf node, keep it!

Overall: O[N*(log_N + L)] or O[N*(log_N + L^2)]
This is probably far from optimal, but should be significantly better than O(N^2) for large N (number of strings) and small L (average string length).
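One way to sketch the outline above in pure Python is a dict-of-dicts trie holding the suffixes of every kept string: a new string is a substring of a kept one exactly when it is a prefix of some suffix already in the trie. This is only a sketch of that idea; the function and helper names are mine.

```python
def collapse_substrings(mystrings):
    root = {}  # dict-of-dicts suffix trie over all kept strings

    def is_substring_of_kept(s):
        # s is a substring of a kept string iff s is a prefix
        # of some suffix already stored in the trie
        node = root
        for ch in s:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def insert_suffixes(s):
        # register every suffix of s as a path from the root
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})

    kept = set()
    for s in sorted(mystrings, key=len, reverse=True):  # longest first
        if not is_substring_of_kept(s):
            kept.add(s)
            insert_suffixes(s)
    return kept

# 'abcd' and 'bc' are substrings of 'abcdx' and get collapsed
assert collapse_substrings({'abcdx', 'abcd', 'bc', 'xy'}) == {'abcdx', 'xy'}
```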
You could also iterate through the strings in descending order by length and add all substrings of each string to a set, and only keep those strings that are not in the set. The algorithmic big O should be the same as for the worst case above (O[N*(log_N + L^2)]), but the implementation is much simpler:
seen_strings, keep_strings = set(), set()
for s in sorted(mystrings, key=len, reverse=True):
    if s not in seen_strings:
        keep_strings.add(s)
        l = len(s)
        # end must run up to l (inclusive slice end) so that suffixes are registered too
        for start in range(l):
            for end in range(start + 1, l + 1):
                seen_strings.add(s[start:end])
In the meantime I came up with this approach.
from Bio.trie import trie

unique_strings = set()
suffix_tree = trie()
for s in sorted(mystrings, key=len, reverse=True):
    if suffix_tree.with_prefix(s) == []:
        unique_strings.add(s)
        for i in range(len(s)):
            suffix_tree[s[i:]] = 1
The good: ≈15 minutes --> ≈20 seconds for the data set I was working with. The bad: introduces biopython as a dependency, which is neither lightweight nor pure Python (as I originally asked).
You can presort the strings and create a dictionary that maps strings to positions in the sorted list. Then you can loop over the list of strings (O(N)) and suffixes (O(L)) and set those entries to None that exist in the position-dict (O(1) dict lookup and O(1) list update). So in total this has O(N*L) complexity, where L is the average string length.
strings = sorted(mystrings, key=len, reverse=True)
index_map = {s: i for i, s in enumerate(strings)}
unique = set()
for i, s in enumerate(strings):
    if s is None:
        continue
    unique.add(s)
    for k in range(1, len(s)):
        try:
            index = index_map[s[k:]]
        except KeyError:
            pass
        else:
            if strings[index] is None:
                break
            strings[index] = None
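For instance, on a small input where the shorter strings are suffixes of a longer one (the toy data is mine; the loop below is the same algorithm, slightly condensed with dict.get):

```python
mystrings = {'abcde', 'cde', 'de', 'xyz'}

strings = sorted(mystrings, key=len, reverse=True)
index_map = {s: i for i, s in enumerate(strings)}
unique = set()
for s in strings:
    if s is None:
        continue  # already eliminated as a suffix of a longer string
    unique.add(s)
    for k in range(1, len(s)):  # check every proper suffix of s
        index = index_map.get(s[k:])
        if index is not None:
            if strings[index] is None:
                break  # this suffix (and hence all shorter ones) was already handled
            strings[index] = None

# 'cde' and 'de' are suffixes of 'abcde' and are eliminated
assert unique == {'abcde', 'xyz'}
```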
Testing on the following sample data gives a speedup factor of about 21:
import random
from string import ascii_lowercase

mystrings = [''.join(random.choices(ascii_lowercase, k=random.randint(1, 10)))
             for __ in range(1000)]
mystrings = set(mystrings)