Find substrings in a set of strings

I have a large (50k-100k) set of strings mystrings. Some of the strings in mystrings may be exact substrings of others, and I would like to collapse these (discard the substring and only keep the longest). Right now I'm using a naive method, which has O(N^2) complexity.

unique_strings = set()
for s in sorted(mystrings, key=len, reverse=True):
    keep = True
    for us in unique_strings:
        if s in us:
            keep = False
            break
    if keep:
        unique_strings.add(s)

Which data structures or algorithms would make this task easier and not require O(N^2) operations? Libraries are OK, but I need to stay pure Python.

Finding a substring in a set():

name = set()
name.add('Victoria Stuart')                         ## add single element
name.update(('Carmine Wilson', 'Jazz', 'Georgio'))  ## add multiple elements
name
{'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}

me = 'Victoria'
if me in str(name):  ## substring test; str.find() would return -1 (truthy) on a miss
    print('{} in {}'.format(me, name))
# Victoria in {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}

That's pretty easy, but somewhat problematic if you want to return the matching string:

for item in name:
    if item.find(me):
        print(item)
'''
Jazz
Georgio
Carmine Wilson
'''

print(str(name).find(me))
# 39    ## character offset for match (i.e., not a string)

As you can see, the loop above prints every item except the one we want (the matching string). str.find() returns the offset of the match, or -1 when there is none; -1 is truthy, while the offset 0 returned for 'Victoria Stuart' is falsy, so the test is effectively inverted.
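Since str.find() returns an offset (or -1 when nothing matches), it is awkward to use as a condition. A minimal sketch of returning the matching strings with the in operator instead; the helper name find_matches is hypothetical and not part of the original answer:

```python
def find_matches(needle, haystack):
    """Return every string in `haystack` that contains `needle` (hypothetical helper)."""
    # `in` yields a real boolean, unlike str.find(), which returns an offset or -1.
    return [item for item in haystack if needle in item]


name = {'Victoria Stuart', 'Carmine Wilson', 'Jazz', 'Georgio'}
matches = find_matches('Victoria', name)
print(matches)
```

This scans every item once, so it is a linear search over the set rather than a set lookup.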

It's probably better and easier to use regex (regular expressions):

import re

for item in name:
    if re.match(me, item):
        full_name = item
        print(item)
# Victoria Stuart
print(full_name)
# Victoria Stuart

for item in name:
    if re.search(me, item):
        print(item)
# Victoria Stuart

From the Python docs:

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string ...
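A small illustration of that difference, using a pattern that occurs in the middle of the string rather than at its start. The use of re.escape() is an addition here, assuming the search text should be treated as literal characters and not as a pattern:

```python
import re

me = 'Stuart'
item = 'Victoria Stuart'

# re.match() only tries the pattern at position 0, so a mid-string hit is missed.
at_start = re.match(re.escape(me), item)
# re.search() scans the whole string and finds the same pattern.
anywhere = re.search(re.escape(me), item)

print(at_start)
print(anywhere.group())
```

For plain substring tests like the ones in the question, re.search() is the one that corresponds to the in operator.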

A naive approach:

1. sort strings by length, longest first  # `O(N*log_N)`
2. foreach string:  # O(N)
    3. insert each suffix into tree structure: first letter -> root, and so on.  
       # O(L) or O(L^2) depending on string slice implementation, L: string length
    4. if inserting the entire string (the longest suffix) creates a new 
       leaf node, keep it!

O[N*(log_N + L)]  or  O[N*(log_N + L^2)]

This is probably far from optimal, but should be significantly better than O(N^2) for large N (number of strings) and small L (average string length).
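The steps above could be sketched in pure Python roughly as follows. This is only one possible reading of the outline: it uses nested dicts as the tree, and the function names are hypothetical. It tests each string against the tree before inserting its suffixes, which is equivalent to checking whether inserting the full string would create a new node.

```python
def _is_in_trie(root, s):
    """Return True if `s` can be walked from the root, character by character."""
    node = root
    for ch in s:
        node = node.get(ch)
        if node is None:
            return False
    return True


def collapse_with_suffix_trie(strings):
    """Keep only strings that are not substrings of a longer kept string."""
    root = {}   # nested-dict trie holding every suffix of every kept string
    kept = []
    for s in sorted(strings, key=len, reverse=True):
        # s is a substring of a kept string exactly when it is a prefix of
        # one of that string's suffixes, i.e. when it can be walked in the trie.
        if kept and _is_in_trie(root, s):
            continue
        kept.append(s)
        # Insert every suffix of s: O(L^2) character steps per kept string.
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {})
    return kept


print(collapse_with_suffix_trie(['abc', 'b', 'bc', 'xyz', 'abcd', 'yz', 'q']))
```

The trie can grow to roughly L^2 nodes per kept string in the worst case, so memory use is the main cost of this sketch.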

You could also iterate through the strings in descending order by length and add all substrings of each string to a set, and only keep those strings that are not in the set. The algorithmic big O should be the same as for the worse case above (O[N*(log_N + L^2)]), but the implementation is much simpler:

seen_strings, keep_strings = set(), set()
for s in sorted(mystrings, key=len, reverse=True):
    if s not in seen_strings:
        keep_strings.add(s)
        l = len(s)
        # Record every non-empty substring, including s itself and its suffixes.
        for start in range(l):
            for end in range(start + 1, l + 1):
                seen_strings.add(s[start:end])

In the meantime I came up with this approach.

from Bio.trie import trie
unique_strings = set()
suffix_tree = trie()
for s in sorted(mystrings, key=len, reverse=True):
    if suffix_tree.with_prefix(s) == []:  # no stored suffix starts with s
        unique_strings.add(s)
        for i in range(len(s)):
            suffix_tree[s[i:]] = 1

The good: ≈15 minutes --> ≈20 seconds for the data set I was working with. The bad: introduces biopython as a dependency, which is neither lightweight nor pure Python (as I originally asked).

You can presort the strings and create a dictionary that maps strings to positions in the sorted list. Then you can loop over the list of strings (O(N)) and their suffixes (O(L)) and set those entries to None that exist in the position dict (O(1) dict lookup and O(1) list update). So in total this has O(N*L) complexity, where L is the average string length. Note that only suffixes are looked up, so this removes strings that are suffixes of a longer string, not substrings at arbitrary positions.

strings = sorted(mystrings, key=len, reverse=True)
index_map = {s: i for i, s in enumerate(strings)}
unique = set()
for i, s in enumerate(strings):
    if s is None:
        continue
    unique.add(s)
    for k in range(1, len(s)):
        try:
            index = index_map[s[k:]]
        except KeyError:
            pass
        else:
            if strings[index] is None:
                break
            strings[index] = None

Testing on the following sample data gives a speedup factor of about 21:

import random
from string import ascii_lowercase

mystrings = [''.join(random.choices(ascii_lowercase, k=random.randint(1, 10)))
             for __ in range(1000)]
mystrings = set(mystrings)
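One way to sanity-check a faster approach against the naive loop on data like this is sketched below. The function names, the fixed seed, and the use of time.perf_counter() are assumptions made for illustration. The second function records all substrings of each kept string, with slice bounds chosen so that suffixes and the full string are included. Timings will vary by machine, so none are quoted here.

```python
import random
import time
from string import ascii_lowercase


def collapse_naive(strings):
    """The quadratic loop from the question."""
    unique = set()
    for s in sorted(strings, key=len, reverse=True):
        if not any(s in u for u in unique):
            unique.add(s)
    return unique


def collapse_with_substring_set(strings):
    """Record every substring of each kept string (hypothetical wrapper)."""
    seen, keep = set(), set()
    for s in sorted(strings, key=len, reverse=True):
        if s not in seen:
            keep.add(s)
            for start in range(len(s)):
                for end in range(start + 1, len(s) + 1):
                    seen.add(s[start:end])
    return keep


random.seed(0)  # fixed seed so repeated runs use the same data
sample = {''.join(random.choices(ascii_lowercase, k=random.randint(1, 10)))
          for _ in range(1000)}

t0 = time.perf_counter()
expected = collapse_naive(sample)
t1 = time.perf_counter()
actual = collapse_with_substring_set(sample)
t2 = time.perf_counter()

print('same result:', actual == expected)
print('naive: {:.4f}s, substring set: {:.4f}s'.format(t1 - t0, t2 - t1))
```

Comparing the two result sets before comparing timings guards against a fast method that silently keeps or drops the wrong strings.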
