
Algorithmic way to search a list of tuples for a matching substring?

I have a list of tuples, about 100k entries. Each tuple consists of an id and a string; my goal is to list the ids of the tuples whose strings contain a substring from a given list of substrings. Ids can repeat, so my current solution collects them with a set comprehension:

tuples = [('id1', 'cheese trees'), ('id2', 'freezy breeze'), ...]
vals = ['cheese', 'flees']
ids = {i[0] for i in tuples if any(val in i[1] for val in vals)}

output: {'id1'}

Is there an algorithm that would allow doing this quicker? I'm interested in exact substring matches, and possibly in approximate ones as well. The main thing I'm after is an algorithm that offers a speed advantage over the comprehension.

DISCLAIMER: I'm the author of trrex.

For the case of exact matching, one approach is to use a Trie, as mentioned in the comments. trrex is a library that builds a Trie-Regex (a Trie in regex format) that can be used in conjunction with Python's regular expression engine.
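As a minimal sketch, here is that idea applied to the toy data from the question (tx.make builds the trie-shaped pattern; passing empty left/right boundaries gives plain substring matching rather than whole-word matching, the same call used in the benchmark further below):

import re
import trrex as tx

tuples = [('id1', 'cheese trees'), ('id2', 'freezy breeze')]
vals = ['cheese', 'flees']

# One compiled pattern tries all substrings in a single scan of each string,
# sharing common prefixes, instead of scanning once per substring.
pattern = re.compile(tx.make(vals, left='', right=''))
ids = {i for i, s in tuples if pattern.search(s)}
print(ids)  # {'id1'}

The benchmark below compares this approach against the question's set comprehension on a larger dataset: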

import random
import pandas as pd
import trrex as tx
import re

df = pd.read_csv('jeopardy-small.csv')
with open('words-sample') as infile:
    words = [line.strip() for line in infile]


# pair each sentence with a random id (ids repeat, as in the question)
tuples = [(random.randint(1, 250), sentence) for sentence in df['question']]


def fun_kislyuk(ws, ts):
    # baseline: the question's set comprehension
    return {t[0] for t in ts if any(w in t[1] for w in ws)}


def fun_trrex(ws, ts):
    # a single trie-based regex over all the words
    pattern = re.compile(tx.make(ws, left='', right=''))
    return {i for i, s in ts if pattern.search(s)}


if __name__ == "__main__":
    print(fun_trrex(words, tuples) == fun_kislyuk(words, tuples))

Output

True

The timings for the above functions are:

%timeit fun_trrex(words, tuples)
11.3 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_kislyuk(words, tuples)
67.5 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The data is a list of around 2K questions from Jeopardy, plus 500 randomly chosen words. You can find the resources for reproducing the experiments here.

UPDATE

If you add the grouping strategy mentioned in the comments, the time improvement increases further. The idea is to group the strings by id first, so that the search can stop at the first matching string for each id instead of testing every row:

from collections import defaultdict


def fun_grouping_trrex(ws, ts):
    pattern = re.compile(tx.make(ws, left='', right=''))
    # group the strings by id so each id is decided at most once
    groups = defaultdict(list)
    for i, s in ts:
        groups[i].append(s)

    # any() short-circuits on the first matching string per id
    return {i for i, vs in groups.items() if any(pattern.search(v) for v in vs)}

and the timings:

%timeit fun_trrex(words, tuples)
11.2 ms ± 61.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_grouping_trrex(words, tuples)
4.96 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_kislyuk(words, tuples)
67.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The grouping + trrex approach gives roughly a 10x performance improvement over the plain comprehension. But take this last result with a grain of salt, because it is very dependent on the dataset.
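The question also asks about approximate matches, which the trie approach above does not cover. As a hedged sketch (not a trrex feature), one option is the third-party regex module, whose fuzzy-matching syntax {e<=n} allows up to n edits per alternative; the toy data here is hypothetical, and fuzzy matching is considerably slower than exact matching, so benchmark it on your own data:

import regex  # third-party module: pip install regex

# hypothetical toy data; 'chese' is 'cheese' with one letter deleted
tuples = [('id1', 'cheese trees'), ('id2', 'freezy breeze'), ('id3', 'chese fries')]
vals = ['cheese', 'flees']

# {e<=1} permits at most one insertion, deletion, or substitution per word
fuzzy = regex.compile('|'.join('(?:{}){{e<=1}}'.format(regex.escape(v)) for v in vals))
ids = {i for i, s in tuples if fuzzy.search(s)}
print(ids)  # {'id1', 'id3'}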
