
Effectively search if part of a tuple exists in a list of tuples

I have a list of tuples, each containing 6 numbers in the range 01 to 99. For example:

tuple_list = {(01,02,03,04,05,06), (20,22,24,26,28,30), (02,03,04,05,06,99)}

For every tuple in this list I need to efficiently find any other tuples that have at least 5 numbers in common with it (excluding the tuple being searched for). So for the above example, the result will be:

(01,02,03,04,05,06) -> (02,03,04,05,06,99)
(20,22,24,26,28,30) -> []
(02,03,04,05,06,99) -> (01,02,03,04,05,06)

The list itself is big and can hold up to 1,000,000 records.
I tried the naive approach of comparing every tuple against every other one, but this has O(n^2) complexity and takes a lot of time.
I thought about using a dict, but I can't find a way to search for part of a key (it would have worked fine if I needed to search for the exact key). Maybe some sort of suffix/prefix tree variation is needed, but I can't seem to figure it out.

Any help will be appreciated.

The code below generates a dict where the key is a 5-tuple and the value is a list of all the tuples that contain those 5 elements.

It runs in O(nm), where n is the size of the tuple list and m is the size of each tuple. For 6-tuples, it runs in O(6n). See the test results below.

def getCombos(tup):
    """
    Produces all combinations of the tuple with 1 missing
    element from the original
    """
    combos = []
    # sort first so the resulting keys do not depend on element order
    items = sorted(tup)
    for i in range(len(items)):
        tupAsList = list(items)
        del tupAsList[i]
        combos.append(tupAsList)
    return combos
    
def getKey(combo):
    """
    Creates a string key for a given combination
    """
    strCombo = [str(i) for i in combo]
    return ",".join(strCombo)

def findMatches(tuple_list):
    """
    Returns dict of tuples that match
    """
    matches = {}

    for tup in tuple_list:
        combos = getCombos(tup)
        for combo in combos:
            key = getKey(combo)
            if key in matches:
                matches[key].append(tup)
            else:
                matches[key] = [tup]
                
    # filter out matches with less than 2 elements (optional)
    matches = {k: v for k, v in matches.items() if len(v) > 1}

    return matches
    
    
# written without leading zeros: literals like 01 are a syntax error in Python 3
tuple_list = [(1,2,3,4,5,6), (20,22,24,26,28,30), (2,3,4,5,6,99)]

print(findMatches(tuple_list)) # output: {'2,3,4,5,6': [(1, 2, 3, 4, 5, 6), (2, 3, 4, 5, 6, 99)]}

I tested this code against the brute force solution. For 1000 tuples, the brute force version took 5.5s whereas this solution took 0.03s.
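
For reference, a brute-force baseline of the kind used in such a comparison might look like the sketch below. The original test code is not shown here, so the name `brute_force_matches`, the `min_common` parameter, and the output format are assumptions made for illustration.

```python
from itertools import combinations


def brute_force_matches(tuple_list, min_common=5):
    """Compare every pair of tuples directly: O(n^2) set intersections.

    Hypothetical helper for cross-checking, not the original test code.
    """
    as_sets = [set(t) for t in tuple_list]
    result = {t: [] for t in tuple_list}
    for i, j in combinations(range(len(tuple_list)), 2):
        if len(as_sets[i] & as_sets[j]) >= min_common:
            # a match is symmetric, so record it in both directions
            result[tuple_list[i]].append(tuple_list[j])
            result[tuple_list[j]].append(tuple_list[i])
    return result


sample = [(1, 2, 3, 4, 5, 6), (20, 22, 24, 26, 28, 30), (2, 3, 4, 5, 6, 99)]
print(brute_force_matches(sample))
```

Running both versions on the same small input and comparing the results is a cheap way to check the faster code before using it on the full list.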

You can rearrange the output by iterating through the values, but that may be unnecessary depending on how you're using it.
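
One possible way to do that rearrangement is sketched below. It builds the same kind of 5-element keys (as tuples rather than strings), then turns each bucket into a per-tuple mapping like the one shown in the question. The name `matches_per_tuple` is made up for this sketch, and it assumes every tuple holds 6 distinct numbers.

```python
from collections import defaultdict


def matches_per_tuple(tuple_list):
    """Map each tuple to the other tuples sharing at least 5 numbers with it.

    Hypothetical helper; assumes the numbers inside each tuple are distinct.
    """
    buckets = defaultdict(list)
    for tup in tuple_list:
        items = sorted(tup)
        for i in range(len(items)):
            # key = the sorted numbers with element i left out
            buckets[tuple(items[:i] + items[i + 1:])].append(tup)

    result = {tup: set() for tup in tuple_list}
    for group in buckets.values():
        for tup in group:
            # every other tuple in the same bucket shares these 5 numbers
            result[tup].update(other for other in group if other != tup)
    return result


sample = [(1, 2, 3, 4, 5, 6), (20, 22, 24, 26, 28, 30), (2, 3, 4, 5, 6, 99)]
for tup, others in matches_per_tuple(sample).items():
    print(tup, "->", sorted(others))
```

The second loop costs time proportional to the square of each bucket's size, so it stays cheap only as long as few tuples share the same 5 numbers.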

This process is inherently O(N^2): you're making a comparison of N items to each of the other N-1 items. This is a distance metric, and the same theoretical results apply (you can look up all-to-all distance algorithms on Stack Overflow and elsewhere). In most cases, there is not enough information to gather from f(A, B) and f(B, C) to predict whether f(A, C) is greater or less than 5.

First of all, quit using tuples: they don't match your use case. Tuples are indexed, and you don't care about the ordering. (01, 02, 03, 04, 05, 06) and (05, 99, 03, 02, 01, 06) match in five numbers, despite having only two positional matches.

Use the natural data type: sets. Your comparison operation is len(A.intersection(B)). Note that you can flip the logic to a straight distance metric, mismatch = len(A - B), and apply a little triangle logic, given that all the sets are the same size (see "triangle inequality"). For instance, if len(A - B) is 1, then 5 numbers match. If you also find that len(A - C) is 5, then you know that C differs from B in at least 4 numbers (4, 5, or 6, depending on which numbers match), so B and C cannot share 5 numbers.
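
As a small illustration of those set operations, here is a sketch with made-up values: `a` and `b` are the two example tuples from the previous paragraph written as sets, and `c` is a hypothetical third set chosen so that it shares only one number with `a`.

```python
a = {1, 2, 3, 4, 5, 6}
b = {5, 99, 3, 2, 1, 6}
c = {1, 99, 40, 41, 42, 43}  # hypothetical third set

common_ab = len(a & b)    # same as len(a.intersection(b))
mismatch_ab = len(a - b)  # numbers in a that are missing from b
mismatch_ac = len(a - c)
mismatch_bc = len(b - c)

# Triangle inequality for equal-size sets:
# mismatch_bc can be no smaller than mismatch_ac - mismatch_ab.
print(common_ab, mismatch_ab, mismatch_ac, mismatch_bc)  # prints: 5 1 5 4
```

Here `b` and `c` need not be compared directly to rule out a match: the two distances from `a` already guarantee at least 4 mismatches.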

Given the sparsity of your sets (6 numbers out of 99 possible values), you can gain a small amount of performance here... but the overhead and extra checking will likely consume your savings, and the resulting algorithm is still O(N^2).
