
Finding the most similar numbers across multiple lists in Python

In Python, I have 3 lists of floating-point numbers (angles) in the range 0-360, and the lists are not the same length. I need to find the triplet (with 1 number from each list) in which the numbers are the closest. (It's highly unlikely that any of the numbers will be identical, since this is real-world data.) I was thinking of using a simple lowest-standard-deviation method to measure agreement, but I'm not sure of a good way to implement this. I could loop through each list, comparing the standard deviation of every possible combination using nested for loops, and have a temporary variable save the indices of the triplet that agrees the best, but I was wondering if anyone had a better or more elegant way to do something like this. Thanks!

I wouldn't be surprised if there is an established algorithm for doing this, and if so, you should use it. But I don't know of one, so I'm going to speculate a little.

If I had to do it, the first thing I would try would be just to loop through all possible combinations of all the numbers and see how long it takes. If your data set is small enough, it's not worth the time to invent a clever algorithm. To demonstrate the setup, I'll include the sample code:

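# setup
import itertools
import statistics

def distance(nplet):
    '''Takes a pair or triplet (an "n-plet") as a list, and returns its distance.
    A smaller return value means better agreement.'''
    # your choice of implementation here. Example (population variance,
    # which ignores the wrap-around from 360 back to 0):
    return statistics.pvariance(nplet)

# algorithm
def brute_force(*lists):
    # try every combination with one number from each list
    return min(itertools.product(*lists), key = distance)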
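Since the question is about angles in the range 0-360, one thing to watch for is that 359.5 and 0.5 are only one degree apart, which a plain variance would miss. The following is a minimal sketch of a wrap-aware metric plugged into the same brute-force search. It assumes the values are degrees that wrap at 360; `angular_difference`, `circular_distance` and `brute_force_circular` are hypothetical names, and the sum of squared pairwise differences is just one possible choice of metric.

```python
import itertools

def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def circular_distance(nplet):
    """Sum of squared pairwise angular differences; smaller means better agreement."""
    return sum(angular_difference(a, b) ** 2
               for a, b in itertools.combinations(nplet, 2))

def brute_force_circular(*lists):
    """Try every combination (one number per list) and keep the tightest one."""
    return min(itertools.product(*lists), key=circular_distance)

# Made-up sample data: the best triplet straddles the 360/0 boundary.
listA = [10.0, 200.0, 359.5]
listB = [45.0, 0.5, 180.0]
listC = [90.0, 1.0, 270.0]
best = brute_force_circular(listA, listB, listC)
print(best)
```

With a plain variance the triplet containing 359.5 would look like a very poor match, so the choice of metric matters more than the choice of search strategy for data like this.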

For a large data set, I would try something like this: first create one triplet for each number in the first list, with its first entry set to that number. Then go through this list of partially-filled triplets and for each one, pick the number from the second list that is closest to the number from the first list and set that as the second member of the triplet. Then go through the list of triplets and for each one, pick the number from the third list that is closest to the first two numbers (as measured by your agreement metric). Finally, take the best of the bunch. This sample code demonstrates how you could try to keep the runtime linear in the length of the lists.

def item_selection(listA, listB, listC):
    # assumes listA and listB are non-empty and sorted in ascending order
    # make the list of partially-filled triplets
    triplets = [[a] for a in listA]
    iT = 0
    iB = 0
    while iT < len(triplets):
        a = triplets[iT][0]
        # advance iB to the first value in listB that is not below a
        while iB < len(listB) and listB[iB] < a:
            iB += 1
        if iB == 0:
            triplets[iT].append(listB[0])
        elif iB == len(listB):
            triplets[iT].append(listB[-1])
        else:
            # look at the values in listB just below and just above a
            # and add the closer one as the second member of the triplet
            dist_lower = distance([a, listB[iB - 1]])
            dist_upper = distance([a, listB[iB]])
            if dist_lower < dist_upper:
                triplets[iT].append(listB[iB - 1])
            elif dist_lower > dist_upper:
                triplets[iT].append(listB[iB])
            else:
                # if they are equidistant, keep both candidates
                triplets[iT].append(listB[iB - 1])
                iT += 1
                triplets[iT:iT] = [[a, listB[iB]]]
        iT += 1
    # then add the number from listC that agrees best with each pair
    # (a plain scan for brevity; a second index walk could keep this step linear)
    for t in triplets:
        t.append(min(listC, key = lambda c: distance(t + [c])))
    return min(triplets, key = distance)
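The "pick the closest number from a sorted list" step in the listing above can also be written with a binary search from the standard `bisect` module. This is a swapped-in technique rather than the index walk the answer describes: each lookup costs O(log N) instead of amortized O(1), but the code is shorter. The names `nearest` and `greedy_triplet` are hypothetical, the metric (max minus min) is an assumption, and wrap-around at 360 is ignored.

```python
import bisect

def nearest(sorted_values, x):
    """Return the element of sorted_values closest to x (ties go to the lower one)."""
    i = bisect.bisect_left(sorted_values, x)
    if i == 0:
        return sorted_values[0]
    if i == len(sorted_values):
        return sorted_values[-1]
    below, above = sorted_values[i - 1], sorted_values[i]
    return below if x - below <= above - x else above

def greedy_triplet(listA, listB, listC):
    """For each a, take the nearest b, then the c nearest to their midpoint."""
    sortedB, sortedC = sorted(listB), sorted(listC)
    candidates = []
    for a in listA:
        b = nearest(sortedB, a)
        c = nearest(sortedC, (a + b) / 2.0)
        candidates.append((a, b, c))
    # Assumed metric: spread of the triplet (max minus min), smaller is better.
    return min(candidates, key=lambda t: max(t) - min(t))

# Made-up sample data
result = greedy_triplet([10.0, 100.0, 250.0], [12.0, 180.0, 300.0], [11.5, 90.0, 240.0])
print(result)
```

Like the listing above, this is a greedy pass, so it can miss the true optimum when the nearest b leads away from every c.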

The thing is, I can imagine situations where this wouldn't actually find the best triplet, for instance if a number from the first list is close to one from the second list but not at all close to anything in the third list. So something you could try is to run this algorithm for all 6 possible orderings of the lists. I can't think of a specific situation where that would fail to find the best triplet, but one might still exist. In any case the algorithm will still be O(N) if you use a clever implementation, assuming the lists are sorted.

def symmetrized_item_selection(listA, listB, listC):
    best_results = []
    for ordering in itertools.permutations([listA, listB, listC]):
        # item_selection returns a single triplet, so append it whole
        best_results.append(item_selection(*ordering))
    return min(best_results, key = distance)
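To make the ordering issue concrete, here is a small self-contained sketch on made-up data where a single greedy pass misses the best triplet but trying all six orderings recovers it. The helper names (`spread`, `greedy`, `symmetrized_greedy`) are hypothetical, the metric is assumed to be max minus min, and the greedy pass uses plain scans instead of the linear-time walk, purely to keep the example short.

```python
import itertools

def spread(values):
    """Assumed agreement metric: max minus min (smaller is better)."""
    return max(values) - min(values)

def greedy(first, second, third):
    """Greedy pass: for each x, take the nearest y, then the z that minimizes the spread."""
    candidates = []
    for x in first:
        y = min(second, key=lambda v: abs(v - x))
        z = min(third, key=lambda v: spread((x, y, v)))
        candidates.append((x, y, z))
    return min(candidates, key=spread)

def symmetrized_greedy(*lists):
    """Run the greedy pass for every ordering of the lists and keep the best result."""
    return min((greedy(*order) for order in itertools.permutations(lists)), key=spread)

listA, listB, listC = [10.0], [9.0, 12.0], [13.0]
one_pass = greedy(listA, listB, listC)                 # commits to 9.0, the nearest b
all_orders = symmetrized_greedy(listA, listB, listC)   # members follow the ordering tried
exact = min(itertools.product(listA, listB, listC), key=spread)
print(spread(one_pass), spread(all_orders), spread(exact))
```

Here the single pass commits to 9.0 because it is nearest to 10.0, even though 12.0 sits closer to the only value in the third list. This example does not show that the symmetrized version is always optimal, only that it fixes this particular failure.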

Another option might be to compute all possible pairs of numbers between list 1 and list 2, between list 1 and list 3, and between list 2 and list 3. Then sort all three lists of pairs together, from best to worst agreement between the two numbers. Starting with the closest pair, go through the list pair by pair and any time you encounter a pair which shares a number with one you've already seen, merge them into a triplet. For a suitable measure of agreement, once you find your first triplet, that will give you a maximum pair distance that you need to iterate up to, and once you get up to it, you just choose the closest triplet of the ones you've found. I think that should consistently find the best possible triplet, but it will be O(N^2 log N) because of the requirement for sorting the lists of pairs.

import collections
import itertools

def pair_sorting(listA, listB, listC):
    # make all possible pairs of values from two lists
    # each pair has the structure ((number, origin_list),(number, origin_list))
    # so we know which lists the numbers came from
    all_pairs = []
    all_pairs += [((nA,0), (nB,1)) for (nA,nB) in itertools.product(listA,listB)]
    all_pairs += [((nA,0), (nC,2)) for (nA,nC) in itertools.product(listA,listC)]
    all_pairs += [((nB,1), (nC,2)) for (nB,nC) in itertools.product(listB,listC)]
    # distance() takes a single list, so wrap the two numbers
    all_pairs.sort(key = lambda p: distance([p[0][0], p[1][0]]))
    # make a dict to track which (number, origin_list)s we've already seen
    pairs_by_number_and_list = collections.defaultdict(list)
    min_distance = float('inf')
    min_triplet = None
    # start with the closest pair
    for pair in all_pairs:
        # for the first value of the current pair, see if we've seen that particular
        # (number, origin_list) combination before
        for pair2 in pairs_by_number_and_list[pair[0]]:
            # if so, that means the current pair shares its first value with
            # another pair, so put the 3 unique values together to make a triplet
            members = (pair[1], pair2[0], pair2[1])
            # skip combinations that take two numbers from the same list
            if len({origin for (_, origin) in members}) < 3:
                continue
            this_triplet = tuple(number for (number, _) in members)
            # check if the triplet agrees more than the previous best triplet
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # do the same thing but checking the second element of the current pair
        for pair2 in pairs_by_number_and_list[pair[1]]:
            members = (pair[0], pair2[0], pair2[1])
            if len({origin for (_, origin) in members}) < 3:
                continue
            this_triplet = tuple(number for (number, _) in members)
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # finally, add the current pair to the list of pairs we've seen
        pairs_by_number_and_list[pair[0]].append(pair)
        pairs_by_number_and_list[pair[1]].append(pair)
    return min_triplet
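The listing above visits every pair and does not include the early exit mentioned in the description of this approach. Below is a rough sketch of how that cutoff could look for one "suitable measure of agreement", assumed here to be the spread (max minus min) of the triplet. With that metric every pair inside a triplet differs by at most the triplet's spread, so once the next pair is farther apart than the best spread found so far, no better triplet can appear. The function name `closest_triplet_by_pairs` is hypothetical, and wrap-around at 360 is ignored.

```python
import collections
import itertools

def closest_triplet_by_pairs(listA, listB, listC):
    lists = (listA, listB, listC)
    # Tag each number with the index of the list it came from.
    pairs = []
    for i, j in itertools.combinations(range(3), 2):
        pairs += [((x, i), (y, j)) for x in lists[i] for y in lists[j]]
    pairs.sort(key=lambda p: abs(p[0][0] - p[1][0]))

    # tagged number -> tagged numbers it has already been paired with
    partners = collections.defaultdict(list)
    best, best_spread = None, float("inf")
    for p, q in pairs:
        # Early exit: remaining pairs are farther apart than the best triplet's spread.
        if abs(p[0] - q[0]) > best_spread:
            break
        for shared, other in ((p, q), (q, p)):
            for third in partners[shared]:
                if third[1] == other[1]:
                    continue  # would use two numbers from the same list
                values = (shared[0], other[0], third[0])
                s = max(values) - min(values)
                if s < best_spread:
                    # Report the triplet in (listA, listB, listC) order.
                    ordered = sorted((shared, other, third), key=lambda t: t[1])
                    best = tuple(v for v, _ in ordered)
                    best_spread = s
        partners[p].append(q)
        partners[q].append(p)
    return best

# Made-up sample data
found = closest_triplet_by_pairs([10.0, 200.0], [9.0, 12.0, 205.0], [13.0, 100.0])
print(found)
```

The sort still dominates at O(N^2 log N), so the cutoff only trims the merging phase. For a metric such as variance the stopping threshold would need to be derived differently.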

NB I've written all the code samples in this answer out a little more explicitly than you'd do it in practice to help you to understand how they work. But when doing it for real, you'd use more list comprehensions and such things.

NB2. No guarantees that the code works :-P but it should get the rough idea across.


 