
Python: nearest neighbour (or closest match) filtering on data records (list of tuples)

I am trying to write a function that will filter a list of tuples (mimicking an in-memory database), using a "nearest neighbour" or "nearest match" type algorithm.

I want to know the best (i.e. most Pythonic) way to go about doing this. The sample code below hopefully illustrates what I am trying to do.

datarows = [(10,2.0,3.4,100),
            (11,2.0,5.4,120),
            (17,12.9,42,123)]

filter_record = (9,1.9,2.9,99) # record that we are seeking to retrieve from 'database' (or nearest match)
weights = (1,1,1,1) # weights to apportion to each field in the filter

def get_nearest_neighbour(data, criteria, weights):
    for row in data:
        # calculate 'distance metric' (e.g. simple differencing) and multiply by relevant weight
        pass
    # determine the row which was either an exact match or was 'least dissimilar'
    # return the match (or nearest match)
    pass

if __name__ == '__main__':
    result = get_nearest_neighbour(datarows, filter_record, weights)
    print(result)

For the snippet above, the output should be:

(10,2.0,3.4,100)

since it is the 'nearest' to the sample data passed to the function get_nearest_neighbour().

My question then is: what is the best way to implement get_nearest_neighbour()? For the sake of brevity, assume that we are only dealing with numeric values, and that the 'distance metric' we use is simply an arithmetic subtraction of the input data from the current row.

Simple out-of-the-box solution:

import math

def distance(row_a, row_b, weights):
    diffs = [math.fabs(a-b) for a,b in zip(row_a, row_b)]
    return sum([v*w for v,w in zip(diffs, weights)])

def get_nearest_neighbour(data, criteria, weights):
    def sort_func(row):
        return distance(row, criteria, weights)
    return min(data, key=sort_func)
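As a quick check of the approach above, here is a minimal sketch that scores each row from the question against filter_record and then picks the minimum. It uses the built-in abs() instead of math.fabs, and the names scored and nearest_row are only illustrative.

```python
datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)
weights = (1, 1, 1, 1)

def distance(row_a, row_b, weights):
    # Weighted sum of the absolute per-field differences.
    return sum(w * abs(a - b) for a, b, w in zip(row_a, row_b, weights))

# Score every row against the record we are looking for.
scored = [(distance(row, filter_record, weights), row) for row in datarows]
for score, row in scored:
    print("%.1f" % score, row)

# The nearest row is the one with the smallest score.
nearest_row = min(datarows, key=lambda row: distance(row, filter_record, weights))
print(nearest_row)
```

With unit weights the three scores come out at roughly 2.6, 25.6 and 82.1, so the first row is selected, matching the output the question asks for.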

If you need to work with huge datasets, you should consider switching to NumPy and using a k-d tree (available in SciPy as scipy.spatial.KDTree, which operates on NumPy arrays) to find nearest neighbors. The advantage of NumPy is not only access to more advanced algorithms, but also that its array operations run in compiled code, with its linear algebra routines implemented on top of the highly optimized LAPACK (Linear Algebra PACKage).

About naive-NN:

Many of these other answers propose "naive nearest-neighbor", which is an O(N*d)-per-query algorithm (d is the dimensionality, which in this case seems constant, so it's O(N)-per-query).

While an O(N)-per-query algorithm is pretty bad, you might be able to get away with it if your workload is smaller than any of the following (for example):

  • 10 queries and 100000 points
  • 100 queries and 10000 points
  • 1000 queries and 1000 points
  • 10000 queries and 100 points
  • 100000 queries and 10 points
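The thresholds in the list share one property: the product of queries and points is the same in every case. A small sketch of that arithmetic follows. The value dims = 4 is an assumption taken from the four-field records in the question, and the variable names are illustrative.

```python
# (queries, points) pairs from the list above.
workloads = [(10, 100000), (100, 10000), (1000, 1000), (10000, 100), (100000, 10)]
dims = 4  # assumption: four numeric fields per record, as in the question

costs = []
for queries, points in workloads:
    comparisons = queries * points   # row-to-row distance evaluations
    field_ops = comparisons * dims   # per-field differences, the O(N*d) part
    costs.append((comparisons, field_ops))
    print(queries, points, comparisons, field_ops)
```

Every line works out to one million distance evaluations, which is the rough budget these thresholds imply for the naive approach.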

Doing better than naive-NN:

Otherwise you will want to use one of the standard nearest-neighbor search techniques (especially a nearest-neighbor data structure), especially if you plan to run your program more than once. Libraries for this most likely exist. Without an NN data structure, the naive approach takes too much time once the product #queries * #points is large. As user 'dsign' points out in the comments, you can probably squeeze out a large additional constant-factor speedup by using the numpy library.
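To give a feel for what a nearest-neighbor data structure does, here is a minimal pure-Python k-d tree sketch for the weighted absolute-difference metric from the question. It is an illustration, not a tuned implementation: the names Node, build_kdtree, weighted_distance and nearest are hypothetical, and the pruning step assumes all weights are non-negative. In practice a library implementation would likely be the better choice.

```python
from collections import namedtuple

# Hypothetical helper type for this sketch: one node of a k-d tree.
Node = namedtuple("Node", "point axis left right")

def build_kdtree(points, depth=0):
    """Build a k-d tree by splitting on the median of one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1))

def weighted_distance(a, b, weights):
    # Weighted sum of absolute per-field differences.
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))

def nearest(node, target, weights, best=None):
    """Return (distance, point) for the stored point closest to target."""
    if node is None:
        return best
    d = weighted_distance(node.point, target, weights)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = target[node.axis] - node.point[node.axis]
    # Search the side of the split that contains the target first.
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, weights, best)
    # The far side can only hold a closer point if the splitting plane
    # itself is closer than the best match found so far
    # (valid for non-negative weights).
    if weights[node.axis] * abs(diff) < best[0]:
        best = nearest(far, target, weights, best)
    return best

datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
tree = build_kdtree(datarows)
dist, point = nearest(tree, (9, 1.9, 2.9, 99), (1, 1, 1, 1))
print(point)  # (10, 2.0, 3.4, 100)
```

The tree is built once and reused for every query, which is where the savings over the naive scan come from when there are many queries against the same data.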

However, if you can get away with the simple-to-implement naive-NN, you should use it.

Use heapq.nsmallest on a generator that calculates the weighted distance for each record (nsmallest rather than nlargest, since the nearest records are the ones with the smallest distance).

Something like:

import heapq, operator
heapq.nsmallest(N, ((row, dist_function(row, criteria, weights)) for row in data), key=operator.itemgetter(1))
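A self-contained sketch of the heapq idea, using the data from the question. Because the nearest records are those with the smallest distance, this version calls heapq.nsmallest. The dist_function body and the value N = 2 are assumptions made for illustration.

```python
import heapq

datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)
weights = (1, 1, 1, 1)

def dist_function(row, criteria, weights):
    # Assumed metric: weighted sum of absolute per-field differences.
    return sum(w * abs(a - b) for a, b, w in zip(row, criteria, weights))

N = 2  # hypothetical: how many nearest records to return
nearest_rows = heapq.nsmallest(
    N, datarows, key=lambda row: dist_function(row, filter_record, weights))
print(nearest_rows)  # [(10, 2.0, 3.4, 100), (11, 2.0, 5.4, 120)]
```

This is still a naive scan over every record, but it returns the N closest matches in order rather than just the single best one.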
