
Replace each record with closest in numpy array/pandas dataframe

So, the situation is:

I have two numpy 2D arrays/pandas dataframes (it doesn't matter which I use). Each of them contains approximately 10^6 records. Each record is a row of 10 float numbers.

I need to replace each row in the second array (dataframe) with the row from the first one that has the smallest MSE relative to it. I can easily do this with "for" loops, but that sounds horrifyingly slow. Is there a nice numpy/pandas solution that I'm missing?

P.S. For example:

arr1: [[1,2,3],[4,5,6],[7,8,9]]

arr2: [[9,10,11],[3,2,1],[5,5,5]]

The result should be: [[7,8,9],[1,2,3],[4,5,6]]

In this example there are 3 numbers in each record and 3 records in total. I have 10 numbers in each record and around 1,000,000 records in total.

Using a nearest neighbor method should work here, especially if you want to cut down on computation time.

I'll give a simple example using scikit-learn's NearestNeighbors class, though there are probably even more efficient ways to do this.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Example data
X = np.random.randint(1000, size=(10000, 10))
Y = np.random.randint(1000, size=(10000, 10))

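def map_to_nearest(source, query):
    # Build a nearest-neighbor index over the rows of `source`
    neighbors = NearestNeighbors().fit(source)
    # For each row of `query`, find the index of its single closest row in `source`
    indices = neighbors.kneighbors(query, n_neighbors=1, return_distance=False)
    # Return the matching rows of `source` (not `query`), one per query row
    return source[indices.ravel()]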

result = map_to_nearest(X, Y)

I'd note that this is calculating Euclidean distances, not MSE. This should be fine for finding the closest match: MSE is the squared Euclidean distance divided by the number of values per record, so the row that minimizes one also minimizes the other.
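As a quick sanity check of that claim, here is a minimal brute-force sketch on the small example from the question. It uses plain numpy broadcasting rather than scikit-learn, and the variable names (diff, mse, euclid, idx_mse, idx_euclid) are my own. It builds the full pairwise matrix, so it is only meant for small inputs, not for 10^6 records.

```python
import numpy as np

# Example arrays from the question
arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
arr2 = np.array([[9, 10, 11], [3, 2, 1], [5, 5, 5]], dtype=float)

# Pairwise differences, shape (len(arr2), len(arr1), n_values_per_record)
diff = arr2[:, None, :] - arr1[None, :, :]

mse = (diff ** 2).mean(axis=2)             # mean squared error for every pair
euclid = np.sqrt((diff ** 2).sum(axis=2))  # Euclidean distance for every pair

# Index of the closest arr1 row for each arr2 row, under each measure
idx_mse = mse.argmin(axis=1)
idx_euclid = euclid.argmin(axis=1)

# Both measures pick the same rows on this example
assert np.array_equal(idx_mse, idx_euclid)

result = arr1[idx_mse]
print(result)
```

On this example the result matches the expected output given in the question, [[7,8,9],[1,2,3],[4,5,6]].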


 