Replace each record with closest in numpy array/pandas dataframe

So, the situation is:

I have two 2D numpy arrays/pandas dataframes (it doesn't matter which I use). Each of them contains approximately 10^6 records. Each record is a row of 10 float numbers.

I need to replace each row in the second array (dataframe) with the row from the first one that has the smallest MSE compared to it. I can easily do it with "for" loops, but that sounds horrifyingly slow. Is there a nice and beautiful numpy/pandas solution I don't see?

PS For example:

arr1: [[1,2,3],[4,5,6],[7,8,9]]

arr2: [[9,10,11],[3,2,1],[5,5,5]]

result should be: [[7,8,9],[1,2,3],[4,5,6]]

In this example there are 3 numbers in each record and 3 records in total. I have 10 numbers in each record and around 1,000,000 records in total.

Using a nearest neighbor method should work here, especially if you want to cut down on computation time.

I'll give a simple example using scikit-learn's NearestNeighbors class, though there are probably even more efficient ways to do this.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Example data
X = np.random.randint(1000, size=(10000, 10))
Y = np.random.randint(1000, size=(10000, 10))

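def map_to_nearest(source, query):
    # Index the rows of `source`, then find the single nearest row for each row of `query`.
    neighbors = NearestNeighbors(n_neighbors=1).fit(source)
    indices = neighbors.kneighbors(query, return_distance=False)
    # Return rows taken from `source` (not `query`), one per query row.
    return source[indices.ravel()]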

result = map_to_nearest(X, Y)
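
As a quick sanity check, the same idea can be run on the small example from the question. This is only a sketch: `nearest_rows` is a hypothetical helper name chosen for illustration, and it assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Data from the question: rows of arr2 get replaced by their closest row in arr1.
arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
arr2 = np.array([[9, 10, 11], [3, 2, 1], [5, 5, 5]], dtype=float)

def nearest_rows(source, query):
    """For each row of `query`, return the closest row of `source` (Euclidean)."""
    nn = NearestNeighbors(n_neighbors=1).fit(source)
    idx = nn.kneighbors(query, return_distance=False)  # shape (len(query), 1)
    return source[idx.ravel()]

replaced = nearest_rows(arr1, arr2)
print(replaced)
```

The printed rows should match the result the question asks for, namely [7,8,9], [1,2,3] and [4,5,6].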

I'd note that this is calculating euclidean distances, not MSE. This should be fine for finding the closest match: MSE is the squared euclidean distance divided by the number of components, so the row that minimizes one also minimizes the other.
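
To illustrate why the two criteria pick the same rows, here is a small brute-force comparison in plain numpy on the example from the question. It is meant only as a demonstration: it builds the full pairwise difference array, which would not fit in memory for around 10^6 rows on each side.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
arr2 = np.array([[9, 10, 11], [3, 2, 1], [5, 5, 5]], dtype=float)

# diff[i, j] holds arr2[i] - arr1[j]; shape is (len(arr2), len(arr1), n_features).
diff = arr2[:, None, :] - arr1[None, :, :]

mse = (diff ** 2).mean(axis=2)              # mean squared error per pair
euclid = np.sqrt((diff ** 2).sum(axis=2))   # euclidean distance per pair

# Index of the best-matching arr1 row for each arr2 row, under each criterion.
nearest_by_mse = mse.argmin(axis=1)
nearest_by_euclid = euclid.argmin(axis=1)

replaced = arr1[nearest_by_mse]
print(nearest_by_mse, nearest_by_euclid)
print(replaced)
```

Both index arrays agree because taking a square root and dividing by a positive constant do not change which entry is the smallest.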
