
Matching elements in numpy array

I have two numpy arrays. The first, Z1, is about 300,000 rows long and 3 columns wide. The second, Z2, is about 200,000 rows and 300 columns. Each row of both Z1 and Z2 has a 10-digit identifying number. Z2 contains a subset of the items in Z1, and I want to match the rows in Z2 with their partners in Z1 based on the 10-digit identifying number, then take columns 2 and 3 from Z1 and insert them at the end of Z2 in their appropriate rows.

Neither Z1 nor Z2 is in any particular order.

The only way I've come up with to do this is by iterating over the arrays, which takes hours. Is there a better way to do this in Python?

Thanks!

I understand from your question that the 10-digit identifier is stored in the first column (index 0) of both arrays, right?

This is not very easy to follow because there is a lot of indirection going on, but in the end unsorted_insert holds, for each identifier in Z2, the row number in Z1 where that identifier is found:

sort_idx = np.argsort(Z1[:, 0])
sorted_insert = np.searchsorted(Z1[:, 0], Z2[:, 0], sorter=sort_idx)
# The following is equivalent to unsorted_insert = sort_idx[sorted_insert] but faster
unsorted_insert = np.take(sort_idx, sorted_insert)
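To make the indirection concrete, here is a minimal sketch on a toy pair of identifier columns. The values and the names z1_ids and z2_ids are made up for illustration and are not from the question; the three steps are the same as above.

```python
import numpy as np

# Hypothetical toy identifiers, deliberately out of order
z1_ids = np.array([30, 10, 40, 20])
z2_ids = np.array([20, 40, 10])

# Indices that would sort z1_ids: sorted order is 10, 20, 30, 40
sort_idx = np.argsort(z1_ids)
# Position of each z2 identifier within the *sorted* view of z1_ids
sorted_insert = np.searchsorted(z1_ids, z2_ids, sorter=sort_idx)
# Translate sorted positions back to row numbers in the original z1_ids
unsorted_insert = np.take(sort_idx, sorted_insert)

print(sort_idx)                 # [1 3 0 2]
print(sorted_insert)            # [1 3 0]
print(unsorted_insert)          # [3 2 1]
print(z1_ids[unsorted_insert])  # [20 40 10]
```

The last line reproduces z2_ids, which is the sign that each Z2 row has been paired with the right Z1 row.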

So now all we need to do is fetch the last two columns of those rows and append them to the Z2 array:

new_Z2 = np.hstack((Z2, Z1[unsorted_insert, 1:]))
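One caveat: np.searchsorted returns an insertion point even when a value is absent, so the lookup is only right if every identifier in Z2 really does occur in Z1, as the question states. If that is in doubt, a check along these lines may help. This is a sketch, not part of the approach above; the helper name match_rows and the found mask are hypothetical, and it assumes Z1 is non-empty.

```python
import numpy as np

def match_rows(z1_ids, z2_ids):
    """Return (rows, found): candidate z1 row per z2 id, and whether it truly matches.

    Hypothetical helper for illustration; assumes z1_ids is non-empty.
    """
    sort_idx = np.argsort(z1_ids)
    pos = np.searchsorted(z1_ids, z2_ids, sorter=sort_idx)
    # Values larger than every z1 id get position len(z1_ids); clip so take() stays in bounds
    pos = np.clip(pos, 0, len(z1_ids) - 1)
    rows = np.take(sort_idx, pos)
    # A row is a real match only if the identifiers are actually equal
    found = z1_ids[rows] == z2_ids
    return rows, found

z1_ids = np.array([30, 10, 40, 20])
z2_ids = np.array([20, 99, 10, 25])  # 99 and 25 are not in z1_ids
rows, found = match_rows(z1_ids, z2_ids)
print(found)  # [ True False  True False]
```

Rows where found is False could then be dropped or filled with a placeholder before stacking, depending on what makes sense for the data.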

A made-up example that runs with no issues:

import numpy as np

z1_rows, z1_cols = 300000, 3
z2_rows, z2_cols = 200000, 300

z1 = np.arange(z1_rows*z1_cols).reshape(z1_rows, z1_cols)

z2 = np.random.randint(10000, size=(z2_rows, z2_cols))
z2[:, 0] = z1[np.random.randint(z1_rows, size=(z2_rows,)), 0]

sort_idx = np.argsort(z1[:, 0])
sorted_insert = np.searchsorted(z1[:, 0], z2[:, 0], sorter=sort_idx)
# The following is equivalent to unsorted_insert = sort_idx[sorted_insert] but faster
unsorted_insert = np.take(sort_idx, sorted_insert)
new_z2 = np.hstack((z2, z1[unsorted_insert, 1:]))

I haven't timed it, but the whole thing seems to complete in under a couple of seconds.
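For anyone who wants to time it and confirm the pairing, here is a scaled-down sketch. The sizes, the seed, and the shuffled 10-digit identifiers are assumptions chosen to keep memory use small; scale the row counts up to match the real data.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
z1_rows, z2_rows, z2_cols = 30_000, 20_000, 30  # assumed, smaller than the real arrays

# Unique 10-digit identifiers in shuffled order, plus two payload columns
ids = rng.permutation(z1_rows) + 1_000_000_000
z1 = np.column_stack((ids, rng.integers(10_000, size=(z1_rows, 2))))

# z2 holds a random subset of those identifiers in its first column
z2 = rng.integers(10_000, size=(z2_rows, z2_cols))
z2[:, 0] = rng.choice(ids, size=z2_rows, replace=False)

start = time.perf_counter()
sort_idx = np.argsort(z1[:, 0])
sorted_insert = np.searchsorted(z1[:, 0], z2[:, 0], sorter=sort_idx)
unsorted_insert = np.take(sort_idx, sorted_insert)
new_z2 = np.hstack((z2, z1[unsorted_insert, 1:]))
elapsed = time.perf_counter() - start

# Every matched z1 row must carry the same identifier as its z2 row
assert np.array_equal(z1[unsorted_insert, 0], z2[:, 0])
print(f"matched {z2_rows} rows in {elapsed:.4f} s, result shape {new_z2.shape}")
```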
