Matching elements in numpy arrays

Question

I have two numpy arrays. The first, Z1, is about 300,000 rows long and 3 columns wide. The second, Z2, is about 200,000 rows and 300 columns. Each row of both Z1 and Z2 has a 10-digit identifying number. Z2 contains a subset of the items in Z1, and I want to match the rows in Z2 with their partners in Z1 based on the 10-digit identifying number, then take columns 2 and 3 from Z1 and insert them at the end of Z2 in their appropriate rows.

Neither Z1 nor Z2 is in any particular order.

The only way I've come up with to do this is by iterating over the arrays, which takes hours. Is there a better way to do this in Python?

Thanks!

Answer 1

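I understand from your question that the 10-digit identifier is stored in the first column (index 0), right?

This is not very easy to follow because there is a lot of indirection going on, but in the end unsorted_insert holds, for each identifier in Z2, the row number in Z1 where that identifier is found. np.searchsorted with the sorter argument returns positions in the sorted order of Z1's identifiers, and indexing sort_idx with those positions maps them back to rows of the original, unsorted Z1: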

sort_idx = np.argsort(Z1[:, 0])
sorted_insert = np.searchsorted(Z1[:, 0], Z2[:, 0], sorter=sort_idx)
# The following is equivalent to unsorted_insert = sort_idx[sorted_insert] but faster
unsorted_insert = np.take(sort_idx, sorted_insert)

So now all we need to do is fetch the last two columns of those rows and stack them onto the Z2 array:

new_Z2 = np.hstack((Z2, Z1[unsorted_insert, 1:]))
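
To make the indirection easier to follow, here is a minimal sketch on toy data. The small arrays and their values are made up for illustration and are not from the question; only the sequence of numpy calls mirrors the steps above.

```python
import numpy as np

# Toy data (assumed for illustration): column 0 holds the identifier.
Z1 = np.array([[30, 3.0, 0.3],
               [10, 1.0, 0.1],
               [20, 2.0, 0.2]])
Z2 = np.array([[20, 7.0],
               [30, 8.0]])

# Rows of Z1 listed in ascending identifier order.
sort_idx = np.argsort(Z1[:, 0])

# Position of each Z2 identifier within that sorted order.
sorted_insert = np.searchsorted(Z1[:, 0], Z2[:, 0], sorter=sort_idx)

# Map positions in the sorted order back to rows of the unsorted Z1.
unsorted_insert = sort_idx[sorted_insert]

# Append columns 2 and 3 of the matching Z1 rows to Z2.
new_Z2 = np.hstack((Z2, Z1[unsorted_insert, 1:]))
print(new_Z2)
```

Each row of new_Z2 keeps its original Z2 values and gains the two extra columns from the Z1 row that carries the same identifier.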

A made-up example that runs with no issues:

import numpy as np

z1_rows, z1_cols = 300000, 3
z2_rows, z2_cols = 200000, 300

z1 = np.arange(z1_rows*z1_cols).reshape(z1_rows, z1_cols)

z2 = np.random.randint(10000, size=(z2_rows, z2_cols))
z2[:, 0] = z1[np.random.randint(z1_rows, size=(z2_rows,)), 0]

sort_idx = np.argsort(z1[:, 0])
sorted_insert = np.searchsorted(z1[:, 0], z2[:, 0], sorter=sort_idx)
# The following is equivalent to unsorted_insert = sort_idx[sorted_insert] but faster
unsorted_insert = np.take(sort_idx, sorted_insert)
new_z2 = np.hstack((z2, z1[unsorted_insert, 1:]))

I haven't timed it, but the whole thing seems to complete in a couple of seconds at most.
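
One caveat worth checking: np.searchsorted returns insertion points, not exact matches, so an identifier in Z2 that is missing from Z1 would silently be paired with a neighbouring row (or produce an out-of-range index). The question states that Z2 is a subset of Z1, so this should not happen, but a sanity check is cheap. The sketch below is one possible way to do it; the helper name match_rows and the toy identifiers are hypothetical, and it assumes the identifier array from Z1 is non-empty.

```python
import numpy as np

def match_rows(z1_ids, z2_ids):
    """Hypothetical helper: return (rows, found).

    rows[i] is the row in z1_ids matched to z2_ids[i];
    found[i] is True only when the identifiers are actually equal.
    Assumes z1_ids is non-empty.
    """
    sort_idx = np.argsort(z1_ids)
    pos = np.searchsorted(z1_ids, z2_ids, sorter=sort_idx)
    # Ids larger than every id in z1_ids get position len(z1_ids); clip so indexing is safe.
    pos = np.clip(pos, 0, len(z1_ids) - 1)
    rows = sort_idx[pos]
    found = z1_ids[rows] == z2_ids
    return rows, found

# Toy identifiers (made up): 99 and 15 do not occur in z1_ids.
z1_ids = np.array([30, 10, 20])
z2_ids = np.array([20, 99, 10, 15])
rows, found = match_rows(z1_ids, z2_ids)
print(rows[found], found)
```

If found.all() is True, the rows can be used directly as in the answer; otherwise the rows where found is False need separate handling.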

Answer 1: 2, accepted, 2013-06-28 23:45:01