How can NumPy be used to speed up building this dictionary of sets?

Question

Say we have 10 million socks, and each of them has a color sockcolor[i] and is stored in drawer[i]. I'd like to count how many different colors there are in each of the 20 drawers.

To do this, I used a dictionary of sets (a set uses a hash table under the hood, which is handy for counting unique elements).

The following code works, but it's quite slow (~10 seconds for 10 million socks).

How could we use NumPy techniques (vectorization?), or otherwise avoid the for loop, to speed up this computation?

import numpy as np
N = 10*1000*1000
drawer = np.random.randint(0, 20, N)            # 20 drawers
sockcolor = np.random.randint(0, 100*1000, N)   # 100k different colors
d = {}

for i, k in enumerate(drawer):
    if k not in d:  # can be simplified with collections.defaultdict but no gain here
        d[k] = set()
    d[k].add(sockcolor[i])

for k, s in d.items():
    print(k, len(s))

Output:

Answer 1

You basically already have a mapping from drawers to sock colors, but it is in random order and you want to organise it by drawer number.

The easiest thing to do is to first sort both arrays by drawer number:

drawer_sort = np.argsort(drawer)
drawer = drawer[drawer_sort]
sockcolor = sockcolor[drawer_sort]
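
As a small illustration of this sorting step, here is a sketch on a four-sock toy input. The toy arrays and the names `order`, `drawer_sorted` and `sockcolor_sorted` are made up for this example. It passes `kind="stable"` only so that the printed result is deterministic; the approach itself does not depend on the order of socks within a drawer.

```python
import numpy as np

# Toy data, invented for illustration: 4 socks spread over 3 drawers.
drawer = np.array([2, 0, 1, 0])
sockcolor = np.array([5, 7, 5, 9])

# argsort returns the permutation that sorts `drawer`; applying the same
# permutation to `sockcolor` keeps each sock paired with its drawer.
order = np.argsort(drawer, kind="stable")
drawer_sorted = drawer[order]
sockcolor_sorted = sockcolor[order]

print(drawer_sorted)     # [0 0 1 2]
print(sockcolor_sorted)  # [7 9 5 5]
```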

Now that they are sorted, there is no need to look for duplicate drawer numbers. You just have to find the indices at which the drawer number changes, which delimit the range of socks belonging to each drawer:

changes, = np.where(drawer[1:]-drawer[:-1])
starts = np.concatenate([[0], changes+1])
ends = np.concatenate([changes+1, [len(drawer)]])  # exclusive ends, so start:end includes the last sock of each drawer
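
To make the range computation concrete, here is a toy example on an already-sorted array (the data is invented for illustration). It uses exclusive end indices, so each end is one past the last sock of a drawer and the `start:end` slices together cover every sock exactly once.

```python
import numpy as np

# Toy sorted drawer numbers, invented for illustration.
drawer = np.array([0, 0, 0, 1, 1, 3])

# Positions i where drawer[i + 1] differs from drawer[i].
changes, = np.where(drawer[1:] - drawer[:-1])
starts = np.concatenate([[0], changes + 1])
ends = np.concatenate([changes + 1, [len(drawer)]])  # exclusive ends

print(changes)  # [2 4]
print(starts)   # [0 3 5]
print(ends)     # [3 5 6]
```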

Now you can create your dictionary:

result = {drawer[start]: sockcolor[start:end] for start, end in zip(starts, ends)}

This way, the only iteration done in Python is the last line, which is really fast when there are only a few distinct drawer values (in your case no more than 20).

The result can still contain duplicate sockcolor values, but that is easily solved in NumPy:

result = {drawer: np.unique(sockcolors) for drawer, sockcolors in result.items()}
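
Putting the steps of this answer together, a minimal end-to-end sketch might look like the following. The function name `unique_colors_per_drawer` and the reduced problem size are assumptions made for this example. It uses `changes + 1` as the exclusive end of each range so that the last sock of each drawer is included, and it cross-checks the counts against the dictionary-of-sets loop from the question.

```python
import numpy as np

def unique_colors_per_drawer(drawer, sockcolor):
    """Map each drawer number to the sorted array of distinct colors in it.

    Hypothetical helper for illustration; assumes non-empty input arrays.
    """
    order = np.argsort(drawer)
    drawer = drawer[order]
    sockcolor = sockcolor[order]
    # Boundaries between runs of equal drawer numbers.
    changes, = np.where(drawer[1:] != drawer[:-1])
    starts = np.concatenate([[0], changes + 1])
    ends = np.concatenate([changes + 1, [len(drawer)]])
    return {int(drawer[s]): np.unique(sockcolor[s:e])
            for s, e in zip(starts, ends)}

# Smaller than the question's 10 million socks so that it runs quickly.
rng = np.random.default_rng(0)
N = 100_000
drawer = rng.integers(0, 20, N)
sockcolor = rng.integers(0, 1000, N)

result = unique_colors_per_drawer(drawer, sockcolor)
counts = {k: len(v) for k, v in result.items()}

# Cross-check against the dictionary-of-sets loop from the question.
expected = {}
for k, c in zip(drawer.tolist(), sockcolor.tolist()):
    expected.setdefault(k, set()).add(c)
assert counts == {k: len(s) for k, s in expected.items()}

for k in sorted(counts):
    print(k, counts[k])
```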

Answer 2

The slowness comes from not using the built-in features of your sequences. Iterate through the individual socks only once: instead of adding them one at a time, assign the sock colors (not the indices of individual socks) to the drawers. Then make a set from each drawer's contents in one wholesale operation, rather than with incremental set.add calls, which are relatively slow for your purposes.
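
This answer does not include code. One possible reading of it, offered as a sketch rather than as the answerer's exact method, is to select each drawer's colors with a boolean mask and build each set in a single call. The helper name `color_sets_per_drawer` and the reduced problem size are assumptions made for this example.

```python
import numpy as np

def color_sets_per_drawer(drawer, sockcolor):
    """Build one set of colors per drawer with a single set() call each.

    Hypothetical helper for illustration only.
    """
    result = {}
    for k in np.unique(drawer).tolist():
        # The boolean mask selects all socks in drawer k at once; tolist()
        # converts to plain Python ints before the set is built.
        result[k] = set(sockcolor[drawer == k].tolist())
    return result

rng = np.random.default_rng(0)
N = 100_000  # smaller than the question's 10 million, to keep the run short
drawer = rng.integers(0, 20, N)
sockcolor = rng.integers(0, 1000, N)

d = color_sets_per_drawer(drawer, sockcolor)
for k, s in d.items():
    print(k, len(s))
```

Note that this sketch scans the whole array once per distinct drawer value. That is reasonable for 20 drawers, but it scales poorly as the number of drawers grows.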

Answer 1: score 2, accepted, 2021-01-05 17:52:54

Answer 2: score 0, 2021-01-05 17:20:30