
Quicker Method to Group Numpy Array Elements Based on Second Numpy Array

There are two NumPy arrays, groups and selectors, where:

  1. selectors is an array of integers that need to be grouped
import numpy as np
np.random.seed(0)

selectors = np.random.randint(0, 300, 5)
# [172  47 117 192 251]
  2. groups is a structured array containing the first index (int) of each group (str)
# Generate groups `a` to `t` and their first index
start = ord('a')
groups = []
for i in range(20):
    e = (i*i, chr(start+i))
    groups.append(e)
groups = np.array(groups, dtype=[('index', np.uint32), ('selector', '|U1')])
groups = np.sort(groups, order='index')

# [(  0, 'a') (  1, 'b') (  4, 'c') (  9, 'd') ( 16, 'e') ( 25, 'f')
#  ( 36, 'g') ( 49, 'h') ( 64, 'i') ( 81, 'j') (100, 'k') (121, 'l')
#  (144, 'm') (169, 'n') (196, 'o') (225, 'p') (256, 'q') (289, 'r')
#  (324, 's') (361, 't')]

Given these example arrays, the desired result after grouping is a dictionary of np.ndarrays or lists:

{
    "g": [47],
    "k": [117],
    "n": [172, 192],
    "p": [251]
}

Is there a quicker way to perform this grouping in NumPy than nesting two loops, as shown below? This would be useful for large selectors arrays with 10-100 million rows and a groups array with 100-1000 rows.

Using Nested Loops

results = {}
for s in selectors:
    for i in range(len(groups)-1):
        if s >= groups[i][0] and s < groups[i+1][0]:
            j = i
            break
    else:
        j = i + 1

    try:
        results[groups[j][1]].append(s)
    except KeyError:
        results[groups[j][1]] = [s]
print(results)
# {'n': [172, 192], 'g': [47], 'k': [117], 'p': [251]}

If you use binary search on each selector, you reduce the running time of the routine from O(len(groups) * len(selectors)) to O(log2(len(groups)) * len(selectors)).
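One way to apply that binary search to all selectors at once is np.searchsorted, NumPy's binary search over a sorted array. The answer does not name this function, so treat the following as a minimal sketch rather than the answer's own method. The helper name group_selectors and the plain starts/names arrays are assumptions for illustration; with the structured array from the question they would correspond to groups['index'] and groups['selector']. The sketch also assumes every selector is at least as large as the first group start.

```python
import numpy as np

def group_selectors(selectors, starts, names):
    """Group selectors by the right-most start <= selector (hypothetical helper).

    Assumes `starts` is sorted ascending and every selector >= starts[0].
    """
    # side='right' counts how many starts are <= each selector;
    # subtracting 1 gives the index of the group each selector falls in.
    idx = np.searchsorted(starts, selectors, side="right") - 1

    # A stable sort by group index makes each group one contiguous run
    # while keeping the original order of selectors inside a group.
    order = np.argsort(idx, kind="stable")
    sorted_idx = idx[order]
    sorted_sel = selectors[order]

    # `first` holds the position where each distinct group index first appears,
    # so splitting at first[1:] yields one chunk per group.
    uniq, first = np.unique(sorted_idx, return_index=True)
    chunks = np.split(sorted_sel, first[1:])
    return {str(names[g]): chunk for g, chunk in zip(uniq, chunks)}

# Same data as the question, written out explicitly.
starts = np.array([i * i for i in range(20)], dtype=np.int64)
names = np.array([chr(ord("a") + i) for i in range(20)])
selectors = np.array([172, 47, 117, 192, 251])

results = group_selectors(selectors, starts, names)
print({k: v.tolist() for k, v in results.items()})
# {'g': [47], 'k': [117], 'n': [172, 192], 'p': [251]}
```

The lookup itself is vectorized; the only Python-level loop is the final dictionary construction, which runs once per non-empty group rather than once per selector.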

The Python documentation on the bisect module explains how to use it to find the right-most element less than or equal to a specified value.
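A minimal sketch of that lookup with bisect_right, replacing only the inner loop of the nested-loop version. The names find_group, group_starts, and group_names are hypothetical, and the group data is rebuilt as plain lists so the example is self-contained.

```python
from bisect import bisect_right

# Group start indices (sorted ascending) and their names, 'a' to 't'.
group_starts = [i * i for i in range(20)]
group_names = [chr(ord("a") + i) for i in range(20)]

def find_group(value):
    """Return the name of the group whose start is the right-most one <= value."""
    # bisect_right returns the insertion point after any entries equal to value,
    # so the entry just before it is the right-most start <= value.
    i = bisect_right(group_starts, value)
    if i == 0:
        raise ValueError("value is below the first group start")
    return group_names[i - 1]

results = {}
for s in [172, 47, 117, 192, 251]:
    results.setdefault(find_group(s), []).append(s)

print(results)
# {'n': [172, 192], 'g': [47], 'k': [117], 'p': [251]}
```

This still loops over selectors in Python, so for tens of millions of rows the per-element overhead is likely to dominate even though each lookup is logarithmic.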
