
Numpy group by multiple vectors, get group indices

I have several numpy arrays; I want to build a groupby method that would produce group ids for these arrays. It would then allow me to index these arrays by group id to perform operations on the groups.

For example:

import numpy as np
import pandas as pd
a = np.array([1,1,1,2,2,3])
b = np.array([1,2,2,2,3,3])

def group_np(groupcols):
    # Build a string key per row by concatenating the values from each column
    groupby = np.array([''.join(str(b) for b in bs) for bs in zip(*groupcols)])
    # Map each unique key to a dense integer group id
    _, groupby = np.unique(groupby, return_inverse=True)
    return groupby

def group_pd(groupcols):
    df = pd.DataFrame(groupcols[0])
    for i in range(1, len(groupcols)):
        df[i] = groupcols[i]
    for i in range(len(groupcols)):
        df[i] = df[i].fillna(-1)
    return df.groupby(list(range(len(groupcols)))).grouper.group_info[0]

Outputs:

group_np([a,b]) -> [0, 1, 1, 2, 3, 4]
group_pd([a,b]) -> [0, 1, 1, 2, 3, 4]
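One caveat with the string-key approach in group_np (my own observation, not raised in the original post): joining bare str() values without a separator can make distinct rows collide once values have different digit counts. A minimal sketch, with the separator variant as an assumed fix:

```python
import numpy as np

a = np.array([1, 11])
b = np.array([11, 1])

# Without a separator, the distinct pairs (1, 11) and (11, 1)
# both produce the same key "111".
no_sep = [''.join(str(v) for v in row) for row in zip(a, b)]

# A separator keeps the keys distinct: "1|11" vs "11|1".
with_sep = ['|'.join(str(v) for v in row) for row in zip(a, b)]
```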

Is there a more efficient way of implementing this, ideally in pure numpy? The bottleneck currently seems to be building a vector with a unique value for each group - at the moment I am doing that by concatenating the values from each vector as strings.

I want this to work for any number of input vectors, which can have millions of elements.

Edit: here is another testcase:

a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])

Here, elements 2, 3, 4, and 7 should all be in the same group.

Edit2: adding some benchmarks.

a = np.random.randint(1, 1000, 30000000)
b = np.random.randint(1, 1000, 30000000)
c = np.random.randint(1, 1000, 30000000)

def group_np2(groupcols):
    _, groupby = np.unique(np.stack(groupcols), return_inverse=True, axis=1)
    return groupby

%timeit group_np2([a,b,c])
# 25.1 s +/- 1.06 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)
%timeit group_pd([a,b,c])
# 21.7 s +/- 646 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

After using np.stack on the arrays a and b, if you set the parameter return_inverse to True in np.unique, then its second output is exactly what you are looking for:

a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
_, inv = np.unique(np.stack([a,b]), axis=1, return_inverse=True)
print (inv)

array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)

and you can replace [a,b] in np.stack with a list of all the vectors.
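For instance, with a third vector (c here is an assumed example, not from the question), the same call groups on all three at once:

```python
import numpy as np

a = np.array([1, 2, 1, 1, 1, 2, 3, 1])
b = np.array([1, 2, 2, 2, 2, 3, 3, 2])
c = np.array([0, 1, 0, 0, 1, 1, 0, 0])

# One group id per column of the stacked (3, n) array.
_, inv = np.unique(np.stack([a, b, c]), axis=1, return_inverse=True)
```

Elements 2, 3, and 7 all carry the triple (1, 2, 0) and so share an id, while element 4 is (1, 2, 1) and gets a different one.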

Edit: a faster solution is to use np.unique on the sum of the arrays, where each array after the first is multiplied by the cumulative product (np.cumprod) of max + 1 of all previous arrays in groupcols, such as:

def group_np_sum(groupcols):
    groupcols_max = np.cumprod([ar.max()+1 for ar in groupcols[:-1]])
    return np.unique( sum([groupcols[0]] +
                          [ ar*m for ar, m in zip(groupcols[1:],groupcols_max)]), 
                      return_inverse=True)[1]
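The encoding behind group_np_sum can be checked by hand on the two test arrays: each pair (a[i], b[i]) maps to the integer a[i] + b[i] * (a.max() + 1), which is unique per distinct pair because a.max() + 1 exceeds every value of a. A small sketch of that arithmetic:

```python
import numpy as np

a = np.array([1, 2, 1, 1, 1, 2, 3, 1])
b = np.array([1, 2, 2, 2, 2, 3, 3, 2])

# a.max() + 1 == 4, so each pair (a[i], b[i]) gets the code a[i] + 4*b[i];
# two codes are equal exactly when the underlying pairs are equal.
code = a + b * (a.max() + 1)
```

Here code comes out as [5, 10, 9, 9, 9, 14, 15, 9], so np.unique(code, return_inverse=True) recovers the same groups as stacking.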

To check:

a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print (group_np_sum([a,b]))
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)

Note: the number associated with each group may not be the same (here I changed the first element of a to 3),

a = np.array([3,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np2([a,b]))
print (group_np_sum([a,b]))
array([3, 1, 0, 0, 0, 2, 4, 0], dtype=int64)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)

but the groups themselves are the same.
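To verify that two labelings describe the same partition, the ids can be relabeled by order of first appearance (canonicalize is a helper I am adding for illustration; it is not part of the original answer):

```python
import numpy as np

def canonicalize(ids):
    # Relabel group ids so groups are numbered 0, 1, 2, ... in order of
    # their first appearance; two labelings of the same partition then
    # compare equal elementwise.
    _, first, inv = np.unique(ids, return_index=True, return_inverse=True)
    # Rank each unique label by the position of its first occurrence
    rank = np.argsort(np.argsort(first))
    return rank[inv]
```

Applying it to the two outputs above gives the same relabeled vector for both.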

Now to check for timing:

a = np.random.randint(1, 100, 30000)
b = np.random.randint(1, 100, 30000)
c = np.random.randint(1, 100, 30000)
groupcols = [a,b,c]

%timeit group_pd(groupcols)
#13.7 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit group_np2(groupcols)
#34.2 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit group_np_sum(groupcols)
#3.63 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The numpy_indexed package (disclaimer: I am its author) covers these types of use cases:

import numpy_indexed as npi
npi.group_by((a, b))

Passing a tuple of index-arrays like this avoids creating a copy; but if you don't mind making the copy, you can use stacking as well:

npi.group_by(np.stack((a, b), axis=1))
