Most efficient way to sort an array into bins specified by an index array?

Question

The task by example:

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx  = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])

Expected result:

binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])

Constraints:

Should be fast.
Should be O(n+k), where n is the length of data and k is the number of bins.
Should be stable, i.e. the order within bins is preserved.

Obvious solution

data[np.argsort(idx, kind='stable')]

is O(n log n).
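As a quick check of the task definition, here is the argsort one-liner applied to the example arrays from the question. The names `order` and `binned_argsort` are only illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])

# A stable argsort of the bin labels yields the positions grouped by bin,
# keeping the original order inside each bin.
order = np.argsort(idx, kind='stable')
binned_argsort = data[order]
print(binned_argsort)  # [2 6 3 4 7 8 1 5 9]
```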

O(n+k) solution

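def sort_to_bins(idx, data, mx=-1):
    # mx is the number of bins; infer it from idx if not given
    if mx==-1:
        mx = idx.max() + 1
    # cnts[b + 1] counts the items in bin b
    cnts = np.zeros(mx + 1, int)
    for i in range(idx.size):
        cnts[idx[i] + 1] += 1
    # prefix sum: cnts[b] becomes the start offset of bin b
    for i in range(1, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(data)
    # place each item at the next free slot of its bin (keeps input order)
    for i in range(data.size):
        res[cnts[idx[i]]] = data[i]
        cnts[idx[i]] += 1
    return res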

is loopy and slow.
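Part of that loop overhead can be moved into NumPy. The sketch below is a variant, not the code from the question: it computes the bin counts with np.bincount and the start offsets with np.cumsum, so only the final scatter still loops in Python. The function name `sort_to_bins_counting` and the parameter `k` are made up for illustration, and the sketch assumes `idx` holds non-negative integers below `k`.

```python
import numpy as np

def sort_to_bins_counting(idx, data, k=None):
    # k is the number of bins (hypothetical parameter name)
    if k is None:
        k = int(idx.max()) + 1
    # histogram of bin labels, computed in compiled code
    counts = np.bincount(idx, minlength=k)
    # exclusive prefix sum: offsets[b] is the first output slot of bin b
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))
    res = np.empty_like(data)
    # stable scatter: items are visited in input order
    for i in range(data.size):
        b = idx[i]
        res[offsets[b]] = data[i]
        offsets[b] += 1
    return res

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])
binned_counting = sort_to_bins_counting(idx, data)
print(binned_counting)  # [2 6 3 4 7 8 1 5 9]
```

The remaining loop is still O(n) Python-level work, so this mainly illustrates the structure of the counting sort rather than offering a fast path.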

Is there a better method in pure numpy < scipy < pandas < numba / pythran?

Answer 1

Here are a few solutions:

Use np.argsort anyway; after all, it is fast compiled code.
Use np.bincount to get the bin sizes and np.argpartition, which is O(n) for a fixed number of bins. Downside: currently, no stable algorithm is available, so we have to sort each bin.
Use scipy.ndimage.measurements.labeled_comprehension. This does roughly what is required, but I have no idea how it is implemented.
Use pandas. I'm a complete pandas noob, so what I cobbled together here using groupby may be suboptimal.
Use scipy.sparse. Switching between compressed sparse row and compressed sparse column formats happens to implement the exact operation we are looking for.
Use pythran (I'm sure numba works as well) on the loopy code in the question. All that is required is to insert the following at the top, after the numpy import:

#pythran export sort_to_bins(int[:], float[:], int)

and then compile

# pythran stb_pthr.py
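To make the np.bincount plus np.argpartition option from the list above more concrete, here is a minimal sketch on the example from the question. The variable names are illustrative. It relies on np.split returning views into `srt`, so sorting each piece in place restores the original order within each bin.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])
k = int(idx.max()) + 1

# bin boundaries: cumulative bin sizes without the last bin, here [2, 6]
split = np.bincount(idx, minlength=k)[:-1].cumsum()

# partition the positions so that each bin occupies its own contiguous
# range; the order inside a range is not guaranteed at this point
srt = np.argpartition(idx, split)

# sort the positions inside each bin to make the result stable
for piece in np.split(srt, split):
    piece.sort()  # in place, writes through to srt

binned_partition = data[srt]
print(binned_partition)  # [2 6 3 4 7 8 1 5 9]
```

Because `k` is taken from `idx.max() + 1`, the last bin is never empty, which keeps every split point a valid index for np.argpartition.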

Benchmarks (100 bins, variable number of items):

[Figure: log-log plot of run time in ms versus number of items, one curve per method]

Take-home message:

If you are OK with numba / pythran, that is the way to go; if not, scipy.sparse scales rather well.
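Why the scipy.sparse route works can be seen on the small example from the question. The sketch builds a CSR matrix with exactly one stored entry per row (row i holds data[i] in column idx[i]) and converts it to CSC, which stores entries column by column. That the rows come out in ascending order within each column, and hence that the result is stable, is an observed property of SciPy's conversion and is treated as an assumption here.

```python
import numpy as np
from scipy import sparse

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])
n = len(idx)
k = int(idx.max()) + 1

# indptr = [0, 1, ..., n] means every row stores exactly one entry
indptr = np.arange(n + 1)
m = sparse.csr_matrix((data, idx, indptr), shape=(n, k))

# the CSC layout groups the stored values by column, i.e. by bin
binned_sparse = m.tocsc().data
print(binned_sparse)  # [2 6 3 4 7 8 1 5 9]
```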

Code:

import numpy as np
from scipy import sparse
# the scipy.ndimage.measurements namespace is deprecated in recent SciPy
from scipy.ndimage import labeled_comprehension
from stb_pthr import sort_to_bins as sort_to_bins_pythran
import pandas as pd

def sort_to_bins_pandas(idx, data, mx=-1):
    df = pd.DataFrame.from_dict(data=data)
    out = np.empty_like(data)
    j = 0
    for grp in df.groupby(idx).groups.values():
        out[j:j+len(grp)] = data[np.sort(grp)]
        j += len(grp)
    return out

def sort_to_bins_ndimage(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    out = np.empty_like(data)
    j = 0
    def collect(bin):
        nonlocal j
        out[j:j+len(bin)] = np.sort(bin)
        j += len(bin)
        return 0
    labeled_comprehension(data, idx, np.arange(mx), collect, data.dtype, None)
    return out

def sort_to_bins_partition(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    return data[np.argpartition(idx, np.bincount(idx, None, mx)[:-1].cumsum())]

def sort_to_bins_partition_stable(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    split = np.bincount(idx, None, mx)[:-1].cumsum()
    srt = np.argpartition(idx, split)
    for bin in np.split(srt, split):
        bin.sort()
    return data[srt]

def sort_to_bins_sparse(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1    
    return sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc().data

def sort_to_bins_argsort(idx, data, mx=-1):
    return data[idx.argsort(kind='stable')]

from timeit import timeit
exmpls = [np.random.randint(0, K, (N,)) for K, N in np.c_[np.full(16, 100), 1<<np.arange(5, 21)]]

timings = {}
for idx in exmpls:
    data = np.arange(len(idx), dtype=float)
    ref = None
    for x, f in (*globals().items(),):
        if x.startswith('sort_to_bins_'):
            timings.setdefault(x.replace('sort_to_bins_', '').replace('_', ' '), []).append(timeit('f(idx, data, -1)', globals={'f':f, 'idx':idx, 'data':data}, number=10)*100)
            if x=='sort_to_bins_partition':
                continue
            if ref is None:
                ref = f(idx, data, -1)
            else:
                assert np.all(f(idx, data, -1)==ref)

import pylab
for k, v in timings.items():
    pylab.loglog(1<<np.arange(5, 21), v, label=k)
pylab.xlabel('#items')
pylab.ylabel('time [ms]')
pylab.legend()
pylab.show()
