
Fastest way to concatenate slices of numpy array

I have a large number of small numpy arrays (groups) of different sizes, and I want to concatenate an arbitrary subset of these groups as fast as possible. The solution I initially came up with was to store the groups as a np.array of np.arrays and then access a subset of groups with list indexing:

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size=size))
groups = np.array(groups, dtype=object)  # ragged groups, so an object array
indices = np.random.randint(len(groups), size=1000)

%%timeit
np.concatenate(groups[indices])
>>> 204 µs ± 395 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, this solution is inefficient in terms of memory consumption, because the groups are small (2 elements on average) and I have to store a numpy array structure for every group, which costs almost 100 bytes (too much for me).
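
As a rough check of that overhead (exact numbers vary with platform and NumPy version), one can compare the payload of a tiny array with its total in-memory size:

import sys
import numpy as np

tiny = np.zeros(2, dtype=np.int64)
print(tiny.nbytes)          # 16: the actual payload (2 x int64)
print(sys.getsizeof(tiny))  # typically well over 100: array object header + metadata + payload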

To make the solution more memory-efficient, I decided to concatenate all the groups and store the array boundaries in a separate array:

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])
# ith group is data[offsets[i]: offsets[i + 1]]

However, concatenation is not obvious at all now. Something like this:

%%timeit
np.concatenate([data[offsets[i]: offsets[i + 1]] for i in indices])
>>> 1.02 ms ± 44.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

runs 5 times slower than the original solution. I think this is due to two things. First, iteration over the numpy array of indices (Python wraps the C int of every index into an object). Second, Python creates a numpy array structure for every slice/index. I think it is impossible to reduce this concatenation time in pure Python, so I decided to come up with a Cython solution.

%%cython
import numpy as np
ctypedef long long int64

def concatenate(data, offsets, indices):
    cdef int64[::] data_view = data
    cdef int64[::] indices_view = indices
    cdef int64[::] offsets_view = offsets
    
    size = np.sum(offsets[indices + 1]) - np.sum(offsets[indices])
    res = np.zeros(size, dtype=np.int64)
    cdef int64[::] res_view = res
    
    cdef int64 i, l = 0, r
    for i in indices_view:
        r = l + offsets_view[i + 1] - offsets_view[i]
        res_view[l: r] = data_view[offsets_view[i]: offsets_view[i + 1]]
        l = r
    return res

%%timeit
concatenate(data, offsets, indices)
>>> 277 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This solution is faster than the previous one, but still a bit slower than the original one. The biggest problem, though, is that I don't know the data type in advance. I used int64 in the example, but it can be any numeric type, e.g. float32. Because of that, I can't use typed memoryviews the way I did. In theory, I only need to know the size of the type (4/8 bytes), and if I had pointers to the data and result arrays, I could use memcpy or something similar to copy the slices. But I don't know how to do that in Cython. Is there a way to do it?

Here is my pure numpy-only solution, the adv_concatenate() function. It gives a 15x-47x speedup (varies between machines) compared to the regular np.concatenate().

Note: there is a second, even faster solution after the first code below.

For time measurement the pip module timerit is used, installed via python -m pip install timerit. Two kinds of machines were used for timing: the first is Windows-based and the same for all tests (my home laptop); the second is Linux-based and a different machine for every test (so speeds differ between tests, but are the same within one run/test); it is the machine of the repl.it site that I used to test my code.

The idea of the algorithm is to use numpy's cumulative sum function (.cumsum()); a tiny worked example follows the list:

  1. We create an array of only 1s whose size equals the total size of the resulting concatenated data array. This array will hold the indices of all the data elements to be fetched in order to build the result.
  2. At the starting position of each chunk (small sub-array) within this array, we change the value in such a way that, after running cumsum() over the whole array, this starting value is transformed into the starting offset of that chunk in the data array. All the remaining values stay 1.
  3. We apply .cumsum() to this array. Now every value holds the correct index of a data element to be fetched.
  4. We form the resulting fetched data array by simply indexing data with the index array formed above.
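
To make these four steps concrete, here is a tiny worked example with made-up values (data, offsets, and indices below are illustrative only, not the benchmark arrays):

import numpy as np

data    = np.array([10, 11, 12, 13, 14, 15])
offsets = np.array([0, 2, 4, 6])   # three groups of two elements each
indices = np.array([2, 0])         # take group 2, then group 0

begs, ends = offsets[indices], offsets[indices + 1]   # [4 0], [6 2]
lens  = ends - begs                                    # [2 2]
clens = lens.cumsum()                                  # [2 4]

ix = np.ones(clens[-1], dtype = offsets.dtype)   # step 1: [ 1  1  1  1]
ix[0] = begs[0]                                  # step 2: [ 4  1  1  1]
ix[clens[:-1]] = begs[1:] - ends[:-1] + 1        # step 2: [ 4  1 -5  1]
ix = ix.cumsum()                                 # step 3: [ 4  5  0  1]

print(data[ix])   # step 4: [14 15 10 11] == group 2 followed by group 0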

This algorithm could probably get an even bigger boost if we precompute some values, like offsets[1:] - offsets[:-1] and offsets[:-1] + 1, and use them inside the adv_concatenate() function.
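
One possible reading of that hint (a sketch only; the name adv_concatenate_pre and the precomputed all_lens array are introduced here for illustration and are not part of the author's code) is to compute all group lengths once, outside the function, and reuse them on every call:

import numpy as np

# offsets as defined above; computed once up front
all_lens = offsets[1:] - offsets[:-1]   # length of every group

def adv_concatenate_pre(data, offsets, all_lens, indices):
    begs = offsets[indices]
    lens = all_lens[indices]
    ends = begs + lens                  # replaces the offsets[indices + 1] fancy lookup
    clens = lens.cumsum()
    ix = np.ones(clens[-1], dtype = offsets.dtype)
    ix[0] = begs[0]
    ix[clens[:-1]] = begs[1:] - ends[:-1] + 1
    return data[ix.cumsum()]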

Try it online!

# Needs: python -m pip install numpy timerit
from timerit import Timerit
import numpy as np
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size))
groups = np.array(groups, dtype = object)  # ragged groups, so an object array
indices = np.random.randint(len(groups), size = 1000)

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])

timer = lambda: Timerit(num = 600, verbose = 1)

print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

def adv_concatenate(data, offsets, indices):
    begs, ends = offsets[indices], offsets[indices + 1]
    lens = ends - begs
    clens = lens.cumsum()
    ix = np.ones((clens[-1],), dtype = offsets.dtype)
    ix[0] = begs[0]
    ix[clens[:-1]] = begs[1:] - ends[:-1] + 1
    ix = ix.cumsum()
    return data[ix]
    
print('adv_concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv = adv_concatenate(data, offsets, indices)
tadv = tim.mean()
assert np.array_equal(ref, adv) # Check that our solution is correct

print('speedup:', round(tref / tadv, 3))

Output:

On the first machine (Windows):

np.concatenate(): Timed best=3.129 ms, mean=3.225 +- 0.1 ms
adv_concatenate(): Timed best=191.137 us, mean=208.012 +- 20.7 us
speedup: 15.504

On the second machine (Linux):

np.concatenate(): Timed best=1.666 ms, mean=2.314 +- 0.4 ms
adv_concatenate(): Timed best=35.596 us, mean=48.680 +- 15.4 us
speedup: 47.532

The second solution is even faster than the first one, giving a 40x-150x speedup (varies between machines) compared to the regular np.concatenate(). But it uses the Numba JIT (LLVM-based) compiler, which needs to be installed via python -m pip install numba.

Although it uses the extra numba package, the central function adv_concatenate_indexes_numba() is very simple and has about as many lines of code as the first solution. The algorithm is also much simpler: just two plain loops.

The current solution works for any data type, because the central function only computes the resulting indices, so it never touches data's dtype at all. The current solution could be improved by another 10%-90% if the numba function computed the resulting data array directly instead of computing indices, but that only works for the fairly simple data types supported by numba, which include all numeric types. The code of this improved solution (linked in the original answer) achieves up to a 250x speedup! Timings of this improved version on the second machine (Linux):

np.concatenate(): Timed best=1.640 ms, mean=3.403 +- 1.9 ms
adv_concatenate_numba(): Timed best=12.669 us, mean=17.235 +- 6.9 us
speedup: 197.46
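
The linked code is not reproduced on this page, but a minimal sketch of what that direct-data variant could look like is shown below (assuming int64 data to match the test setup; the body is a reconstruction based on adv_concatenate_indexes_numba further below, not the author's exact linked code):

import numpy as np, numba

@numba.njit('i8[:](i8[:], i8[:], i8[:])', cache = True)
def adv_concatenate_numba(data, offsets, indices):
    # First pass: compute the total length of the result.
    tlen = 0
    for i in range(indices.size):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]

    # Second pass: copy the selected slices straight into the result array,
    # skipping the intermediate index array entirely.
    pos, r = 0, np.empty(tlen, dtype = np.int64)
    for i in range(indices.size):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            r[pos] = data[j]
            pos += 1

    return r

# Usage (hypothetical): adv = adv_concatenate_numba(data, offsets, indices)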

Next comes the code of the more universal (index-computing only) solution:

Try it online!

# Needs: python -m pip install numpy numba timerit
from timerit import Timerit
import numpy as np, numba
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
groups = np.array(groups, dtype = object)  # ragged groups, so an object array
indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)

timer = lambda: Timerit(num = 600, verbose = 1)

print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

@numba.njit('i8[:](i8[:], i8[:])', cache = True)
def adv_concatenate_indexes_numba(offsets, indices):
    tlen = 0
    for i in range(indices.size):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]
        
    pos, r = 0, np.empty((tlen,), dtype = offsets.dtype)
    for i in range(indices.size):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            r[pos] = j
            pos += 1
            
    return r
    
def adv_concatenate2(data, offsets, indices):
    return data[adv_concatenate_indexes_numba(offsets, indices)]
    
adv_concatenate2(data, offsets, indices) # Once pre-compile Numba
print('adv_concatenate2(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv = adv_concatenate2(data, offsets, indices)
tadv = tim.mean()
assert np.array_equal(ref, adv) # Check that our solution is correct

print('speedup:', round(tref / tadv, 3))

Output:

On the first machine (Windows):

np.concatenate(): Timed best=3.201 ms, mean=3.356 +- 0.1 ms
adv_concatenate2(): Timed best=79.681 us, mean=82.991 +- 6.7 us
speedup: 40.442

On the second machine (Linux):

np.concatenate(): Timed best=1.541 ms, mean=2.220 +- 0.7 ms
adv_concatenate2(): Timed best=12.012 us, mean=14.830 +- 4.8 us
speedup: 149.716

Inspired by the Cython code from @pavelgramovich's answer, I also decided to implement a simplified version using a plain loop (function concatenate1()) instead of the memcpy() version (function concatenate0()); the simplified version appears to be 1.5x-2x faster than the memcpy version for the current test data. The full code comparing both versions follows:

Try it online!

# Needs: python -m pip install numpy timerit cython setuptools
from timerit import Timerit
import numpy as np
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
groups = np.array(groups, dtype = object)  # ragged groups, so an object array
indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)

timer = lambda: Timerit(num = 600, verbose = 1)

def compile_cy_cats():
    src = """
import numpy as np
cimport numpy as np
cimport cython
from libc.string cimport memcpy 

@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate0(np.ndarray data, np.ndarray offsets, np.ndarray indices):
    data = np.ascontiguousarray(data)
    start_offsets = np.ascontiguousarray(offsets[indices], dtype=np.int64)
    end_offsets = np.ascontiguousarray(offsets[indices + 1], dtype=np.int64)
    cdef np.int64_t[::1] coffsets = start_offsets
    cdef np.int64_t[::1] csizes = end_offsets - start_offsets
    
    cdef np.int64_t i, total_size = 0
    for i in range(csizes.shape[0]):
        total_size += csizes[i]
    res = np.empty(total_size, dtype=data.dtype)

    cdef np.ndarray cdata = data
    cdef np.ndarray cres = res
    
    cdef np.int64_t itemsize = data.itemsize
    cdef np.int64_t res_offset = 0
    for i in range(csizes.shape[0]):
        memcpy(cres.data + res_offset * itemsize, 
               cdata.data + coffsets[i] * itemsize, 
               csizes[i] * itemsize)
        res_offset += csizes[i]
    
    return res

@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate1(np.int64_t[:] data, np.int64_t[:] offsets, np.int64_t[:] indices):
    cdef np.int64_t tlen = 0, pos = 0, ix = 0, ixs = indices.size, i = 0, j = 0
    
    for i in range(ixs):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]
        
    r = np.empty(tlen, dtype = np.int64)
    cdef np.int64_t[:] cr = r, cdata = data

    for i in range(ixs):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            cr[pos] = cdata[j]
            pos += 1
    
    return r
    """
    
    srcb = src.encode('utf-8')
    
    import hashlib, os, glob, importlib
    srch = hashlib.sha256(srcb).hexdigest().upper()[:8]

    if len(glob.glob(f'cy{srch}*')) == 0:
        with open(f'cys{srch}.pyx', 'wb') as f:
            f.write(srcb)

        import sys
        from setuptools import setup, Extension
        from Cython.Build import cythonize
        import numpy as np

        sys.argv += ['build_ext', '--inplace']
        setup(
            ext_modules = cythonize(
                Extension(f'cy{srch}', [f'cys{srch}.pyx']), language_level = 3, annotate = True,
            ),
            include_dirs = [np.get_include()],
        )
        del sys.argv[-2:]

    print('Cython module:', f'cy{srch}')
    return importlib.import_module(f'cy{srch}')

cy_cats = compile_cy_cats()
concatenate0, concatenate1 = cy_cats.concatenate0, cy_cats.concatenate1

print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

concatenate0(data, offsets, indices) # Maybe pre-heat
print('cy_concatenate0(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv0 = concatenate0(data, offsets, indices)
tadv0 = tim.mean()
assert np.array_equal(ref, adv0) # Check that our solution is correct

print('speedup:', round(tref / tadv0, 3))

concatenate1(data, offsets, indices) # Maybe pre-heat
print('cy_concatenate1(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv1 = concatenate1(data, offsets, indices)
tadv1 = tim.mean()
assert np.array_equal(ref, adv1) # Check that our solution is correct

print('speedup:', round(tref / tadv1, 3))

Output:

First machine (Windows):

Cython module: cy0BEBA0C8
np.concatenate(): Timed best=3.184 ms, mean=3.263 +- 0.1 ms
cy_concatenate0(): Timed best=119.767 us, mean=128.688 +- 10.7 us
speedup: 25.354
cy_concatenate1(): Timed best=86.525 us, mean=93.699 +- 20.5 us
speedup: 34.821

Second machine (Linux):

Cython module: cy0BEBA0C8
np.concatenate(): Timed best=1.630 ms, mean=2.215 +- 0.5 ms
cy_concatenate0(): Timed best=21.839 us, mean=28.930 +- 8.4 us
speedup: 76.578
cy_concatenate1(): Timed best=11.447 us, mean=15.263 +- 5.1 us
speedup: 145.151

I found a way to concatenate slices of an array of arbitrary dtype with Cython. The Python numpy.ndarray class has a corresponding C structure. It contains the pointer data to the beginning of the underlying C array and the attribute itemsize, which stores the size of a single element in bytes. With these, slices can be concatenated using memcpy:

import numpy as np
cimport numpy as np
cimport cython
from libc.string cimport memcpy 

@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate(np.ndarray data, np.ndarray offsets, np.ndarray indices):
    data = np.ascontiguousarray(data)
    start_offsets = np.ascontiguousarray(offsets[indices], dtype=np.int64)
    end_offsets = np.ascontiguousarray(offsets[indices + 1], dtype=np.int64)
    cdef np.int64_t[::1] coffsets = start_offsets
    cdef np.int64_t[::1] csizes = end_offsets - start_offsets
    
    cdef np.int64_t i, total_size = 0
    for i in range(csizes.shape[0]):
        total_size += csizes[i]
    res = np.empty(total_size, dtype=data.dtype)

    cdef np.ndarray cdata = data
    cdef np.ndarray cres = res
    
    cdef np.int64_t itemsize = data.itemsize
    cdef np.int64_t res_offset = 0
    for i in range(csizes.shape[0]):
        memcpy(cres.data + res_offset * itemsize, 
               cdata.data + coffsets[i] * itemsize, 
               csizes[i] * itemsize)
        res_offset += csizes[i]
    
    return res

%%timeit
concatenate(data, offsets, indices)
>>> 21.1 µs ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
