Fastest way to find unique rows of a 2D NumPy array with Cython

Question

I have a 2D NumPy array that could be of any type, but for this example we can assume it holds integers. I am looking for the fastest way to find all the unique rows in the array.

My initial strategy was to convert each row into a tuple and add it to a set. If the length of the set increased, a unique row had been found.
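As a point of reference, the strategy can be sketched in plain Python before any Cython typing is added. The function name `unique_row_mask` is made up for this illustration; it accepts any sequence of rows, and a 2D NumPy array should work the same way.

```python
def unique_row_mask(rows):
    """Return a list of booleans: True where a row appears for the first time."""
    seen = set()
    mask = []
    for row in rows:
        len_before = len(seen)
        seen.add(tuple(row))                 # tuples are hashable, lists are not
        mask.append(len(seen) > len_before)  # the set grew, so this row is new
    return mask


rows = [[1, 2], [3, 4], [1, 2], [5, 6], [3, 4]]
print(unique_row_mask(rows))  # [True, True, False, True, False]
```

This is only a slow baseline; it is handy for checking that a faster version flags the same rows.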

What I don't know how to do is quickly hash each row as bytes. There is a related question in which an entire array is hashed.

What I tried - tuple creation

There are many ways to create a tuple, and each one impacts performance. Here is my function, of which I show 4 variations:

Version 1:

def unique_int_tuple1(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')

    for i in range(nr):
        len_before = len(s)
        s.add(tuple(a[i]))        # THIS LINE IS CHANGED FOR ALL VERSIONS
        if len(s) > len_before:
            idx[i] = True
    return idx

Version 2:

s.add(tuple([a[i, j] for j in range(nc)]))

Version 3:

vals is a list with length equal to the number of columns.

for j in range(nc):
    vals[j] = a[i, j]
s.add(tuple(vals))

Version 4:

s.add((a[i, 0], a[i, 1], a[i, 2], a[i, 3]))

Performance

a = np.random.randint(0, 8, (10**5, 4))
%timeit unique_int_tuple1(a)
125 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit unique_int_tuple2(a)
14.5 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit unique_int_tuple3(a)
11.7 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit unique_int_tuple4(a)
9.59 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoiding the tuple constructor (version 4) results in a nice performance gain.

Using `tostring`

From the linked SO question above, I can use the tostring method on each row and then hash the result.

def unique_int_tostring(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')

    for i in range(nr):
        len_before = len(s)
        s.add(a[i].tostring())
        if len(s) > len_before:
            idx[i] = True
    return idx

This works, but it is much slower than version 4:

%timeit unique_int_tostring(a)
40 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using typed memoryview

A huge part of the slowdown, I believe, is the access of each row a[i]. We can use typed memoryviews to increase performance, but I don't know how to turn elements of typed memoryviews into strings so they can be hashed.

def unique_int_memoryview(long[:, :] a):
    cdef int i, j
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    for i in range(nr):
        s.add(<SOMETHING>)   # NO IDEA HERE
    return s

Answer 1

This, surprisingly to me, is slower, but for whatever it's worth, here is a C++ solution that does what you were pointing at: hash each row as a set of bytes. The 'trick' is taking the address of an element, <char*>&a[i, 0]; almost everything else is book-keeping.

I may be doing something obviously sub-optimal, and performance is likely better with a different hash table implementation.

Edit:

Re: how to create a string from a row. I think the best you can do is construct a bytes object from the pointer. This necessarily involves a copy of the row; see the C API docs.

%%cython
from numpy cimport *
cimport numpy as np
import numpy as np
from cpython.bytes cimport PyBytes_FromStringAndSize

def unique_int_string(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')
    cdef bytes string

    for i in range(nr):
        len_before = len(s)
        string = PyBytes_FromStringAndSize(<char*>&a[i, 0], sizeof(np.int64_t) * nc)
        s.add(string)
        if len(s) > len_before:
            idx[i] = True
    return idx

// timing

In [9]: from unique import unique_ints

In [10]: %timeit unique_int_tuple4(a)
100 loops, best of 3: 10.1 ms per loop

In [11]: %timeit unique_ints(a)
100 loops, best of 3: 11.9 ms per loop

In [12]: (unique_ints(a) == unique_int_tuple4(a)).all()
Out[12]: True

// helper.h

#include <unordered_set>
#include <cstring>

struct Hasher {
    size_t size;
    size_t operator()(char* buf) const {
        // https://github.com/yt-project/yt/blob/c1569367c6e3d8d0a02e10d0f3d0bd701d2e2114/yt/utilities/lib/fnv_hash.pyx
        size_t hash_val = 2166136261;
        for (size_t i = 0; i < size; ++i) {
            hash_val ^= buf[i];
            hash_val *= 16777619;
        }
        return hash_val;
    }
};
struct Comparer {
    size_t size;
    bool operator()(char* lhs, char* rhs) const {
        return (std::memcmp(lhs, rhs, size) == 0) ? true : false;
    }
};

struct ArraySet {
    std::unordered_set<char*, Hasher, Comparer> set;

    ArraySet (size_t size) : set(0, Hasher{size}, Comparer{size}) {}
    ArraySet () {}

    bool add(char* buf) {
        auto p = set.insert(buf);
        return p.second;
    }
};
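
The loop in Hasher follows the FNV-1a pattern: XOR in a byte, then multiply by the FNV prime. A minimal Python sketch of the standard 32-bit variant is given here for illustration only. Two assumptions to note: the name `fnv1a_32` is hypothetical, and the explicit 32-bit mask is not in the header above, which keeps the running value in a size_t and XORs possibly signed char values, so the two will not necessarily produce identical numbers on a 64-bit build.

```python
FNV_OFFSET_BASIS_32 = 2166136261  # same starting value as in Hasher
FNV_PRIME_32 = 16777619           # same multiplier as in Hasher


def fnv1a_32(buf: bytes) -> int:
    """Standard 32-bit FNV-1a over a bytes-like object."""
    hash_val = FNV_OFFSET_BASIS_32
    for byte in buf:
        hash_val ^= byte
        hash_val = (hash_val * FNV_PRIME_32) & 0xFFFFFFFF  # keep 32 bits
    return hash_val


# Two rows of 64-bit integers with identical bytes must hash identically.
row_a = (1).to_bytes(8, "little") + (2).to_bytes(8, "little")
row_b = (1).to_bytes(8, "little") + (2).to_bytes(8, "little")
print(fnv1a_32(row_a) == fnv1a_32(row_b))
```

Equal hashes alone do not prove equal rows, which is why the header pairs Hasher with a Comparer that runs memcmp on the full row.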

// unique.pyx

from numpy cimport int64_t, uint8_t
import numpy as np

cdef extern from 'helper.h' nogil:
    cdef cppclass ArraySet:
        ArraySet()
        ArraySet(size_t)
        bint add(char*)


def unique_ints(int64_t[:, :] a):
    cdef:
        Py_ssize_t i, nr = a.shape[0], nc = a.shape[1]
        ArraySet s = ArraySet(sizeof(int64_t) * nc)
        uint8_t[:] idx = np.zeros(nr, dtype='uint8')

        bint found

    for i in range(nr):
        found = s.add(<char*>&a[i, 0])
        if found:
            idx[i] = True

    return idx

// setup.py

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

exts = [
  Extension('unique', ['unique.pyx'], language='c++', include_dirs=[np.get_include()])
]

setup(name='test', ext_modules=cythonize(exts))

Answer 2

You can use ndarray.view() to change the dtype to a byte string, and then use pandas.Series.duplicated() to find duplicated rows:

import numpy as np
import pandas as pd

a = np.random.randint(0, 5, size=(200, 3))
s = pd.Series(a.view(("S", a[0].nbytes))[:, 0])
s.duplicated()

The core algorithm of duplicated() is implemented in Cython. However, it needs to convert the original array to an object array, which may be slow.

To skip the object array, you can use the khash library used by pandas directly. Here is the C code:

#include "khash.h"

typedef struct _Buf{
    unsigned short n;
    char * pdata;
} Buf;

khint32_t kh_buf_hash_func(Buf key)
{
    int i;
    char * s;
    khint32_t hash = 0;
    s = key.pdata;
    for(i=0;i<key.n;i++)
    {
        hash += *s++;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);    
    return hash;
}

khint32_t kh_buf_hash_equal(Buf a, Buf b)
{
    int i;
    if(a.n != b.n) return 0;
    for(i=0;i<a.n;i++){
        if(a.pdata[i] != b.pdata[i]) return 0;
    }
    return 1;
}

KHASH_INIT(buf, Buf, char, 0, kh_buf_hash_func, kh_buf_hash_equal)


void duplicated(char * arr, int row_size, int count, char * res)
{
    kh_buf_t * khbuf;
    Buf row;
    int i, absent;
    khint_t k;
    row.n = row_size;

    khbuf = kh_init_buf();
    kh_resize_buf(khbuf, 4 * count);

    for(i=0;i<count;i++){
        row.pdata = &arr[i * row_size];
        k = kh_put_buf(khbuf, row, &absent);
        if (absent){
            res[i] = 0;
        }
        else{
            res[i] = 1;
        }
    }    
    kh_destroy_buf(khbuf);
}

Then wrap the duplicated() function with Cython, ctypes, or cffi.
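
Before wrapping the C function, it can help to have a slow reference with the same semantics to test the wrapper against. The sketch below is a pure-Python stand-in, not the khash code itself: it walks a flat byte buffer in row_size chunks and writes 1 for a row whose bytes were seen earlier, 0 otherwise. The name `duplicated_reference` is hypothetical.

```python
from array import array


def duplicated_reference(buf, row_size):
    """Mirror duplicated(): 0 for the first occurrence of a row, 1 for repeats."""
    raw = bytes(buf)                  # one copy of the whole buffer as raw bytes
    count = len(raw) // row_size      # number of complete rows in the buffer
    seen = set()
    res = bytearray(count)            # starts out as all zeros
    for i in range(count):
        row = raw[i * row_size:(i + 1) * row_size]
        if row in seen:
            res[i] = 1
        else:
            seen.add(row)
    return res


# Three rows of two 64-bit integers each; the third repeats the first.
data = array("q", [1, 2, 3, 4, 1, 2])
print(list(duplicated_reference(data, 2 * data.itemsize)))  # [0, 0, 1]
```

For a C-contiguous NumPy array a, passing a.tobytes() with a row size of a.shape[1] * a.itemsize should give a mask comparable to the one the wrapped C function writes into res.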
