Fastest way to find unique rows of 2D NumPy array with Cython
I have a 2D NumPy array that could be of any type, but for this example we can assume it holds integers. I am looking for the fastest way to find all the unique rows in the array.
My initial strategy was to convert each row into a tuple and add it to a set. If the length of the set increased, a new unique row had been found.
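The set-based strategy can be sketched in pure Python before any Cython typing is added. This is only an illustration: the function name first_occurrence_mask and the plain list of rows are assumptions, not code from the question.

```python
def first_occurrence_mask(rows):
    """Return a list of booleans: True where a row is seen for the first time."""
    seen = set()
    mask = []
    for row in rows:
        key = tuple(row)                  # tuples are hashable, lists are not
        before = len(seen)
        seen.add(key)
        mask.append(len(seen) > before)   # the set grew, so this row is new
    return mask


rows = [[1, 2], [3, 4], [1, 2], [0, 0], [3, 4]]
print(first_occurrence_mask(rows))
```

Every variant discussed in this thread keeps this loop and changes only how the hashable key for a row is built.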
What I don't know how to do is quickly hash each row as bytes. There is a question where an entire array is hashed here.
What I tried - tuple creation
There are many ways to create a tuple, and each one impacts performance. Here is my function, for which I show 4 different variations:
def unique_int_tuple1(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')
    for i in range(nr):
        len_before = len(s)
        s.add(tuple(a[i]))  # THIS LINE IS CHANGED FOR ALL VERSIONS
        if len(s) > len_before:
            idx[i] = True
    return idx
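Version 1 is the function as written above. The other versions replace only the marked line.
Version 2:
    s.add(tuple([a[i, j] for j in range(nc)]))
Version 3 (vals is a list with length equal to the number of columns):
    for j in range(nc):
        vals[j] = a[i, j]
    s.add(tuple(vals))
Version 4:
    s.add((a[i, 0], a[i, 1], a[i, 2], a[i, 3]))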
a = np.random.randint(0, 8, (10**5, 4))
%timeit unique_int_tuple1(a)
125 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit unique_int_tuple2(a)
14.5 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit unique_int_tuple3(a)
11.7 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit unique_int_tuple4(a)
9.59 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoiding the tuple constructor (version 4) results in a nice performance gain.
tostring
From the linked SO question above, I can use the tostring method on each row and then hash the result.
def unique_int_tostring(ndarray[np.int64_t, ndim=2] a):
    cdef int i, j, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')
    for i in range(nr):
        len_before = len(s)
        s.add(a[i].tostring())  # a[i] builds a new array object for every row
        if len(s) > len_before:
            idx[i] = True
    return idx
This works but is very slow:
%timeit unique_int_tostring(a)
40 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A huge part of the slowdown, I believe, is the access of each row a[i]. We can use typed memoryviews to increase performance, but I don't know how to turn elements of typed memoryviews into strings so they can be hashed.
def unique_int_memoryview(long[:, :] a):
    cdef int i, j
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    for i in range(nr):
        s.add(<SOMETHING>)  # NO IDEA HERE
    return s
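One way to get hashable row keys without indexing a[i] for every row is to serialize the whole array once and slice the resulting bytes object. The sketch below is plain Python with NumPy rather than Cython, the helper name is made up for illustration, and it has not been benchmarked against the versions above. tobytes is the current name for what the question calls tostring.

```python
import numpy as np


def first_occurrence_mask_bytes(a):
    """Mark the first occurrence of each row, keyed by the row's raw bytes."""
    a = np.ascontiguousarray(a)          # rows must be laid out back to back
    nr, nc = a.shape
    row_bytes = a.dtype.itemsize * nc    # number of bytes in one row
    buf = a.tobytes()                    # a single copy of the whole array
    seen = set()
    mask = np.zeros(nr, dtype=bool)
    for i in range(nr):
        key = buf[i * row_bytes:(i + 1) * row_bytes]   # bytes slice is hashable
        before = len(seen)
        seen.add(key)
        mask[i] = len(seen) > before
    return mask


a = np.array([[1, 2], [3, 4], [1, 2], [0, 0], [3, 4]], dtype=np.int64)
print(first_occurrence_mask_bytes(a))
```

Each slice still copies one row's worth of bytes, so this removes the per-row array object but not the per-row copy.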
This, surprisingly to me, is slower, but for whatever it's worth, here is a C++ solution that does what you were pointing at: hash each row as a set of bytes. The 'trick' is taking the address of an element, <char*>&a[i, 0]; almost everything else is book-keeping.
I may be doing something obviously sub-optimal, and performance is likely better with a different hash table implementation.
Edit:
Re: how to create a string from a row. I think the best you could do is construct a bytes object from the pointer. This necessarily involves a copy of the row; see the C API docs.
%%cython
from numpy cimport *
cimport numpy as np
import numpy as np
from cpython.bytes cimport PyBytes_FromStringAndSize

def unique_int_string(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')
    cdef bytes string
    for i in range(nr):
        len_before = len(s)
        # copy the nc * 8 bytes of row i into a new Python bytes object
        string = PyBytes_FromStringAndSize(<char*>&a[i, 0], sizeof(np.int64_t) * nc)
        s.add(string)
        if len(s) > len_before:
            idx[i] = True
    return idx
// timing
In [9]: from unique import unique_ints
In [10]: %timeit unique_int_tuple4(a)
100 loops, best of 3: 10.1 ms per loop
In [11]: %timeit unique_ints(a)
100 loops, best of 3: 11.9 ms per loop
In [12]: (unique_ints(a) == unique_int_tuple4(a)).all()
Out[12]: True
// helper.h
#include <unordered_set>
#include <cstring>

// Hashes a fixed-size byte buffer (one array row) with an FNV-style loop.
struct Hasher {
    size_t size;
    size_t operator()(char* buf) const {
        // https://github.com/yt-project/yt/blob/c1569367c6e3d8d0a02e10d0f3d0bd701d2e2114/yt/utilities/lib/fnv_hash.pyx
        size_t hash_val = 2166136261;
        for (size_t i = 0; i < size; ++i) {
            // treat each byte as unsigned to avoid sign extension
            hash_val ^= static_cast<unsigned char>(buf[i]);
            hash_val *= 16777619;
        }
        return hash_val;
    }
};

// Two rows are equal when their bytes are identical.
struct Comparer {
    size_t size;
    bool operator()(char* lhs, char* rhs) const {
        return std::memcmp(lhs, rhs, size) == 0;
    }
};

// Set of row pointers; the array they point into must outlive the set.
struct ArraySet {
    std::unordered_set<char*, Hasher, Comparer> set;
    ArraySet (size_t size) : set(0, Hasher{size}, Comparer{size}) {}
    ArraySet () {}
    bool add(char* buf) {
        // insert() returns {iterator, inserted}; inserted is true for a new row
        auto p = set.insert(buf);
        return p.second;
    }
};
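The Hasher in helper.h follows the FNV-1a pattern: XOR in one byte, then multiply by a prime. A minimal Python sketch of the standard 32-bit variant is shown here for reference. Note the assumption: this version masks the state to 32 bits after every step, whereas the C++ code accumulates in a size_t without masking, so the two will generally not produce identical values.

```python
FNV_OFFSET_BASIS_32 = 2166136261   # same starting value as in helper.h
FNV_PRIME_32 = 16777619            # same multiplier as in helper.h


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:                          # iterating bytes yields ints 0..255
        h ^= byte                              # mix in the next byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF    # multiply, keep the low 32 bits
    return h


print(hex(fnv1a_32(b"a")))
```

Only the low byte of the state is touched by each XOR, so the multiplication is what spreads a byte's influence across the rest of the hash.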
// unique.pyx
from numpy cimport int64_t, uint8_t
import numpy as np

cdef extern from 'helper.h' nogil:
    cdef cppclass ArraySet:
        ArraySet()
        ArraySet(size_t)
        bint add(char*)

def unique_ints(int64_t[:, :] a):
    cdef:
        Py_ssize_t i, nr = a.shape[0], nc = a.shape[1]
        ArraySet s = ArraySet(sizeof(int64_t) * nc)
        uint8_t[:] idx = np.zeros(nr, dtype='uint8')
        bint found
    for i in range(nr):
        # address of the first element of row i; assumes each row is contiguous
        found = s.add(<char*>&a[i, 0])
        if found:
            idx[i] = True
    return idx
// setup.py
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

exts = [
    Extension('unique', ['unique.pyx'], language='c++', include_dirs=[np.get_include()])
]
setup(name='test', ext_modules=cythonize(exts))
You can use ndarray.view() to change the dtype to a byte string, and then use pandas.Series.duplicated() to find duplicated rows:
import numpy as np
import pandas as pd
a = np.random.randint(0, 5, size=(200, 3))
# view each row as one fixed-width byte string, then flag repeated rows
s = pd.Series(a.view(("S", a[0].nbytes))[:, 0])
s.duplicated()
The core algorithm of duplicated() is implemented in Cython. However, it needs to convert the original array to an object array, which may be slow.
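For comparison, the same first-occurrence mask can be produced with NumPy alone. This sketch swaps in np.unique with axis=0 and return_index=True in place of the pandas call; the helper name is an assumption, and because np.unique sorts the rows it is not necessarily faster than a hash-based approach.

```python
import numpy as np


def first_occurrence_mask_unique(a):
    """Mark the first occurrence of each distinct row using np.unique."""
    # return_index gives, for each unique row, the index of its first occurrence
    _, first_idx = np.unique(a, axis=0, return_index=True)
    mask = np.zeros(a.shape[0], dtype=bool)
    mask[first_idx] = True
    return mask


a = np.array([[1, 2], [3, 4], [1, 2], [0, 0], [3, 4]])
mask = first_occurrence_mask_unique(a)
print(mask)       # True marks rows seen for the first time
print(a[mask])    # the unique rows, in their original order
```

This mask is the logical negation of what duplicated() returns with its default keep='first'.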
To skip the object array, you can use the khash library that Pandas uses directly. Here is the C code:
#include "khash.h"

typedef struct _Buf{
    unsigned short n;
    char * pdata;
} Buf;

khint32_t kh_buf_hash_func(Buf key)
{
    int i;
    char * s;
    khint32_t hash = 0;
    s = key.pdata;
    for(i=0;i<key.n;i++)
    {
        hash += *s++;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

khint32_t kh_buf_hash_equal(Buf a, Buf b)
{
    int i;
    if(a.n != b.n) return 0;
    for(i=0;i<a.n;i++){
        if(a.pdata[i] != b.pdata[i]) return 0;
    }
    return 1;
}

KHASH_INIT(buf, Buf, char, 0, kh_buf_hash_func, kh_buf_hash_equal)

void duplicated(char * arr, int row_size, int count, char * res)
{
    kh_buf_t * khbuf;
    Buf row;
    int i, absent;
    khint_t k;
    row.n = row_size;
    khbuf = kh_init_buf();
    kh_resize_buf(khbuf, 4 * count);
    for(i=0;i<count;i++){
        row.pdata = &arr[i * row_size];
        k = kh_put_buf(khbuf, row, &absent);
        if (absent){
            res[i] = 0;
        }
        else{
            res[i] = 1;
        }
    }
    kh_destroy_buf(khbuf);
}
Then wrap the duplicated() function with Cython, ctypes, or cffi.
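The kh_buf_hash_func above has the shape of Bob Jenkins's one-at-a-time hash. A small Python sketch of that hash can help when checking a wrapper against known values. It assumes unsigned bytes and 32-bit wraparound; the C version reads through a plain char pointer, so on platforms where char is signed it can differ for bytes of 0x80 and above.

```python
def one_at_a_time(data: bytes) -> int:
    """32-bit one-at-a-time hash of a byte string."""
    mask = 0xFFFFFFFF                  # emulate unsigned 32-bit arithmetic
    h = 0
    for byte in data:
        h = (h + byte) & mask
        h = (h + (h << 10)) & mask
        h ^= h >> 6
    # final avalanche steps, applied once after all bytes are consumed
    h = (h + (h << 3)) & mask
    h ^= h >> 11
    h = (h + (h << 15)) & mask
    return h


print(hex(one_at_a_time(b"a")))
```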