NumPy: fastest way to transform an array's elements to their frequency
As the title says, I am looking for a way to transform an array into the array of frequencies of its own elements. I found np.count and np.histogram, but they are not what I am looking for.
Something like this. From:
array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
To:
array_ = np.array([8,8,8,2,8,8,2,8,8,2,2,8])
Thanks in advance!
If the values in your array are nonnegative integers which aren't too large, you can use np.bincount. Using your original array as an index into the bincount result gives your desired output.
>>> array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
>>> np.bincount(array_)
array([8, 2, 2])
>>> np.bincount(array_)[array_]
array([8, 8, 8, 2, 8, 8, 2, 8, 8, 2, 2, 8])
Bear in mind that the result of np.bincount has size max(array_) + 1, so if your array has large values this approach is inefficient: you end up creating a very large intermediate result.
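To make the size caveat concrete, here is a small sketch (the array names are made up for illustration). It shows how the length of the np.bincount result follows the largest value rather than the number of elements, and that negative values are rejected:

```python
import numpy as np

small = np.array([0, 1, 2, 1])
large = np.array([0, 1, 10**6])  # only three elements, but one huge value

# The result has one slot for every integer from 0 up to the maximum value.
small_counts = np.bincount(small)
large_counts = np.bincount(large)
print(small_counts.size)  # 3
print(large_counts.size)  # 1000001

# Negative values are not accepted at all: np.bincount raises ValueError.
try:
    np.bincount(np.array([-1, 0, 1]))
    negative_ok = True
except ValueError:
    negative_ok = False
print(negative_ok)  # False
```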
An alternative approach that should be efficient even with large or negative inputs is to use np.unique with the return_inverse and return_counts arguments, as follows:
>>> array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
>>> _, inv, counts = np.unique(array_, return_inverse=True, return_counts=True)
>>> counts[inv]
array([8, 8, 8, 2, 8, 8, 2, 8, 8, 2, 2, 8])
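As a rough sketch, the same idea can be wrapped in a helper. The name element_frequencies is hypothetical, not something NumPy provides. Because np.unique only needs values that can be sorted, this should also cope with negative numbers, very large numbers, and strings:

```python
import numpy as np

def element_frequencies(values):
    """Return an array in which each element is replaced by its count."""
    values = np.asarray(values)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    # ravel() keeps this working whether `inverse` comes back flat
    # or shaped like the input; the result takes the input's shape.
    return counts[inverse.ravel()].reshape(values.shape)

print(element_frequencies([-5, 10**12, -5, 3]).tolist())   # [2, 1, 2, 1]
print(element_frequencies(["a", "b", "a", "a"]).tolist())  # [3, 1, 3, 3]
```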
Note that the return_counts argument was added in NumPy 1.9.0, so you'll need NumPy 1.9.0 or later. If you have an older version, all is not lost! You can still use the return_inverse argument of np.unique, which gives you back an array of small integers in the same arrangement as your original one. That new array is now in perfect shape for bincount to work on it efficiently:
>>> array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
>>> _, inverse = np.unique(array_, return_inverse=True)
>>> np.bincount(inverse)[inverse]
array([8, 8, 8, 2, 8, 8, 2, 8, 8, 2, 2, 8])
Another example, with larger array_ contents:
>>> array_ = np.array([0, 71, 598, 71, 0, 0, 243])
>>> _, inverse = np.unique(array_, return_inverse=True)
>>> inverse
array([0, 1, 3, 1, 0, 0, 2])
>>> np.bincount(inverse)[inverse]
array([3, 2, 1, 2, 3, 3, 1])
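To see why the inverse array suits bincount, it may help to look at what it contains. In the sketch below (variable names are illustrative), each entry of inverse is the position of the original value within the sorted unique values. Indexing the unique values with it rebuilds the input, and its largest entry is bounded by the number of distinct values, not by their magnitude:

```python
import numpy as np

array_ = np.array([0, 71, 598, 71, 0, 0, 243])
uniques, inverse = np.unique(array_, return_inverse=True)

print(uniques.tolist())           # [0, 71, 243, 598]
print(uniques[inverse].tolist())  # [0, 71, 598, 71, 0, 0, 243], the original array
# The largest index is (number of distinct values - 1), so the
# intermediate bincount result has only uniques.size entries.
print(int(inverse.max()) + 1 == uniques.size)  # True
```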
All of these solutions work in pure NumPy, so they should be significantly more efficient than a solution that goes via a Python Counter or dict. As always, though, if efficiency is a concern then you should profile to find out what's most suitable. Note in particular that np.unique does a sort under the hood, so its theoretical complexity is higher than that of the pure np.bincount solution. Whether that makes a difference in practice is impossible to say without timing.
So let's do some timing, using IPython's timeit (this is on Python 3.4). First we'll define functions for the operations we need:
In [1]: import numpy as np; from collections import Counter
In [2]: def freq_bincount(array):
   ...:     return np.bincount(array)[array]
   ...:
In [3]: def freq_unique(array):
   ...:     _, inverse, counts = np.unique(array, return_inverse=True, return_counts=True)
   ...:     return counts[inverse]
   ...:
In [4]: def freq_counter(array):
   ...:     c = Counter(array)
   ...:     return np.array(list(map(c.get, array)))
   ...:
Now we create a test array:
In [5]: test_array = np.random.randint(100, size=10**6)
And then we do some timings. Here are the results on my machine:
In [6]: %timeit freq_bincount(test_array)
100 loops, best of 3: 2.69 ms per loop
In [7]: %timeit freq_unique(test_array)
10 loops, best of 3: 166 ms per loop
In [8]: %timeit freq_counter(test_array)
1 loops, best of 3: 317 ms per loop
There's more than an order-of-magnitude difference between the np.bincount approach and the np.unique approach (roughly 60x in this run). The Counter approach from @Kasramvd's solution is about twice as slow as the np.unique approach here, but that could change on a different machine or with different versions of Python and NumPy: you should test with data that are appropriate for your use-case.
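If you are not working in IPython, a comparable measurement can be sketched with the standard library's timeit module. This is only an illustrative script: the array size is reduced to 10**5 so it finishes quickly, the repeat counts are arbitrary, and the numbers it prints will depend on your machine and library versions. It first checks that the three functions agree before timing them.

```python
import timeit
from collections import Counter

import numpy as np

def freq_bincount(array):
    return np.bincount(array)[array]

def freq_unique(array):
    _, inverse, counts = np.unique(array, return_inverse=True, return_counts=True)
    return counts[inverse]

def freq_counter(array):
    c = Counter(array)
    return np.array(list(map(c.get, array)))

# Smaller than the 10**6 elements used above, purely to keep the run short.
test_array = np.random.randint(100, size=10**5)

# Sanity check: all three approaches should produce identical output.
expected = freq_bincount(test_array)
results_agree = all(
    np.array_equal(expected, func(test_array))
    for func in (freq_unique, freq_counter)
)
print("results agree:", results_agree)

for func in (freq_bincount, freq_unique, freq_counter):
    # Best of 3 repeats, each running the function 3 times.
    best = min(timeit.repeat(lambda: func(test_array), number=3, repeat=3)) / 3
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```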
As a fast approach you can use collections.Counter, which is a more Pythonic way of getting the frequency of an iterable's items:
>>> import numpy as np
>>> array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
>>> from collections import Counter
>>> c=Counter(array_)
>>> np.array(list(map(c.get, array_)))  # list() is needed on Python 3, where map returns an iterator
array([8, 8, 8, 2, 8, 8, 2, 8, 8, 2, 2, 8])