numpy fastest way to transform an array's elements to their frequency

As the title says, I am looking for a way to transform an array into an array holding the frequency of each of its own elements.

I found np.count and np.histogram, but they are not what I am looking for.

Something like:

From:

array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])

To:

array_ = np.array([8,8,8,2,8,8,2,8,8,2,2,8])

Thanks in advance!

If the values in your array are non-negative integers that aren't too large, you can use np.bincount. Using your original array as an index into the bincount result gives your desired output.

>>> array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
>>> np.bincount(array_)
array([8, 2, 2])
>>> np.bincount(array_)[array_]
array([8, 8, 8, 2, 8, 8, 2, 8, 8, 2, 2, 8])

Bear in mind that the result of np.bincount has size max(array_) + 1, so if your array has large values this approach is inefficient: you end up creating a very large intermediate result.
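To make the size concern concrete, here is a small sketch (the array name `sparse` is illustrative, not from the question). It shows that the intermediate result grows with the largest value rather than with the number of elements, and that np.bincount rejects negative values outright:

```python
import numpy as np

# Only three elements, but the largest value is one million.
sparse = np.array([0, 5, 10**6])
counts = np.bincount(sparse)
print(counts.size)      # one slot per integer from 0 to max(sparse): 1000001
print(counts[sparse])   # the frequencies are still correct: [1 1 1]

# Negative values cannot be used with np.bincount at all.
try:
    np.bincount(np.array([-1, 0, 0]))
except ValueError as exc:
    print("ValueError:", exc)
```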

An alternative approach that should be efficient even with large or negative inputs is to use np.unique with the return_inverse and return_counts arguments, as follows:

>>> array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
>>> _, inv, counts = np.unique(array_, return_inverse=True, return_counts=True)
>>> counts[inv]
array([8, 8, 8, 2, 8, 8, 2, 8, 8, 2, 2, 8])
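As a quick illustration of the "large or negative inputs" case, the same pattern can be tried on values that np.bincount could not handle directly. The sample values here are made up for the example:

```python
import numpy as np

# Negative and very large values, chosen only for illustration.
values = np.array([-3, 10**12, -3, 7, 10**12, -3])

# `inverse` maps each element to the position of its value in the sorted
# unique values; `counts` holds how often each unique value occurs.
_, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
freq = counts[inverse]
print(freq)  # [3 2 3 1 2 3]
```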

Note that the return_counts argument is new in NumPy 1.9.0, so you'll need NumPy 1.9.0 or later. If you have an older version, all is not lost! You can still use the return_inverse argument of np.unique, which gives you back an array of small integers in the same arrangement as your original one. That new array is now in perfect shape for bincount to work on it efficiently:

>>> array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
>>> _, inverse = np.unique(array_, return_inverse=True)
>>> np.bincount(inverse)[inverse]
array([8, 8, 8, 2, 8, 8, 2, 8, 8, 2, 2, 8])

Another example, with larger array_ contents:

>>> array_ = np.array([0, 71, 598, 71, 0, 0, 243])
>>> _, inverse = np.unique(array_, return_inverse=True)
>>> inverse
array([0, 1, 3, 1, 0, 0, 2])
>>> np.bincount(inverse)[inverse]
array([3, 2, 1, 2, 3, 3, 1])
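One way to combine the approaches above is a small helper that takes the direct np.bincount path when the values allow it and falls back to np.unique otherwise. This is only a sketch: the function name `element_frequencies` and the `max_table_size` cutoff are hypothetical choices for illustration, not part of NumPy, and the input is assumed to be one-dimensional.

```python
import numpy as np

def element_frequencies(array, max_table_size=10**7):
    """Return, for each element, how often its value occurs in `array`.

    `max_table_size` is an arbitrary illustrative cutoff, not a NumPy setting.
    """
    array = np.asarray(array)
    if array.size == 0:
        return np.zeros(0, dtype=np.intp)
    # Fast path: small non-negative signed integers can index the
    # bincount table directly.
    if (np.issubdtype(array.dtype, np.signedinteger)
            and array.min() >= 0
            and array.max() < max_table_size):
        return np.bincount(array)[array]
    # General path: map the values to compact indices first, then count those.
    _, inverse = np.unique(array, return_inverse=True)
    return np.bincount(inverse)[inverse]

print(element_frequencies(np.array([0, 71, 598, 71, 0, 0, 243])))  # [3 2 1 2 3 3 1]
print(element_frequencies(np.array([-1, -1, 5])))                  # [2 2 1]
```

The cutoff trades memory for speed; a sensible value depends on your data, so treat the default here as a placeholder.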

All of these solutions work in pure NumPy, so they should be significantly more efficient than a solution that goes via a Python Counter or dict. As always, though, if efficiency is a concern then you should profile to find out what's most suitable. Note in particular that np.unique does a sort under the hood, so its theoretical complexity is higher than that of the pure np.bincount solution. Whether that makes a difference in practice is impossible to say without timing. So let's do some timing, using IPython's timeit (this is on Python 3.4). First we'll define functions for the operations we need:

In [1]: import numpy as np; from collections import Counter

In [2]: def freq_bincount(array):
   ...:     return np.bincount(array)[array]
   ...: 

In [3]: def freq_unique(array):
   ...:     _, inverse, counts = np.unique(array, return_inverse=True, return_counts=True)
   ...:     return counts[inverse]
   ...: 

In [4]: def freq_counter(array):
   ...:     c = Counter(array)
   ...:     return np.array(list(map(c.get, array)))
   ...: 

Now we create a test array:

In [5]: test_array = np.random.randint(100, size=10**6)

And then we do some timings. Here are the results on my machine:

In [6]: %timeit freq_bincount(test_array)
100 loops, best of 3: 2.69 ms per loop

In [7]: %timeit freq_unique(test_array)
10 loops, best of 3: 166 ms per loop

In [8]: %timeit freq_counter(test_array)
1 loops, best of 3: 317 ms per loop

There's more than an order-of-magnitude difference between the np.bincount approach and the np.unique approach (roughly 60 times in the timings above). The Counter approach from @Kasramvd's solution is about twice as slow as the np.unique approach, but that could change on a different machine or with different versions of Python and NumPy: you should test with data that are appropriate for your use-case.
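If you are not working in IPython, a comparable measurement can be sketched with the standard library's timeit module. The three functions are the ones defined above; the smaller array size, the fixed seed, and the repeat counts are arbitrary choices made so the script runs quickly, and the numbers it prints will differ from machine to machine.

```python
import timeit
from collections import Counter

import numpy as np

def freq_bincount(array):
    return np.bincount(array)[array]

def freq_unique(array):
    _, inverse, counts = np.unique(array, return_inverse=True, return_counts=True)
    return counts[inverse]

def freq_counter(array):
    c = Counter(array)
    return np.array(list(map(c.get, array)))

np.random.seed(0)  # fixed seed so repeated runs use the same data
test_array = np.random.randint(100, size=10**5)

# Check that all three approaches agree before timing them.
expected = freq_bincount(test_array)
assert np.array_equal(freq_unique(test_array), expected)
assert np.array_equal(freq_counter(test_array), expected)

for func in (freq_bincount, freq_unique, freq_counter):
    # Best of 3 repeats, each averaging over 5 calls.
    best = min(timeit.repeat(lambda: func(test_array), number=5, repeat=3)) / 5
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```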

As a fast approach you can use collections.Counter, which is the more Pythonic way to get the frequencies of the items of an iterable:

>>> import numpy as np
>>> array_ = np.array([0,0,0,1,0,0,2,0,0,1,2,0])
>>> from collections import Counter
>>> c = Counter(array_)
>>> np.array(list(map(c.get, array_)))  # list() is needed on Python 3, where map() is lazy
array([8, 8, 8, 2, 8, 8, 2, 8, 8, 2, 2, 8])
