How to use numpy to get the cumulative count by unique values in linear time?
Consider the following lists, short_list and long_list:
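import numpy as np
import pandas as pd
from string import ascii_letters

short_list = list('aaabaaacaaadaaac')
np.random.seed([3,1415])
# 10000 random two-letter strings
long_list = pd.DataFrame(
    np.random.choice(list(ascii_letters), (10000, 2))
).sum(1).tolist()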
How do I calculate the cumulative count by unique value?
I want to use numpy and do it in linear time, so that I can compare timings with my other methods. It may be easiest to illustrate with my first proposed solution:
def pir1(l):
    s = pd.Series(l)
    return s.groupby(s).cumcount().tolist()
print(np.array(short_list))
print(pir1(short_list))
['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]
I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could use these to get at a solution. The best I got is pir4 below, which scales in quadratic time. Also note that I don't care whether counts start at 1 or 0, as we can simply add or subtract 1.
Below are some of my attempts (none of which answer my question):
%%cython
from collections import defaultdict

def get_generator(l):
    counter = defaultdict(lambda: -1)
    for i in l:
        counter[i] += 1
        yield counter[i]

def pir2(l):
    return [i for i in get_generator(l)]

def pir3(l):
    return [i for i in get_generator(l)]

def pir4(l):
    unq, inv = np.unique(l, return_inverse=True)
    a = np.arange(len(unq))
    # one row per unique value, one column per element
    matches = a[:, None] == inv
    return (matches * matches.cumsum(1)).sum(0).tolist()
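To make the np.unique outputs mentioned in the question concrete, here is a small sketch that requests all three optional arrays for short_list and prints them. The last lines rebuild the comparison matrix that pir4 creates, only to show its shape; nothing here is meant as a solution.

```python
import numpy as np

short_list = list('aaabaaacaaadaaac')

# Ask np.unique for all three optional outputs at once.
unq, idx, inv, cnt = np.unique(
    short_list, return_index=True, return_inverse=True, return_counts=True
)

print(unq)  # sorted unique values
print(idx)  # index of the first occurrence of each unique value
print(inv)  # for every element, the position of its value in unq
print(cnt)  # number of occurrences of each unique value

# pir4 compares every unique value with every element, so it builds a
# (number of uniques) x (number of elements) boolean matrix.
matches = np.arange(len(unq))[:, None] == inv
print(matches.shape)
```

The size of that matrix is the product of the two lengths, which is why pir4 becomes expensive when the number of unique values grows along with the input.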
Here's a vectorized approach using a custom grouped-range creating function and np.unique for getting the counts -
def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return id_arr.cumsum()

count = np.unique(A,return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]
Sample run -
In [117]: A = list('aaabaaacaaadaaac')
In [118]: count = np.unique(A,return_counts=1)[1]
...: out = grp_range(count)[np.argsort(A).argsort()]
...:
In [119]: out
Out[119]: array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
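The two ingredients of this approach can be looked at separately: grp_range turns the counts into back-to-back ranges, and the double argsort gives each element its position in the sorted array. The sketch below prints both intermediate arrays. The kind='stable' argument is an assumption added here, not part of the answer's code; NumPy's default sort kind does not promise to keep ties in their original order, and the result depends on that order.

```python
import numpy as np

def grp_range(a):
    # Same function as in the answer: concatenated ranges 0..a[k]-1.
    idx = a.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1] + 1
    return id_arr.cumsum()

A = np.array(list('aaabaaacaaadaaac'))
count = np.unique(A, return_counts=True)[1]

ranges = grp_range(count)
# kind='stable' is an assumption added here so ties keep their original order.
order = np.argsort(A, kind='stable')
rank = np.argsort(order)   # position of each element in the sorted array
out = ranges[rank]

print(ranges)
print(rank)
print(out)
```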
For getting the count, a few other alternatives could be proposed with a focus on performance -
np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.frombuffer(b'aaabaaacaaadaaac',dtype=np.uint8)-97)  # np.fromstring's binary mode is deprecated
Additionally, with A containing single-letter characters, we could get the count simply with -
np.bincount(np.array(A, dtype='S1').view('uint8')-97)  # 'S1' keeps one byte per character on Python 3
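As a quick cross-check, the sketch below computes the counts in three of the ways discussed here and prints them side by side. The byte-code variant uses np.frombuffer on an ASCII-encoded string, which is my substitution for illustration; it is only meaningful for lowercase ASCII input, and any unused letter between 'a' and the largest letter would show up as a zero count.

```python
import numpy as np

A = list('aaabaaacaaadaaac')

# 1) Counts straight from np.unique.
c1 = np.unique(A, return_counts=True)[1]

# 2) bincount over the inverse indices returned by np.unique.
c2 = np.bincount(np.unique(A, return_inverse=True)[1])

# 3) bincount over byte codes; assumes lowercase ASCII letters only.
codes = np.frombuffer(''.join(A).encode('ascii'), dtype=np.uint8)
c3 = np.bincount(codes - ord('a'))

print(c1, c2, c3)
```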
Besides defaultdict there are a couple of other counters. Testing a slightly simpler case:
In [298]: from collections import defaultdict
In [299]: from collections import defaultdict, Counter
In [300]: def foo(l):
     ...:     counter = defaultdict(int)
     ...:     for i in l:
     ...:         counter[i] += 1
     ...:     return counter
     ...:
In [301]: short_list = list('aaabaaacaaadaaac')
In [302]: foo(short_list)
Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [303]: Counter(short_list)
Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [304]: arr=[ord(i)-ord('a') for i in short_list]
In [305]: np.bincount(arr)
Out[305]: array([12, 1, 2, 1], dtype=int32)
I constructed arr because bincount only works with ints.
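When the values are not single letters, the ord trick no longer applies, but any mapping from values to small non-negative integers will do for bincount. One option, not used in the answer and shown here only as a sketch, is pd.factorize, which as far as I know is hash-based rather than sort-based. The sample values below are made up.

```python
import numpy as np
import pandas as pd

values = ['pear', 'apple', 'pear', 'fig', 'apple', 'pear']  # made-up sample data

# pd.factorize assigns integer codes in order of first appearance.
codes, uniques = pd.factorize(values)

counts = np.bincount(codes)

print(codes)          # integer label per element
print(list(uniques))  # the value each label stands for
print(counts)         # occurrences per label
```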
In [306]: timeit np.bincount(arr)
The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.63 µs per loop
In [307]: timeit Counter(arr)
100000 loops, best of 3: 13.6 µs per loop
In [308]: timeit foo(arr)
100000 loops, best of 3: 6.49 µs per loop
I'm guessing it would be hard to improve on pir2, which is based on defaultdict.
Searching and counting like this are not a strong area for numpy.
short_list = np.array(list('aaabaaacaaadaaac'))
dfill takes an array and returns the positions where the array changes, repeating each index position until the next change.
# dfill
#
# Example with short_list
#
#    0  0  0  3  4  4  4  7  8  8  8 11 12 12 12 15
# [  a  a  a  b  a  a  a  c  a  a  a  d  a  a  a  c]
#
# Example with short_list after sorting
#
#    0  0  0  0  0  0  0  0  0  0  0  0 12 13 13 15
# [  a  a  a  a  a  a  a  a  a  a  a  a  b  c  c  d]
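The dfill behaviour described above can be reproduced with the following sketch, which reuses the definition given later in this answer and adds comments on the two steps. It prints the result for short_list before and after sorting.

```python
import numpy as np

def dfill(a):
    # Same definition as given later in this answer.
    n = a.size
    # Start index of every run, plus n as a closing sentinel.
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    # Repeat each run's start index for the length of that run.
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

short_list = np.array(list('aaabaaacaaadaaac'))

print(dfill(short_list))
print(dfill(np.sort(short_list)))
```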
argunsort returns the permutation necessary to undo a sort, given the argsort array. The existence of this method became known to me via this post. With this, I can get the argsort array and sort my array with it. Then I can undo the sort without the overhead of sorting again.
cumcount will take an array, sort it, and find the dfill array. An np.arange less dfill will give me the cumulative count. Then I un-sort.
# cumcount
#
# Example with short_list
#
# short_list:
# [  a  a  a  b  a  a  a  c  a  a  a  d  a  a  a  c]
#
# short_list.argsort():
# [  0  1  2  4  5  6  8  9 10 12 13 14  3  7 15 11]
#
# Example with short_list after sorting
#
# short_list[short_list.argsort()]:
# [  a  a  a  a  a  a  a  a  a  a  a  a  b  c  c  d]
#
# dfill(short_list[short_list.argsort()]):
# [  0  0  0  0  0  0  0  0  0  0  0  0 12 13 13 15]
#
# np.arange(short_list.size):
# [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
#
# np.arange(short_list.size) -
# dfill(short_list[short_list.argsort()]):
# [  0  1  2  3  4  5  6  7  8  9 10 11  0  0  1  0]
#
# unsorted:
# [  0  1  2  0  3  4  5  0  6  7  8  0  9 10 11  1]
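The un-sorting step can be checked in isolation. The sketch below copies the argunsort definition from this answer, confirms that applying it after the sort restores the original array, and compares it with a second argsort, which yields the same permutation at the cost of another sort.

```python
import numpy as np

def argunsort(s):
    # Scatter positions: u[s[k]] = k, which inverts the permutation s.
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

a = np.array(list('aaabaaacaaadaaac'))

s = a.argsort(kind='mergesort')   # stable sort keeps equal items in order
u = argunsort(s)

# Sorting and then indexing with u restores the original array.
print(''.join(a[s][u]))

# A second argsort gives the same permutation, but it costs another sort.
print(np.array_equal(u, s.argsort()))
```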
foo function recommended by @hpaulj using defaultdict
div function recommended by @Divakar (old, I'm sure he'd update it)

import numpy as np
from collections import defaultdict

def dfill(a):
    n = a.size
    # start index of every run, plus n as a closing sentinel
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def foo(l):
    n = len(l)
    r = np.empty(n, dtype=np.int64)
    counter = defaultdict(int)
    for i in range(n):
        counter[l[i]] += 1
        r[i] = counter[l[i]]
    return r - 1

def div(l):
    a = np.unique(l, return_counts=1)[1]
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    rng = id_arr.cumsum()
    return rng[argunsort(np.argsort(l))]

cumcount(short_list)
array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
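Beyond the single short_list example, one way to gain some confidence in cumcount is to compare it with the pandas groupby result from the question on random input. The sketch below repeats the three helper definitions so that it runs on its own; the seed and array length are arbitrary choices made for this check.

```python
import numpy as np
import pandas as pd
from string import ascii_letters

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    u = np.empty(s.size, dtype=np.int64)
    u[s] = np.arange(s.size)
    return u

def cumcount(a):
    s = a.argsort(kind='mergesort')
    return (np.arange(a.size) - dfill(a[s]))[argunsort(s)]

rng = np.random.RandomState(0)   # arbitrary seed, chosen for this sketch
a = rng.choice(list(ascii_letters), 1000)

# Reference result from the groupby approach used in pir1.
ser = pd.Series(a)
expected = ser.groupby(ser).cumcount().to_numpy()

print(np.array_equal(cumcount(a), expected))
```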
from string import ascii_letters
from timeit import timeit

# NOTE: 'foo2' is not defined in this post; define it or drop it before running
functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function')
lengths = pd.RangeIndex(100, 1100, 100, name='array length')
results = pd.DataFrame(index=lengths, columns=functions, dtype=float)

for i in lengths:
    a = np.random.choice(list(ascii_letters), i)
    for j in functions:
        # DataFrame.set_value was removed from pandas; .at does the same job
        results.at[i, j] = timeit(
            '{}(a)'.format(j),
            'from __main__ import a, {}'.format(j),
            number=1000
        )

results.plot()