How to get the cumulative count by unique value in linear time with numpy?

Question

Consider the following lists short_list and long_list:

import numpy as np
import pandas as pd
from string import ascii_letters

short_list = list('aaabaaacaaadaaac')
np.random.seed([3,1415])
long_list = pd.DataFrame(
    np.random.choice(list(ascii_letters),
                     (10000, 2))
).sum(1).tolist()

How do I calculate the cumulative count by unique value?

I want to use numpy and do it in linear time, so that I can compare timings with my other methods. It may be easiest to illustrate with my first proposed solution:

def pir1(l):
    s = pd.Series(l)
    return s.groupby(s).cumcount().tolist()

print(np.array(short_list))
print(pir1(short_list))

['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]

I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could use these to get at a solution. The best I got is pir4 below, which scales in quadratic time. Also note that I don't care if counts start at 1 or zero, as we can simply add or subtract 1.

Below are some of my attempts (none of which answer my question):

%%cython
from collections import defaultdict

def get_generator(l):
    counter = defaultdict(lambda: -1)
    for i in l:
        counter[i] += 1
        yield counter[i]

def pir2(l):
    return [i for i in get_generator(l)]

def pir3(l):
    return [i for i in get_generator(l)]

def pir4(l):
    unq, inv = np.unique(l, return_inverse=True)
    a = np.arange(len(unq))
    matches = a[:, None] == inv
    return (matches * matches.cumsum(1)).sum(0).tolist()
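
A quick look at the shapes involved suggests why pir4 cannot be linear: the broadcast comparison builds a boolean matrix with one row per unique value and one column per element, so memory and work both grow with len(unq) * len(l). A minimal sketch on the short sample (the variable names are only for illustration):

```python
import numpy as np

l = list('aaabaaacaaadaaac')
unq, inv = np.unique(l, return_inverse=True)

# One row per unique value, one column per input element.
matches = np.arange(len(unq))[:, None] == inv
print(matches.shape)  # (4, 16)

# Row-wise running totals, masked so each column keeps only its own row.
running = matches * matches.cumsum(axis=1)

# Summing over rows collapses the matrix to one count per element.
# These counts start at 1, so subtract 1 to start at zero.
result = running.sum(axis=0) - 1
print(result.tolist())
```

With many distinct values, as in long_list, that matrix becomes large very quickly.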

Answer 1

Here's a vectorized approach that uses a custom function for creating grouped ranges and np.unique for getting the counts:

def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return id_arr.cumsum()

count = np.unique(A,return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]

Sample run:

In [117]: A = list('aaabaaacaaadaaac')

In [118]: count = np.unique(A,return_counts=1)[1]
     ...: out = grp_range(count)[np.argsort(A).argsort()]
     ...: 

In [119]: out
Out[119]: array([ 0,  1,  2,  0,  3,  4,  5,  0,  6,  7,  8,  0,  9, 10, 11,  1])
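
To make the two pieces easier to follow, here is a small sketch that prints the intermediate arrays. It reuses grp_range from above. The one change, which is my assumption and not part of the answer, is passing kind='stable' to the first argsort so that equal values keep their original order, something the default sort does not guarantee.

```python
import numpy as np

def grp_range(a):
    # Concatenated ranges 0..a[0]-1, 0..a[1]-1, ... for the group sizes in a.
    idx = a.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    # At the start of each later group, step back down to zero.
    id_arr[idx[:-1]] = -a[:-1] + 1
    return id_arr.cumsum()

A = np.array(list('aaabaaacaaadaaac'))

count = np.unique(A, return_counts=True)[1]
print(count.tolist())    # group sizes in sorted order of the unique values

ranges = grp_range(count)
print(ranges.tolist())   # cumulative counts for the sorted array

# Position each element would take in a stable sort of A.
ranks = np.argsort(A, kind='stable').argsort()

# Look up each element's cumulative count by its sorted position.
out = ranges[ranks]
print(out.tolist())
```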

For getting count, a few other alternatives could be proposed with a focus on performance:

np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.frombuffer(b'aaabaaacaaadaaac',dtype=np.uint8)-97)

Additionally, with A containing only single-letter characters, we could get the count simply with:

# dtype='S1' stores one byte per character, so the uint8 view has one entry per element
np.bincount(np.array(A, dtype='S1').view('uint8')-97)
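
A small sketch checking that these ways of counting agree on the sample input. The dtype='S1' conversion is an assumption for Python 3, where NumPy stores str values as four-byte Unicode characters, and the - 97 offset assumes lowercase ASCII letters only.

```python
import numpy as np

A = list('aaabaaacaaadaaac')

# Counts straight from np.unique.
c1 = np.unique(A, return_counts=True)[1]

# Integer codes from return_inverse, then bincount.
c2 = np.bincount(np.unique(A, return_inverse=True)[1])

# Byte view of single-character strings; 97 is ord('a').
c3 = np.bincount(np.array(A, dtype='S1').view(np.uint8) - 97)

print(c1.tolist(), c2.tolist(), c3.tolist())
```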

Answer 2

Besides defaultdict there are a couple of other counters. Testing a slightly simpler case:

In [298]: from collections import defaultdict
In [299]: from collections import defaultdict, Counter
In [300]: def foo(l):
     ...:     counter = defaultdict(int)
     ...:     for i in l:
     ...:         counter[i] += 1
     ...:     return counter
     ...: 
In [301]: short_list = list('aaabaaacaaadaaac')
In [302]: foo(short_list)
Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [303]: Counter(short_list)
Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [304]: arr=[ord(i)-ord('a') for i in short_list]
In [305]: np.bincount(arr)
Out[305]: array([12,  1,  2,  1], dtype=int32)

I constructed arr because bincount only works with non-negative ints.
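
For labels that are not single letters, such as the two-letter strings in long_list, the ord trick no longer applies. One option, sketched here as an assumption of mine and not taken from the answer, is to number the labels with a plain dict in order of first appearance and hand those codes to bincount:

```python
import numpy as np

labels = ['ab', 'cd', 'ab', 'zz', 'cd', 'ab']

# Give each distinct label the next unused integer, in order of first appearance.
code_of = {}
codes = [code_of.setdefault(x, len(code_of)) for x in labels]

counts = np.bincount(codes)

# Pair each label with its total count.
print(dict(zip(code_of, counts.tolist())))
```

The dict pass is a Python-level loop, so it mostly serves to show what integer codes bincount needs.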

In [306]: timeit np.bincount(arr)
The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.63 µs per loop
In [307]: timeit Counter(arr)
100000 loops, best of 3: 13.6 µs per loop
In [308]: timeit foo(arr)
100000 loops, best of 3: 6.49 µs per loop

I'm guessing it would be hard to improve on pir2, which is based on defaultdict.

Searching and counting like this is not a strong area for numpy.

Answer 3

setup

short_list = np.array(list('aaabaaacaaadaaac'))

functions

dfill takes an array and returns the positions where the array changes, repeating each index position until the next change.
```
# dfill
#
# Example with short_list
#
#    0  0  0  3  4  4  4  7  8  8  8 11 12 12 12 15
# [  a  a  a  b  a  a  a  c  a  a  a  d  a  a  a  c]
#
# Example with short_list after sorting
#
#    0  0  0  0  0  0  0  0  0  0  0  0 12 13 13 15
# [  a  a  a  a  a  a  a  a  a  a  a  a  b  c  c  d]
```
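
The same example can be reproduced in code. This sketch copies the dfill definition given under "code" further down in this answer, adds comments, and applies it to short_list before and after sorting:

```python
import numpy as np

def dfill(a):
    n = a.size
    # Start index of every run of equal values, plus n as an end marker.
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    # Repeat each run's start index for the length of that run.
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

short_list = np.array(list('aaabaaacaaadaaac'))

unsorted_fill = dfill(short_list)
sorted_fill = dfill(np.sort(short_list))

print(unsorted_fill.tolist())
print(sorted_fill.tolist())
```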
argunsort returns the permutation necessary to undo a sort, given the argsort array. The existence of this method became known to me via this post. With this, I can get the argsort array and sort my array with it. Then I can undo the sort without the overhead of sorting again.
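
A minimal sketch of that idea: scattering np.arange(n) into the positions given by the argsort array yields the inverse permutation. It matches what a second argsort would return but avoids sorting again. The sample array is only for illustration.

```python
import numpy as np

def argunsort(s):
    # u[s[k]] = k, so u maps each original position to its sorted position.
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

a = np.array([30, 10, 20, 10])
s = a.argsort(kind='stable')
u = argunsort(s)

print(s.tolist())          # order that sorts a
print(u.tolist())          # inverse of that order
print(a[s].tolist())       # sorted values
print(a[s][u].tolist())    # back in the original order
```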

cumcount will take an array, sort it, and find the dfill array. An np.arange less the dfill array will give me the cumulative count. Then I un-sort.

# cumcount
#
# Example with short_list
#
# short_list:
# [  a  a  a  b  a  a  a  c  a  a  a  d  a  a  a  c]
#
# short_list.argsort():
# [  0  1  2  4  5  6  8  9 10 12 13 14  3  7 15 11]
#
# Example with short_list after sorting
#
# short_list[short_list.argsort()]:
# [  a  a  a  a  a  a  a  a  a  a  a  a  b  c  c  d]
#
# dfill(short_list[short_list.argsort()]):
# [  0  0  0  0  0  0  0  0  0  0  0  0 12 13 13 15]
#
# np.arange(short_list.size):
# [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
#
# np.arange(short_list.size) -
#     dfill(short_list[short_list.argsort()]):
# [  0  1  2  3  4  5  6  7  8  9 10 11  0  0  1  0]
#
# unsorted:
# [  0  1  2  0  3  4  5  0  6  7  8  0  9 10 11  1]

foo is the function recommended by @hpaulj using defaultdict.
div is the function recommended by @Divakar (old, I'm sure he'd update it).

code

import numpy as np
from collections import defaultdict

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def foo(l):
    n = len(l)
    r = np.empty(n, dtype=np.int64)
    counter = defaultdict(int)
    for i in range(n):
        counter[l[i]] += 1
        r[i] = counter[l[i]]
    return r - 1

def div(l):
    a = np.unique(l, return_counts=1)[1]
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    rng = id_arr.cumsum()
    return rng[argunsort(np.argsort(l))]

demonstration

cumcount(short_list)

array([ 0,  1,  2,  0,  3,  4,  5,  0,  6,  7,  8,  0,  9, 10, 11,  1])
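
As a sanity check, the result can be compared with a plain dictionary-based loop on random input. This is only a sketch: reference is a hypothetical helper written for the comparison, and the seed and array size are arbitrary choices.

```python
import numpy as np
from collections import defaultdict
from string import ascii_letters

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')  # stable, so ties keep their original order
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def reference(values):
    # Hypothetical helper: one pass, counting earlier occurrences of each value.
    seen = defaultdict(int)
    out = []
    for v in values:
        out.append(seen[v])
        seen[v] += 1
    return out

rng = np.random.RandomState(0)
a = rng.choice(list(ascii_letters), 1000)

assert cumcount(a).tolist() == reference(a)
print("cumcount matches the reference loop")
```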

time testing

code

from string import ascii_letters
from timeit import timeit

functions = pd.Index(['cumcount', 'foo', 'div'], name='function')
lengths = pd.RangeIndex(100, 1100, 100, name='array length')
results = pd.DataFrame(index=lengths, columns=functions)

for i in lengths:
    a = np.random.choice(list(ascii_letters), i)
    for j in functions:
        # store the total time for 1000 calls of function j on an array of length i
        results.at[i, j] = timeit(
            '{}(a)'.format(j),
            'from __main__ import a, {}'.format(j),
            number=1000
        )

# the frame starts out with object dtype, so convert before plotting
results.astype(float).plot()

Answer 1: score 5, accepted, 2016-11-15 08:32:56
Answer 2: score 4, 2016-11-15 05:16:20
Answer 3: score 4, 2017-01-09 22:37:04
