How to use numpy to get the cumulative count by unique values in linear time?

Question

Consider the following lists short_list and long_list

short_list = list('aaabaaacaaadaaac')
np.random.seed([3,1415])
long_list = pd.DataFrame(
    np.random.choice(list(ascii_letters),
                     (10000, 2))
).sum(1).tolist()

How do I calculate the cumulative count by unique value?

I want to use numpy and do it in linear time. I want this to compare timings with my other methods. It may be easiest to illustrate with my first proposed solution

def pir1(l):
    s = pd.Series(l)
    return s.groupby(s).cumcount().tolist()

print(np.array(short_list))
print(pir1(short_list))

['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]

I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could these to get at a solution. The best I got is in pir4 below which scales in quadratic time. Also note that I don't care if counts start at 1 or zero as we can simply add or subtract 1.

Below are some of my attempts (none of which answer my question)

%%cython
from collections import defaultdict

def get_generator(l):
    counter = defaultdict(lambda: -1)
    for i in l:
        counter[i] += 1
        yield counter[i]

def pir2(l):
    return [i for i in get_generator(l)]

def pir3(l):
    return [i for i in get_generator(l)]

def pir4(l):
    unq, inv = np.unique(l, 0, 1, 0)
    a = np.arange(len(unq))
    matches = a[:, None] == inv
    return (matches * matches.cumsum(1)).sum(0).tolist()

Answer 1

Here's a vectorized approach using custom grouped range creating function and np.unique for getting the counts -

def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return id_arr.cumsum()

count = np.unique(A,return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]

Sample run -

In [117]: A = list('aaabaaacaaadaaac')

In [118]: count = np.unique(A,return_counts=1)[1]
     ...: out = grp_range(count)[np.argsort(A).argsort()]
     ...: 

In [119]: out
Out[119]: array([ 0,  1,  2,  0,  3,  4,  5,  0,  6,  7,  8,  0,  9, 10, 11,  1])

For getting the count , few other alternatives could be proposed with focus on performance -

np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97)

Additionally, with A containing single-letter characters, we could get the count simply with -

np.bincount(np.array(A).view('uint8')-97)

Answer 2

Besides defaultdict there are a couple of other counters. Testing a slightly simpler case:

In [298]: from collections import defaultdict
In [299]: from collections import defaultdict, Counter
In [300]: def foo(l):
     ...:     counter = defaultdict(int)
     ...:     for i in l:
     ...:         counter[i] += 1
     ...:     return counter
     ...: 
In [301]: short_list = list('aaabaaacaaadaaac')
In [302]: foo(short_list)
Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [303]: Counter(short_list)
Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [304]: arr=[ord(i)-ord('a') for i in short_list]
In [305]: np.bincount(arr)
Out[305]: array([12,  1,  2,  1], dtype=int32)

I constructed arr because bincount only works with ints.

In [306]: timeit np.bincount(arr)
The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.63 µs per loop
In [307]: timeit Counter(arr)
100000 loops, best of 3: 13.6 µs per loop
In [308]: timeit foo(arr)
100000 loops, best of 3: 6.49 µs per loop

I'm guessing it would hard to improve on pir2 based on default_dict.

Searching and counting like this are not a strong area for numpy .

Answer 3

setup

short_list = np.array(list('aaabaaacaaadaaac'))

functions

dfill takes an array and returns the positions where the array changes and repeats that index position until the next change.

 # dfill # # Example with short_list # # 0 0 0 3 4 4 4 7 8 8 8 11 12 12 12 15 # [ aaabaaacaaadaaac] # # Example with short_list after sorting # # 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15 # [ aaaaaaaaaaaabccd]

argunsort returns the permutation necessary to undo a sort given the argsort array. The existence of this method became know to me via this post. . With this, I can get the argsort array and sort my array with it. Then I can undo the sort without the overhead of sorting again.

cumcount will take an array sort it, find the dfill array. An np.arange less dfill will give me cumulative count. Then I un-sort

 # cumcount # # Example with short_list # # short_list: # [ aaabaaacaaadaaac] # # short_list.argsort(): # [ 0 1 2 4 5 6 8 9 10 12 13 14 3 7 15 11] # # Example with short_list after sorting # # short_list[short_list.argsort()]: # [ aaaaaaaaaaaabccd] # # dfill(short_list[short_list.argsort()]): # [ 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15] # # np.range(short_list.size): # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] # # np.range(short_list.size) - # dfill(short_list[short_list.argsort()]): # [ 0 1 2 3 4 5 6 7 8 9 10 11 0 0 1 0] # # unsorted: # [ 0 1 2 0 3 4 5 0 6 7 8 0 9 10 11 1]

foo function recommended by @hpaulj using defaultdict
div function recommended by @Divakar (old, I'm sure he'd update it)

code

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def foo(l):
    n = len(l)
    r = np.empty(n, dtype=np.int64)
    counter = defaultdict(int)
    for i in range(n):
        counter[l[i]] += 1
        r[i] = counter[l[i]]
    return r - 1

def div(l):
    a = np.unique(l, return_counts=1)[1]
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    rng = id_arr.cumsum()
    return rng[argunsort(np.argsort(l))]

demonstration

cumcount(short_list)

array([ 0,  1,  2,  0,  3,  4,  5,  0,  6,  7,  8,  0,  9, 10, 11,  1])

time testing

code

functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function')
lengths = pd.RangeIndex(100, 1100, 100, 'array length')
results = pd.DataFrame(index=lengths, columns=functions)

from string import ascii_letters

for i in lengths:
    a = np.random.choice(list(ascii_letters), i)
    for j in functions:
        results.set_value(
            i, j,
            timeit(
                '{}(a)'.format(j),
                'from __main__ import a, {}'.format(j),
                number=1000
            )
        )

results.plot()

How to use numpy to get the cumulative count by unique values in linear time?

Question

3 answers

solution1
5 ACCPTED 2016-11-15 08:32:56

solution2
4 2016-11-15 05:16:20

solution3
4 2017-01-09 22:37:04

setup

functions

code

demonstration

time testing

code

How to use numpy to get the cumulative count by unique values in linear time?

Question

3 answers

solution1 5 ACCPTED 2016-11-15 08:32:56

solution2 4 2016-11-15 05:16:20

solution3 4 2017-01-09 22:37:04

setup

functions

code

demonstration

time testing

code

solution1
5 ACCPTED 2016-11-15 08:32:56

solution2
4 2016-11-15 05:16:20

solution3
4 2017-01-09 22:37:04