Finding unique sets without subsets in python array

Question

I have a dataset that needs to output boolean-style data, just 1 and 0, for true or not true. I am trying to parse simple data sets I've processed to look for a subset of information in a numpy array. The array is about 100,000 elements in one direction and 20 in the other. I only need to search along the 20 axis, but I need to do that for each of the 100,000 entries and get output that I can map.

I've produced an array of this size made up of zeros, with the intention to simply mark the matching index indicator to a 1. A main hitch is that if I find a long set (I'm working with long sets to small sets), I need to NOT include any smaller set that's within it.

Sample: [0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,0,1]

I need to find here that there is 1 group of 5, starting at index 2, and 1 group of 3, starting at index 9, and not return any subset of the group of 5 as though it were a group of 4 or a group of 3, leaving the results for all those already-covered values at zero. I.e., for groups of 3, the indices 2, 3, 4, 5, and 6 would all remain zero. It doesn't need to be overly efficient; I don't care if it searches anyway, I just need to not keep the result.

Currently I'm using a codeblock basically like this for a simple search:

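import numpy

values = numpy.array([0,1,1,1,1,1,0,0,1,1,1])
searchval = [1,2]
N = len(searchval)
# candidate start positions: wherever the first search value occurs
possibles = numpy.where(values == searchval[0])[0]
print(possibles)
solns = []
for p in possibles:
    check = values[p:p+N]
    # skip windows cut short by the end of the array
    if len(check) == N and numpy.all(check == searchval):
        solns.append(p)
print(solns)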

I've been racking my brain trying to come up with a way to restructure this or similar code to produce the desired results. The end goal is to search for groups of 9 down to groups of 3, and to end up with effectively a matrix of 1s and 0s indicating whether an index has a group starting on it that is as long as we want.

Hopefully someone can point me to what I'm missing to make this work. Thanks!

Answer 1

Something like this?

from collections import defaultdict

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]

# Keys are numbers of consecutive 1's, values are lists of start indices
results = defaultdict(list)
found = 0

for i, x in enumerate(sample):
    if x == 1:
        found += 1
    elif i == 0 or found == 0:
        continue
    else:
        results[found].append(i - found)
        found = 0

if found:
    results[found].append(i - found + 1)

assert results == {1: [15, 17], 3: [9], 5: [2]}
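
If the goal is the 0/1 matrix described in the question, the dictionary above could be turned into one roughly like this. This is only a sketch: the layout (one row per group length from 3 to 9) and the names min_len, max_len and indicator are my own assumptions, not something the question pins down.

```python
import numpy as np

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
# Run lengths mapped to start indices, as produced by the loop above.
results = {1: [15, 17], 3: [9], 5: [2]}

min_len, max_len = 3, 9  # group sizes of interest, taken from the question
# Assumed layout: row k marks the starts of runs of exactly min_len + k ones.
indicator = np.zeros((max_len - min_len + 1, len(sample)), dtype=np.int8)
for length, starts in results.items():
    if min_len <= length <= max_len:
        indicator[length - min_len, starts] = 1

print(np.flatnonzero(indicator[0]).tolist())  # groups of 3 start at: [9]
print(np.flatnonzero(indicator[2]).tolist())  # groups of 5 start at: [2]
```

Because each run is recorded once under its full length, the group of 5 never shows up in the rows for 3 or 4.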

Answer 2

Using more_itertools, a third-party library (pip install more_itertools):

import more_itertools as mit


sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]

groups = [list(c) for c in mit.consecutive_groups((mit.locate(sample)))]
d = {group[0]: len(group) for group in groups}
d
# {2: 5, 9: 3, 15: 1, 17: 1}

This result reads "At index 2 is a group of 5 ones. At index 9 is a group of 3 ones," etc.

Details

more_itertools.locate finds indices for truthy items by default.
more_itertools.consecutive_groups chunks consecutive numbers together.
The result is a dictionary of (starting-index, length) pairs.

As a dictionary, you can extract different kinds of information:

>>> # List of starting indices
>>> list(d)
[2, 9, 15, 17]

>>> # List indices for all lonely groups
>>> [k for k, v in d.items() if v == 1]
[15, 17]

>>> # List indices of groups of 2 or more items
>>> [k for k, v in d.items() if v > 1]
[2, 9]
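
If installing a third-party package is not an option, roughly the same dictionary can probably be built with itertools.groupby from the standard library. A minimal sketch, assuming the input only contains 0s and 1s; run_starts is a name I made up for illustration:

```python
from itertools import groupby

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]

def run_starts(seq):
    """Map the start index of each run of truthy items to the run's length."""
    runs = {}
    index = 0
    for is_one, group in groupby(seq, key=bool):
        length = sum(1 for _ in group)  # consume the group to count it
        if is_one:
            runs[index] = length
        index += length  # advance past this run, whether ones or zeros
    return runs

d = run_starts(sample)
print(d)  # {2: 5, 9: 3, 15: 1, 17: 1}
```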

Answer 3

Here is a numpy solution. I'm using a small example for demonstration, but it scales easily (20 x 100,000 takes about 25 ms on my rather modest laptop; see the timings at the end of this post):

>>> import numpy as np
>>> 
>>> 
>>> a = np.random.randint(0, 2, (5, 10), dtype=np.int8)
>>> a
array([[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
       [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 1, 1, 1, 0, 0]], dtype=int8)
>>> 
>>> padded = np.pad(a,((1,1),(0,0)), 'constant')
# compare array to itself with offset to mark all switches from
# 0 to 1 or from 1 to 0
# then use 'where' to extract the coordinates
>>> colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
>>> 
# the lengths of sets are the differences between switch points
>>> lengths = rowinds[1::2] - rowinds[::2]
# now we have the lengths we are free to throw the off-switches away
>>> colinds, rowinds = colinds[::2], rowinds[::2]
>>> 
# admire
>>> from pprint import pprint
>>> pprint(list(zip(colinds, rowinds, lengths)))
[(0, 2, 1),
 (1, 0, 2),
 (2, 1, 2),
 (2, 4, 1),
 (3, 2, 1),
 (4, 0, 5),
 (5, 0, 1),
 (5, 2, 1),
 (5, 4, 1),
 (6, 1, 1),
 (6, 3, 2),
 (7, 4, 1)]
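
Each triple above is (column, starting row, length). To make the switch-point idea easier to follow, here is the same trick sketched on the 1-D sample from the question. It is an illustration under that simplification, not part of the 2-D code above:

```python
import numpy as np

sample = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1])

# Pad with a zero on each side so runs touching either end still
# produce both an on-switch and an off-switch.
padded = np.pad(sample, 1, 'constant')
# Positions where neighbouring values differ; they alternate on, off, on, off...
switches = np.flatnonzero(padded[:-1] != padded[1:])
starts, stops = switches[::2], switches[1::2]
lengths = stops - starts

print(list(zip(starts.tolist(), lengths.tolist())))
# [(2, 5), (9, 3), (15, 1), (17, 1)]
```

Thanks to the padding, every run contributes exactly one on-switch and one off-switch, which is why slicing with [::2] and [1::2] pairs them up correctly.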

Timings:

>>> def find_stretches(a):
...     padded = np.pad(a,((1,1),(0,0)), 'constant')
...     colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
...     lengths = rowinds[1::2] - rowinds[::2]
...     colinds, rowinds = colinds[::2], rowinds[::2]
...     return colinds, rowinds, lengths
... 
>>> a = np.random.randint(0, 2, (20, 100000), dtype=np.int8)
>>> from timeit import repeat
>>> kwds = dict(globals=globals(), number=100)
>>> repeat('find_stretches(a)', **kwds)
[2.475784719004878, 2.4715258619980887, 2.4705517270049313]
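
To get from these stretches to the matrix of 1s and 0s the question asks for, something along these lines might work. It is a sketch only: start_indicator and its output layout (one plane per group length from 3 to 9) are my own assumptions about what "a matrix I can map" means.

```python
import numpy as np

def find_stretches(a):
    # Same function as in the timings above, repeated so the sketch runs alone.
    padded = np.pad(a, ((1, 1), (0, 0)), 'constant')
    colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
    lengths = rowinds[1::2] - rowinds[::2]
    return colinds[::2], rowinds[::2], lengths

def start_indicator(a, min_len=3, max_len=9):
    """Hypothetical helper: out[k, r, c] is 1 where a run of exactly
    min_len + k ones starts at row r of column c."""
    colinds, rowinds, lengths = find_stretches(a)
    out = np.zeros((max_len - min_len + 1,) + a.shape, dtype=np.int8)
    keep = (lengths >= min_len) & (lengths <= max_len)
    out[lengths[keep] - min_len, rowinds[keep], colinds[keep]] = 1
    return out

sample = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
                  dtype=np.int8)
a = sample[:, None]  # one column, since find_stretches scans along axis 0
ind = start_indicator(a)
print(np.flatnonzero(ind[0, :, 0]).tolist())  # groups of 3 start at: [9]
print(np.flatnonzero(ind[2, :, 0]).tolist())  # groups of 5 start at: [2]
```

Since each stretch is recorded only under its full length, a group of 5 is never also reported as a group of 4 or 3.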

3 answers

solution1 0 2018-02-07 23:12:48
solution2 0 ACCEPTED 2018-02-08 02:59:33
solution3 0 2018-02-08 03:03:35
