Conditional length of a binary data series in Pandas

Question

I have a DataFrame with the following column:

df['A'] = [1,1,1,0,1,1,1,1,0,1]

What would be the best vectorized way to cap the length of each run of consecutive 1s at some limiting value, setting any further 1s in the run to 0? Let's say the limit is 2, then the resulting column 'B' must look like:

df['B'] = [1,1,0,0,1,1,0,0,0,1]

Answer 1

One fully vectorized solution is to use the shift / groupby / cumsum / cumcount combination¹ to number each element within its run of consecutive equal values, which shows which elements lie within the first 2 positions of their run (or whatever limiting value you like). Then, & this new boolean Series with the original column:

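run_id = (df['A'] != df['A'].shift()).cumsum()  # label each run of equal values
pos_in_run = df.groupby(run_id).cumcount()      # 0-based position within each run
df['B'] = ((pos_in_run <= 1) & df['A'].eq(1)).astype(int)  # cast the boolean Series back to integers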

This produces the new column B in the DataFrame, with the values 1, 1, 0, 0, 1, 1, 0, 0, 0, 1.

¹ See the pandas cookbook, in the section on grouping: "Grouping like Python's itertools.groupby".
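To make the intermediate steps of that one-liner easier to follow, here is a minimal sketch that wraps them in a function. The name limit_runs and its limit parameter are made up for illustration and are not part of pandas; the sketch assumes the column holds only 0s and 1s.

```python
import pandas as pd

def limit_runs(series, limit):
    """Zero out every 1 beyond the first `limit` in each run of consecutive 1s."""
    # A new run starts wherever a value differs from its predecessor;
    # the cumulative sum of those change points gives each run its own label.
    run_id = (series != series.shift()).cumsum()
    # 0-based position of every element inside its own run.
    pos_in_run = series.groupby(run_id).cumcount()
    # Keep a value only if it is a 1 and sits among the first `limit` of its run.
    return ((pos_in_run < limit) & series.eq(1)).astype(int)

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = limit_runs(df['A'], limit=2)
print(df['B'].tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```

Comparing with series.eq(1) instead of and-ing with the raw integers keeps the result boolean regardless of the column's dtype.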

Answer 2

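Another way is to check, for each position, whether the previous limit values are all 1, and to keep a 1 only when they are not. Note that this is a plain Python loop rather than a vectorized operation:

In [442]: import pandas as pd

In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})

In [444]: limit = 2

In [445]: df['B'] = [int(df['A'].iloc[x] == 1 and not (x >= limit and all(y == 1 for y in df['A'].iloc[x - limit:x]))) for x in range(len(df))]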

In [446]: df
Out[446]: 
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
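The same "are the previous limit values all 1?" test can also be expressed with shifted copies of the column, which avoids the per-row Python loop. This is only a sketch under the assumption that the column contains just 0s and 1s; limit_runs_shift is a hypothetical helper name.

```python
import pandas as pd

def limit_runs_shift(series, limit):
    """Keep a 1 only if the `limit` values before it are not all 1."""
    prev_all_ones = pd.Series(True, index=series.index)
    for k in range(1, limit + 1):
        # shift(k) puts NaN in the first k rows, and NaN == 1 is False,
        # so the first `limit` rows can never count as "all previous are 1".
        prev_all_ones = prev_all_ones & series.shift(k).eq(1)
    return (series.eq(1) & ~prev_all_ones).astype(int)

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = limit_runs_shift(df['A'], limit=2)
print(df['B'].tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```

The loop here runs limit times, not once per row, so each iteration is still a whole-column operation.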

Answer 3

If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a pandas object, it can just be a normal NumPy array)

a = df['A'].to_numpy(copy=True)  # as_matrix() was removed in pandas 1.0

and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. For example, for a cutoff of 2, you would do

long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]

The resulting array, in this case, gives the number of 1's that occur in the 3 elements up to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.

a[long_run_count > 2] = 0

You can now assign the resulting array to a new column in your DataFrame.

df['B'] = a
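As a worked example of the steps above, the following sketch prints the windowed counts for the column from the question, so you can see which positions exceed the cutoff. The variable names are illustrative only.

```python
import numpy as np

a = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])
cutoff = 2

# Full convolution has len(a) + cutoff entries; dropping the tail leaves,
# for each position, the sum of that element and the `cutoff` before it.
full = np.convolve(a, np.ones(cutoff + 1, dtype=int))
window_count = full[:len(a)]
print(window_count.tolist())  # [1, 2, 3, 2, 2, 2, 3, 3, 2, 2]

# A count of cutoff + 1 means the element and all `cutoff` predecessors are 1.
b = a.copy()
b[window_count > cutoff] = 0
print(b.tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```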

To turn this into a more general method:

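import numpy

def trim_runs(array, cutoff):
    a = numpy.array(array)  # copy, so the caller's data is not modified
    # window[i] counts the 1s among the cutoff + 1 values ending at position i
    window = numpy.convolve(a, numpy.ones(cutoff + 1, dtype=int))[:len(a)]
    a[window > cutoff] = 0  # [:len(a)] also works for cutoff == 0, unlike [:-cutoff]
    return a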
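When trying out a general method like this, it can help to compare it against a slow but obvious reference. The following is a plain-Python sketch built on itertools.groupby; trim_runs_reference is a hypothetical name, and it assumes the input contains only 0s and 1s.

```python
from itertools import groupby

def trim_runs_reference(values, cutoff):
    """Keep at most `cutoff` leading 1s of every run of 1s; zero the rest."""
    out = []
    for value, run in groupby(values):
        run = list(run)
        if value == 1:
            kept = min(len(run), cutoff)
            out.extend([1] * kept + [0] * (len(run) - kept))
        else:
            out.extend(run)  # runs of 0s are passed through unchanged
    return out

print(trim_runs_reference([1, 1, 1, 0, 1, 1, 1, 1, 0, 1], 2))
# [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```

Because it walks the data one run at a time, this is only meant for checking results on small inputs, not as a replacement for the vectorized versions.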

3 answers

solution1
3 ACCEPTED 2016-08-28 09:28:21

solution2
2 2016-08-28 09:30:03

solution3
2 2016-08-28 09:45:33