
Conditional length of a binary data series in Pandas

I have a DataFrame with the following column:

df['A'] = [1,1,1,0,1,1,1,1,0,1]

What would be the best vectorized way to cap the length of each run of consecutive 1s at some limiting value? Say the limit is 2; then the resulting column 'B' must look like:

   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1

One fully vectorized solution is to use the shift - groupby - cumsum - cumcount combination [1] to label each run of consecutive equal values and number the rows within it. Rows whose position in their run is below the limit (here 2, or whatever limiting value you like) are marked True. Then, & this new boolean Series with the original column:

df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
          .astype(int) # cast the boolean Series back to integers

This produces the new column in the DataFrame:

   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1

[1] See the pandas cookbook, in the section on grouping: "Grouping like Python's itertools.groupby"
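To see why the one-liner works, it can help to look at the intermediate Series one at a time. The sketch below rebuilds the same expression step by step on the sample data. The names run_start, run_id, pos_in_run and limit are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2  # maximum allowed run length (illustrative name)

# True wherever the value differs from the previous row, i.e. a new run starts
run_start = df['A'] != df['A'].shift()
# Running total of run starts gives every run its own integer label
run_id = run_start.cumsum()
# 0-based position of each row inside its run
pos_in_run = df.groupby(run_id).cumcount()
# Keep a 1 only while it is among the first `limit` rows of its run
df['B'] = ((pos_in_run < limit) & (df['A'] == 1)).astype(int)

# Show the helper columns next to A and B
print(df.assign(run_id=run_id, pos_in_run=pos_in_run))
```

Writing the comparison as pos_in_run < limit is equivalent to the <= 1 above for a limit of 2, and makes the limit a single adjustable value.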

Another way, element by element in plain Python rather than vectorized (checking whether the previous limit values are all 1):

In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})

In [444]: limit = 2

In [445]: df['B'] = [int(df['A'][x] == 1 and (x < limit or not all(y == 1 for y in df['A'][x - limit:x]))) for x in range(len(df))]  # list, not map (lazy in Python 3); a 0 in A always stays 0

In [446]: df
Out[446]: 
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
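The same "were the previous limit values all 1?" test can probably be expressed without a Python-level loop. The following is a minimal sketch using shift and rolling, which is a swapped-in technique rather than what the answer above does. The names prev_sum and prev_all_ones are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2

# Sum of the `limit` values strictly before each row;
# NaN where fewer than `limit` earlier rows exist
prev_sum = df['A'].shift(1).rolling(limit).sum()
# True where every one of the previous `limit` values was 1
# (NaN == limit evaluates to False, so the first rows pass through)
prev_all_ones = prev_sum == limit
# Keep a 1 only if it is not preceded by `limit` consecutive 1s
df['B'] = ((df['A'] == 1) & ~prev_all_ones).astype(int)

print(df)
```

This relies on the column containing only 0s and 1s, since it detects a full window of 1s by its sum.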

If you know that the values in the series will all be either 0 or 1, you can use a little trick involving convolution. Make a copy of your column (it need not be a pandas object; a plain NumPy array is fine):

a = df['A'].to_numpy(copy=True)  # as_matrix() was removed in pandas 1.0; copy=True keeps df['A'] untouched

and convolve it with a sequence of 1s that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do

import numpy

long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]

The resulting array, in this case, gives for each position the number of 1s among that element and the two elements before it. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.

a[long_run_count > 2] = 0

You can now assign the resulting array to a new column in your DataFrame.

df['B'] = a
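As a worked example of the steps above, here is the whole sequence run on the sample column, printing the intermediate window counts. It assumes a pandas version that provides Series.to_numpy, which takes the place of the older as_matrix call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})

# Work on a copy so that column A is left unchanged
a = df['A'].to_numpy(copy=True)

# Full convolution has len(a) + 2 entries; dropping the last 2 aligns
# entry i with the window a[i-2], a[i-1], a[i]
long_run_count = np.convolve(a, [1, 1, 1])[:-2]
print(long_run_count)

# A count of 3 means this element is at least the third 1 in a row
a[long_run_count > 2] = 0
df['B'] = a
print(df)
```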

To turn this into a more general method:

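import numpy

def trim_runs(array, cutoff):
    a = numpy.array(array)  # copy, so the caller's data is not modified
    # window_sum[i] = a[i] plus the `cutoff` elements before it
    window_sum = numpy.convolve(a, numpy.ones(cutoff + 1, dtype=int))[:len(a)]
    a[window_sum > cutoff] = 0
    return a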
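One way to gain confidence that the convolution trick and the groupby approach agree is to compare them on random 0/1 data. The sketch below does that. The function names cap_runs_groupby and cap_runs_convolve are hypothetical wrappers written for this check, and both assume the input Series contains only 0s and 1s.

```python
import numpy as np
import pandas as pd

def cap_runs_groupby(s, limit):
    """Label runs with shift/cumsum, then keep the first `limit` 1s of each run."""
    run_id = (s != s.shift()).cumsum()
    pos_in_run = s.groupby(run_id).cumcount()
    return ((pos_in_run < limit) & (s == 1)).astype(int)

def cap_runs_convolve(s, limit):
    """Zero out every 1 that completes a window of `limit + 1` consecutive 1s."""
    a = s.to_numpy(copy=True)
    window_sum = np.convolve(a, np.ones(limit + 1, dtype=int))[:len(a)]
    a[window_sum > limit] = 0
    return pd.Series(a, index=s.index)

# Random binary data with a fixed seed so the check is repeatable
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 2, size=1000))

for limit in (1, 2, 3, 5):
    by_groupby = cap_runs_groupby(s, limit).to_numpy()
    by_convolve = cap_runs_convolve(s, limit).to_numpy()
    assert np.array_equal(by_groupby, by_convolve)
```

The comparison goes through NumPy arrays rather than Series.equals so that a platform-dependent difference in integer width does not cause a false mismatch.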
