Conditional length of a binary data series in Pandas

I have a DataFrame with the following column:

df['A'] = [1,1,1,0,1,1,1,1,0,1]

What would be the best vectorized way to cap the length of each run of consecutive 1s at some limiting value? Let's say the limit is 2; then the resulting column 'B' must look like:

   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1

One fully vectorized solution is to use the shift - groupby - cumsum - cumcount combination [1] to mark the rows that fall within the first 2 positions of their consecutive run (or whatever limiting value you like). Then & this new boolean Series with the original column:

df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
          .astype(int) # cast the boolean Series back to integers

This produces the new column in the DataFrame:

   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1

[1] See the pandas cookbook, the section on grouping, "Grouping like Python's itertools.groupby".
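To make the one-liner above easier to follow, here is a sketch that splits it into its intermediate steps. The names run_start, run_id, pos_in_run and limit are introduced here for illustration only and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2  # maximum allowed length of a run of 1s (illustrative name)

# True wherever the value differs from the previous row, i.e. a new run starts
run_start = df['A'] != df['A'].shift()
# The cumulative sum of those flags gives every run its own integer label
run_id = run_start.cumsum()
# Zero-based position of each row inside its run
pos_in_run = df.groupby(run_id).cumcount()
# Keep a 1 only while its position within the run is below the limit
df['B'] = ((pos_in_run < limit) & (df['A'] == 1)).astype(int)

print(pd.DataFrame({'A': df['A'], 'run_id': run_id,
                    'pos_in_run': pos_in_run, 'B': df['B']}))
```

Comparing against df['A'] == 1 instead of using df['A'] directly keeps both operands of & boolean, which avoids relying on how pandas mixes boolean and integer dtypes.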

Another way (checking whether the previous two values are both 1):

In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})

In [444]: limit = 2

In [445]: df['B'] = [df['A'].iloc[x] if x < limit else int(df['A'].iloc[x] == 1 and not all(y == 1 for y in df['A'].iloc[x - limit:x])) for x in range(len(df))]

In [446]: df
Out[446]: 
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
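The same "look at the preceding values" check can probably be written without a Python-level loop by using a rolling window. This is a different technique from the one in the answer above, sketched under the assumption that column A contains only 0s and 1s; window_sum and over_limit are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2

# Sum over each window of limit + 1 rows ending at the current row;
# the first `limit` rows have incomplete windows and come out as NaN
window_sum = df['A'].rolling(limit + 1).sum()
# A full window of 1s means the current row is past the limit within its run
over_limit = window_sum == limit + 1
# Replace the over-limit rows with 0 and leave everything else unchanged
df['B'] = df['A'].mask(over_limit, 0)

print(df)
```

The NaN values at the start compare as False, so the first `limit` rows are never zeroed.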

If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal NumPy array)

a = df['A'].to_numpy(copy=True)

and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do

long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]

The resulting array, in this case, gives for each element the number of 1's in the window of 3 elements ending at (and including) that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.

a[long_run_count > 2] = 0

You can now assign the resulting array to a new column in your DataFrame.

df['B'] = a
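Putting the convolution steps together on the example column gives the following sketch. It assumes a pandas version that provides Series.to_numpy, and the cutoff variable is introduced here only for illustration.

```python
import numpy
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
cutoff = 2

# Work on a copy so that column A itself is left untouched
a = df['A'].to_numpy(copy=True)
# Count the 1s in the window of cutoff + 1 elements ending at each position
long_run_count = numpy.convolve(a, numpy.ones(cutoff + 1, dtype=int))[:-cutoff]
print(long_run_count)  # [1 2 3 2 2 2 3 3 2 2]
# A full window means this element lies beyond the cutoff within its run
a[long_run_count > cutoff] = 0
df['B'] = a
print(df)
```

The count reaches 3 at positions 2, 6 and 7, which are exactly the rows set to 0 in column B.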

To turn this into a more general method:

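import numpy

def trim_runs(array, cutoff):
    a = numpy.array(array)  # copy, so the caller's data is not modified
    # count the 1s in the window of cutoff + 1 elements ending at each position
    window_count = numpy.convolve(a, numpy.ones(cutoff + 1, dtype=int))[:-cutoff]
    a[window_count > cutoff] = 0  # zero every element past the cutoff within a run
    return a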
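A short usage sketch follows, with the function restated so that the snippet runs on its own. This copy uses numpy.array to avoid modifying its input and assumes a cutoff of at least 1 and data consisting only of 0s and 1s.

```python
import numpy
import pandas as pd

def trim_runs(array, cutoff):
    a = numpy.array(array)  # copy of the input
    # number of 1s in the window of cutoff + 1 elements ending at each position
    window_count = numpy.convolve(a, numpy.ones(cutoff + 1, dtype=int))[:-cutoff]
    a[window_count > cutoff] = 0
    return a

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})

# Try a few limits; each result is stored in its own column
for cutoff in (1, 2, 3):
    df['B%d' % cutoff] = trim_runs(df['A'], cutoff)

print(df)
```

With a cutoff of 3 only row 7 changes, because it is the fourth element of the only run longer than three.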
