Conditional length of a binary data series in Pandas

Question

I have a DataFrame with the following column:

df['A'] = [1,1,1,0,1,1,1,1,0,1]

What would be the best vectorized way to cap the length of each run of 1s at some limiting value? Say the limit is 2; then the resulting column 'B' must look like:

df['B'] = [1,1,0,0,1,1,0,0,0,1]

Answer 1

One fully vectorized solution is to use the shift - groupby - cumsum - cumcount combination ¹ to mark the positions that fall within the first 2 elements of each consecutive run (or whatever limiting value you like). Then AND (&) this new boolean Series with the original column:

df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
          .astype(int) # cast the boolean Series back to integers

This produces the new column in the DataFrame:

   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1

¹ See the pandas cookbook, the section on grouping, "Grouping like Python's itertools.groupby".
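To make the one-liner easier to follow, here is a minimal sketch that splits it into named steps. The intermediate names (run_start, run_id, pos_in_run) and the limit variable are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2  # assumed name for the limiting value

# True wherever a value differs from its predecessor, i.e. a new run starts
run_start = df['A'] != df['A'].shift()
# The running total of run starts gives every run its own id
run_id = run_start.cumsum()
# Position of each row inside its run: 0, 1, 2, ...
pos_in_run = df.groupby(run_id).cumcount()
# Keep a 1 only while its position in the run is below the limit
df['B'] = ((pos_in_run < limit) & (df['A'] == 1)).astype(int)

print(run_id.tolist())      # [1, 1, 1, 2, 3, 3, 3, 3, 4, 5]
print(pos_in_run.tolist())  # [0, 1, 2, 0, 0, 1, 2, 3, 0, 0]
print(df['B'].tolist())     # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```

Writing the comparison as pos_in_run < limit is equivalent to cumcount() <= 1 when the limit is 2, and it lets the limit be changed in one place.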

Answer 2

Another way (checking whether the previous two values are 1):

In [442]: import pandas as pd

In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})

In [444]: limit = 2

In [445]: df['B'] = [int(df['A'][x] == 1 and (x < limit or not all(y == 1 for y in df['A'][x - limit:x]))) for x in range(len(df))]  # keep a 1 only if the previous `limit` values are not all 1

In [446]: df
Out[446]: 
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
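The same "are the previous values all 1" check can probably be written without a Python-level loop over rows. The following is only a sketch of that idea using Series.shift; the names is_one and prev_all_ones are assumptions introduced for illustration, and the loop runs over the limit, not over the rows.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2

is_one = df['A'] == 1
# Becomes True only where each of the `limit` preceding rows holds a 1
prev_all_ones = pd.Series(True, index=df.index)
for k in range(1, limit + 1):
    # Rows with fewer than k predecessors are filled with False
    prev_all_ones = prev_all_ones & is_one.shift(k, fill_value=False)

# A 1 survives unless the `limit` rows before it are all 1
df['B'] = (is_one & ~prev_all_ones).astype(int)
print(df['B'].tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```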

Answer 3

If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object; it can just be a normal NumPy array):

import numpy

# as_matrix() is gone from recent pandas; copy=True keeps df['A'] unmodified
a = df['A'].to_numpy(copy=True)

and convolve it with a sequence of 1s that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do

long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]

The resulting array, in this case, gives the number of 1s in the 3-element window that ends at each element (that element plus the two before it). If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.

a[long_run_count > 2] = 0

You can now assign the resulting array to a new column in your DataFrame.

df['B'] = a
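For the sample column from the question, the intermediate values of the convolution can be traced as follows. This is a sketch that assumes a pandas version providing Series.to_numpy; the name full is introduced here only for illustration.

```python
import numpy
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
a = df['A'].to_numpy(copy=True)  # work on a copy of the column

# The default 'full' convolution has len(a) + 3 - 1 entries
full = numpy.convolve(a, [1, 1, 1])
# Dropping the last 2 leaves, for each i, a[i] + a[i-1] + a[i-2]
long_run_count = full[:-2]

print(len(full))                # 12
print(long_run_count.tolist())  # [1, 2, 3, 2, 2, 2, 3, 3, 2, 2]

a[long_run_count > 2] = 0       # zero the elements past length 2 in a run
print(a.tolist())               # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```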

To turn this into a more general method:

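def trim_runs(array, cutoff):
    # numpy.array copies its input, so the caller's data is not modified
    a = numpy.array(array)
    # slicing with [:len(a)] also works when cutoff is 0 (unlike [:-cutoff])
    window_sum = numpy.convolve(a, numpy.ones(cutoff + 1, dtype=int))[:len(a)]
    a[window_sum > cutoff] = 0
    return a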
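A possible usage example on the question's data with two different cutoffs. The function is repeated here so the snippet runs on its own, in a variant that copies its input explicitly; the column name 'C' is an assumption for illustration.

```python
import numpy
import pandas as pd

def trim_runs(array, cutoff):
    # Variant that copies its input, so the original column stays intact
    a = numpy.array(array)
    # Number of 1s in the (cutoff + 1)-element window ending at each position
    window_sum = numpy.convolve(a, numpy.ones(cutoff + 1, dtype=int))[:len(a)]
    a[window_sum > cutoff] = 0
    return a

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = trim_runs(df['A'], 2)
df['C'] = trim_runs(df['A'], 3)

print(df['B'].tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
print(df['C'].tolist())  # [1, 1, 1, 0, 1, 1, 1, 0, 0, 1]
```

Like the rest of this answer, the sketch relies on the values being only 0 or 1; for other values the window sum no longer counts run length.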

3 answers

Answer 1: score 3, accepted, 2016-08-28 09:28:21

Answer 2: score 2, 2016-08-28 09:30:03

Answer 3: score 2, 2016-08-28 09:45:33
