
How to filter pandas series values based on a condition

I have a pandas Series, pd.Series([-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1]). How can I convert it into pd.Series([-1, 0, 0, 0, -5, -5, 0, 0, 0, -1])?

The filter condition is: if -1 occurs three or more times in a streak, keep the first occurrence and discard the rest.

Since the first streak of -1s has length 3, we keep one -1 and discard the rest. After the first 3 values, the streak breaks (since the value is now 0). Similarly, the last streak of -1s has length 4, so we keep one -1 and discard the rest.

The filter only applies to -1; -5 should be left as is.

Thanks

PS: I thought about groupby, but I don't think it honors streaks the way I described above.
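
For reference, the rule above can be pinned down with a plain-Python sketch using itertools.groupby (not the pandas groupby mentioned in the PS). The function name and its parameters are hypothetical, introduced only for illustration:

```python
from itertools import groupby

def collapse_long_streaks(values, target=-1, min_len=3):
    """Keep one `target` per streak of at least `min_len`; leave everything else alone."""
    out = []
    # itertools.groupby yields one group per run of consecutive equal values
    for key, run in groupby(values):
        run = list(run)
        if key == target and len(run) >= min_len:
            out.append(key)      # long streak of the target: keep only the first
        else:
            out.extend(run)      # short streak or other value: keep as is
    return out

data = [-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1]
print(collapse_long_streaks(data))
# [-1, 0, 0, 0, -5, -5, 0, 0, 0, -1]
```

This is only a specification of the expected behavior, not a vectorized solution.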

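With a conditional mask (note that this drops every -1 that repeats the previous value, so a streak of only two -1s is collapsed as well):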

In [43]: s = pd.Series([-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1 , -1])                                         

In [44]: m = (s.diff() == 0) & (s.eq(-1))                                                                               

In [45]: s[~m]                                                                                                          
Out[45]: 
0    -1
3     0
4     0
5     0
6    -5
7    -5
8     0
9     0
10    0
11   -1
dtype: int64
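
The mask above does not look at streak length. A small sketch on a made-up series s2 shows the difference, followed by one possible threshold-aware variant. The helper name collapse_streaks and its parameters are assumptions for illustration, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample with a streak of two and a streak of three
s2 = pd.Series([0, -1, -1, 0, -1, -1, -1])

# Mask from the answer above: drops every repeated -1
m = (s2.diff() == 0) & s2.eq(-1)
print(s2[~m].tolist())
# [0, -1, 0, -1]

def collapse_streaks(s, value=-1, min_len=3):
    """Drop all but the first `value` in runs of at least `min_len`."""
    first_in_run = s.ne(s.shift())                 # True where a new run starts
    run_id = first_in_run.cumsum()                 # label for each run
    run_len = s.groupby(run_id).transform("size")  # length of the run each row is in
    drop = s.eq(value) & run_len.ge(min_len) & ~first_in_run
    return s[~drop]

print(collapse_streaks(s2).tolist())
# [0, -1, -1, 0, -1]
```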

With some SciPy tools -

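import numpy as np
# The scipy.ndimage.morphology namespace is deprecated; import from scipy.ndimage
from scipy.ndimage import binary_opening, binary_erosion

def keep_first_neg1s(s, W=3):
    k1 = np.ones(W,dtype=bool)  # opening kernel: keeps only runs of at least W
    k2 = np.ones(2,dtype=bool)  # erosion kernel: strips the first element of each kept run
    m = s==-1
    return s[~binary_erosion(binary_opening(m,k1),k2) | ~m]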

A simpler one, and hopefully more performant too -

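def keep_first_neg1s_v2(s, W=3):
    # m1 marks positions that belong to a run of at least W consecutive -1s
    m1 = binary_opening(s.values==-1, np.ones(W,dtype=bool))
    # Drop a position only if it and its predecessor are both in such a run
    return s[~(m1 & np.r_[False,m1[:-1]])]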

Runs on the given sample s -

# Using .tolist() simply for better visualization
In [47]: s.tolist()
Out[47]: [-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1]

In [48]: keep_first_neg1s(s,W=3).tolist()
Out[48]: [-1, 0, 0, 0, -5, -5, 0, 0, 0, -1]

In [49]: keep_first_neg1s(s,W=4).tolist()
Out[49]: [-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1]
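
To see what binary_opening contributes here, a minimal sketch on a made-up boolean array m (an assumption for illustration): opening with a length-W kernel of ones keeps only the runs of True that are at least W long.

```python
import numpy as np
from scipy.ndimage import binary_opening

# Hypothetical mask with runs of True of lengths 3, 2 and 4
m = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1], dtype=bool)

# Opening with a length-3 kernel erases the run of length 2 and keeps the others
opened = binary_opening(m, structure=np.ones(3, dtype=bool))
print(opened.astype(int).tolist())
# [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
```

The functions above then only need to drop every position of such a surviving run except its first.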

IIUC, pandas masking and groupby:

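def remove_streaks(s, T):
  '''s is the input Series, T is the threshold
  '''

  # Each streak of -1 forms one group; every other value is a group of its own
  g = s.groupby(s.diff().ne(0).cumsum() + s.ne(-1).cumsum())
  # Rows in groups shorter than T get distinct keys; long streaks share one key
  mask = g.transform('size').lt(T).cumsum() + s.diff().ne(0).cumsum()

  return s.groupby(mask).first()

>>> remove_streaks(s, 4).tolist()
[-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1]

>>> remove_streaks(s, 3).tolist()
[-1, 0, 0, 0, -5, -5, 0, 0, 0, -1]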
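
Note that groupby(mask).first() returns a Series indexed by the group keys, not by the original positions. If the original index matters, one possible variant (a sketch; the name remove_streaks_keep_index is hypothetical) selects the first row of each key with duplicated instead:

```python
import pandas as pd

def remove_streaks_keep_index(s, T):
    """Same grouping keys as above, but rows are selected so the index survives."""
    runs = s.diff().ne(0).cumsum()                 # label for each run of equal values
    g = s.groupby(runs + s.ne(-1).cumsum())        # -1 streaks grouped, others singletons
    mask = g.transform("size").lt(T).cumsum() + runs
    # mask never decreases, so duplicated() flags exactly the repeats inside long streaks
    return s[~mask.duplicated()]

s = pd.Series([-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1])
result = remove_streaks_keep_index(s, 3)
print(result.index.tolist())
# [0, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```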

Create a boolean mask m to identify the positions where the value changes. Group s on m.cumsum() and use transform to flag the groups that contain fewer than three -1 values; assign the result to mask m1. Take the boolean OR of m and m1 and apply cumsum, so that only the rows of groups with three or more -1 values share the same number. Finally, use duplicated to slice.

m = s.diff().ne(0)
m1 = s.groupby(m.cumsum()).transform(lambda x: x.eq(-1).sum() < 3)
m2 = ~((m | m1).cumsum().duplicated())
s[m2]

Step by step:
I modified your sample to include a case where -1 appears in only 2 consecutive rows, which we should keep.

s
Out[148]:
0    -1
1    -1
2    -1
3     0
4    -1
5    -1
6     0
7     0
8    -5
9    -5
10    0
11    0
12    0
13   -1
14   -1
15   -1
16   -1
dtype: int64

m = s.diff().ne(0)

Out[150]:
0      True
1     False
2     False
3      True
4      True
5     False
6      True
7     False
8      True
9     False
10     True
11    False
12    False
13     True
14    False
15    False
16    False
dtype: bool

m1 = s.groupby(m.cumsum()).transform(lambda x: x.eq(-1).sum() < 3)

Out[152]:
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13    False
14    False
15    False
16    False
dtype: bool

m2 = ~((m | m1).cumsum().duplicated())

Out[159]:
0      True
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14    False
15    False
16    False
dtype: bool

In [168]: s[m2]
Out[168]:
0    -1
3     0
4    -1
5    -1
6     0
7     0
8    -5
9    -5
10    0
11    0
12    0
13   -1
dtype: int64
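
The three steps can be wrapped into a reusable function. This is a sketch under assumptions: the function name and the value and min_len parameters are hypothetical, and the lambda is swapped for an equivalent transform("sum") on the boolean s.eq(value):

```python
import pandas as pd

def keep_first_of_long_streaks(s, value=-1, min_len=3):
    m = s.diff().ne(0)                                       # True where the value changes
    # True for rows whose run holds fewer than min_len occurrences of value
    m1 = s.eq(value).groupby(m.cumsum()).transform("sum").lt(min_len)
    # Rows of a long streak share one cumsum number; keep only the first of each
    m2 = ~((m | m1).cumsum().duplicated())
    return s[m2]

# The modified sample from the walkthrough above
s = pd.Series([-1, -1, -1, 0, -1, -1, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1])
print(keep_first_of_long_streaks(s).tolist())
# [-1, 0, -1, -1, 0, 0, -5, -5, 0, 0, 0, -1]
```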


 