How to assign unique grouping value for each sequence of consecutive True values in pandas boolean mask
I am trying to generate an appropriate grouping key for a pandas groupby.
Say I have a boolean mask like so:
[False, False, True, False, True, True, False, True, True]
I would like the groupings to be like so:
[0, 0, 1, 0, 2, 2, 0, 3, 3]
I can certainly create this array via a loop through the mask, but I would like, if possible, to use the pandas or numpy builtins for ease of use and perhaps vectorization.
(If no builtin exists, I would appreciate a more pythonic way of doing this than a straight loop with a state flag and rank counter.)
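For reference, the straight loop the question alludes to might look roughly like the sketch below. The function name label_true_runs is hypothetical and not from the original post; it is only a baseline to compare the vectorized answers against.

```python
def label_true_runs(mask):
    """Label each run of consecutive True values 1, 2, 3, ...; False gets 0."""
    labels = []
    current = 0      # rank counter: label of the most recent run
    in_run = False   # state flag: are we inside a run of True values?
    for value in mask:
        if value:
            if not in_run:
                current += 1   # a new run of True values starts here
                in_run = True
            labels.append(current)
        else:
            in_run = False
            labels.append(0)
    return labels


mask = [False, False, True, False, True, True, False, True, True]
print(label_true_runs(mask))  # [0, 0, 1, 0, 2, 2, 0, 3, 3]
```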
Original Answer:
import pandas as pd

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~(s)] = 0
pd.factorize(d)[0].tolist()
Slightly Modified Original Answer (works if first item is True):
l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~(s)] = 0
dsort = d.sort_values()
dindex = dsort.index
pd.Series(pd.factorize(dsort)[0],index = dindex).sort_index().tolist()
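To see why the modification matters, the two versions can be compared on a mask that starts with True. This is only a sketch: the wrapper names original_labels and modified_labels and the test mask are illustrative and not part of the original answer.

```python
import pandas as pd


def original_labels(mask):
    # The original answer: factorize numbers values in order of first appearance.
    s = pd.Series(mask)
    d = s.astype(int).diff().ne(0).cumsum()
    d.loc[~s] = 0
    return pd.factorize(d)[0].tolist()


def modified_labels(mask):
    # The modified answer: sort first so that 0 (the False group) is seen first.
    s = pd.Series(mask)
    d = s.astype(int).diff().ne(0).cumsum()
    d.loc[~s] = 0
    dsort = d.sort_values()
    codes = pd.factorize(dsort)[0]
    return pd.Series(codes, index=dsort.index).sort_index().tolist()


leading_true = [True, False, True, True]
print(original_labels(leading_true))  # [0, 1, 2, 2]  (False ends up in group 1)
print(modified_labels(leading_true))  # [1, 0, 2, 2]
```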
Alternative way:
Generate a list and put it into a Series.
l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
Find the runs of consecutive equal values.
d = s.astype(int).diff().ne(0).cumsum().reset_index()
Locate the first True in each group.
d.loc[s].groupby(0)['index'].first().rename_axis(None)
Factorize the new grouping and put it into a Series.
f = pd.factorize(d.loc[s].groupby(0)['index'].first().rename_axis(None))
s2 = pd.Series(f[0]+1,index = f[1])
Use reindex and forward fill all the missing spaces. Fill any NaNs with 0. Lastly, replace all places that were False with zeros.
s2 = s2.reindex(s.index).ffill().fillna(0).astype(int)
s2.loc[~(s)] = 0
s2.tolist()
OP here, see the accepted answer.
I just spent some time with it and wanted to say a word about how it works, so hopefully it will come a bit easier to the next reader who wants to work it out.
Here is the code for reference:
l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~(s)] = 0
pd.factorize(d)[0].tolist()
The key code pieces are the diff, followed by the ne(0), followed by the cumsum().
Here are the key insights.
diff: After the diff, the first value is NaN (because there is no preceding value to diff it with). Thereafter every True takes on 1 if it follows a False, or 0 if it follows another True.
So every sequence of True values, with one possible exception, will look like this:
1, 0, 0, 0
Or, when a sequence of True values begins with the first element in the array:
NaN, 0, 0, 0
(Nearly the same logic applies to the sequences of False values, except that a False following a True takes on -1 rather than 1, but we normalize those at the end with the d.loc[~(s)] = 0 statement.)
ne(0): normalizes NaN and 1 (and the -1 at the start of a False run) to True because they are all != 0. Every 0 becomes False.
cumsum(): assigns a value (greater than that of the previous sequence) to the first True in each sequence of True values, and that value carries forward to all the other True values in the same sequence group (since their value from the diff call is 0).
So now we have what we want, with every sequence of True values in the series mapped to a unique integer.
Then we make the call to d.loc[~(s)] = 0, which assigns all the False values to group 0.
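The steps above can be traced by printing each intermediate result. This is a small sketch using the mask from the question; the step_* variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([False, False, True, False, True, True, False, True, True])

step_diff = s.astype(int).diff()   # NaN first, then 1 / 0 / -1
step_ne = step_diff.ne(0)          # True wherever a new run starts
step_cumsum = step_ne.cumsum()     # running count of run starts
d = step_cumsum.copy()
d.loc[~s] = 0                      # send every False to group 0

print(step_diff.tolist())    # [nan, 0.0, 1.0, -1.0, 1.0, 0.0, -1.0, 1.0, 0.0]
print(step_ne.tolist())      # [True, False, True, True, True, False, True, True, False]
print(step_cumsum.tolist())  # [1, 1, 2, 3, 4, 4, 5, 6, 6]
print(d.tolist())            # [0, 0, 2, 0, 4, 4, 0, 6, 6]
```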
We could stop here (without making the pd.factorize call), which would output [0, 0, 2, 0, 4, 4, 0, 6, 6] for the [False, False, True, False, True, True, False, True, True] mask I posited in the question, but to get the output to match the output I stipulated in the question, you need to call pd.factorize.
A word of caution for calling pd.factorize:
If the first value in the input array is True rather than False, the False values will not map to the zero group, because pd.factorize numbers values in order of first appearance. In my use case that made it preferable to skip the call.
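One way to sidestep the pd.factorize caveat is to count run starts directly, so that False always maps to 0 and the runs are numbered 1, 2, 3, ... whether or not the mask starts with True. The following is a minimal numpy sketch, not taken from the accepted answer; the function name label_true_runs_np is hypothetical.

```python
import numpy as np


def label_true_runs_np(mask):
    a = np.asarray(mask, dtype=bool)
    # previous[i] is a[i - 1]; a False is assumed before the first element
    previous = np.concatenate(([False], a[:-1]))
    starts = a & ~previous            # True exactly where a run of True begins
    # cumsum counts the runs seen so far; np.where zeroes out the False positions
    return np.where(a, np.cumsum(starts), 0)


print(label_true_runs_np([False, False, True, False, True, True, False, True, True]).tolist())
# [0, 0, 1, 0, 2, 2, 0, 3, 3]
print(label_true_runs_np([True, False, True, True]).tolist())
# [1, 0, 2, 2]
```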
If using the skimage library is not an issue, you can do this:
import numpy as np
from skimage import measure
l = [False, False, True, False, True, True, False, True, True]
labels = measure.label(np.array(l))
out: array([0, 0, 1, 0, 2, 2, 0, 3, 3], dtype=int32)