How to assign a unique group value to each run of consecutive True values in a pandas boolean mask

Question

I am trying to generate an appropriate pandas groupby key.

Say I have a boolean mask like so: [False, False, True, False, True, True, False, True, True]

I would like the groupings to be like so: [0, 0, 1, 0, 2, 2, 0, 3, 3]

I can certainly create this array with a loop over the mask, but if possible I would like to use pandas or NumPy builtins, for ease of use and perhaps vectorization.

(If no builtin exists, I would appreciate a more Pythonic way of doing this than a straight loop with a state flag and a rank counter.)
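For reference, the loop-based baseline described above could look roughly like the sketch below. The function name label_true_runs is made up for illustration, and the answers aim to replace this with vectorized calls.

```python
def label_true_runs(mask):
    """Label each run of consecutive True values 1, 2, 3, ...; False maps to 0."""
    labels = []
    run_id = 0        # rank counter: number of True runs seen so far
    in_run = False    # state flag: are we currently inside a True run?
    for value in mask:
        if value and not in_run:
            run_id += 1  # a new True run starts at this position
        in_run = bool(value)
        labels.append(run_id if value else 0)
    return labels


mask = [False, False, True, False, True, True, False, True, True]
print(label_true_runs(mask))  # [0, 0, 1, 0, 2, 2, 0, 3, 3]
```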

Answer 1

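Original Answer:

import pandas as pd

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
# Number every run (True or False): the counter increments wherever the value changes.
d = s.astype(int).diff().ne(0).cumsum()
# Send every False position to group 0.
d.loc[~s] = 0
# Renumber the remaining ids as 0, 1, 2, ... in order of appearance.
pd.factorize(d)[0].tolist()  # [0, 0, 1, 0, 2, 2, 0, 3, 3]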

Slightly Modified Original Answer (also works if the first item is True):

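import pandas as pd

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~s] = 0
# Sort by run id so that group 0 (the False positions) is factorized first.
dsort = d.sort_values()
dindex = dsort.index
# Factorize in sorted order, then restore the original positions.
pd.Series(pd.factorize(dsort)[0], index=dindex).sort_index().tolist()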
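A possibly shorter route to the same result is the sort parameter of pd.factorize, which numbers the codes by sorted value instead of by order of appearance. The sketch below is a variant, not the answer's own code, and the wrapper name label_runs_sorted is invented for illustration.

```python
import pandas as pd


def label_runs_sorted(mask):
    """Hypothetical helper: label True runs 1, 2, 3, ... and False positions 0."""
    s = pd.Series(mask, dtype=bool)
    d = s.astype(int).diff().ne(0).cumsum()
    d.loc[~s] = 0
    # sort=True assigns codes by sorted value, so 0 (the False group) gets
    # code 0 whenever at least one False is present.
    codes, _ = pd.factorize(d, sort=True)
    return codes.tolist()


print(label_runs_sorted([False, False, True, False, True, True, False, True, True]))
# [0, 0, 1, 0, 2, 2, 0, 3, 3]
print(label_runs_sorted([True, False, True]))
# [1, 0, 2]
```

Like the sort_values version, this assumes the mask contains at least one False. With an all-True mask there is no 0 to anchor the numbering, so the single run would be labeled 0.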

Alternative way:

Generate the list and put it into a series.

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)

Number the runs of consecutive equal values.

d = s.astype(int).diff().ne(0).cumsum().reset_index()

Locate the first True in each group.

d.loc[s].groupby(0)['index'].first().rename_axis(None)

Factorize the new grouping and put it into a series.

f = pd.factorize(d.loc[s].groupby(0)['index'].first().rename_axis(None))
s2 = pd.Series(f[0]+1,index = f[1])

Use reindex and forward fill to fill in the missing positions, then fill any remaining NaN with 0. Lastly, replace every position that was False with zero.

# .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions.
s2 = s2.reindex(s.index).ffill().fillna(0).astype(int)
s2.loc[~s] = 0
s2.tolist()
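Since the question also allows NumPy, here is a minimal sketch of a different technique from the one above: mark the position where each True run starts, then take a cumulative sum of those marks. The function name label_runs_numpy is an assumption made for this example.

```python
import numpy as np


def label_runs_numpy(mask):
    """Hypothetical helper: label True runs 1, 2, 3, ... and False positions 0."""
    mask = np.asarray(mask, dtype=bool)
    # prev[i] holds mask[i - 1]; the first element has no predecessor, so it is False.
    prev = np.zeros_like(mask)
    prev[1:] = mask[:-1]
    # A run starts wherever the current value is True and the previous one is not.
    starts = mask & ~prev
    # The running count of starts is the run id; False positions are reset to 0.
    return np.where(mask, np.cumsum(starts), 0)


labels = label_runs_numpy([False, False, True, False, True, True, False, True, True])
print(labels.tolist())  # [0, 0, 1, 0, 2, 2, 0, 3, 3]
```

Because the labels come straight from the count of run starts, no factorize step is needed, and a mask that begins with True is handled the same way as any other.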

Answer 2

OP here; see the accepted answer.

I just spent some time with it and wanted to say a word about how it works, so that it hopefully comes a bit easier to the next reader who wants to work it out.

Here is the code for reference:

import pandas as pd

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~(s)] = 0
pd.factorize(d)[0].tolist()

The key pieces of code are diff(), followed by ne(0), followed by cumsum().

Here are the key insights.

diff: after the diff, the array takes on the following values.

The first value becomes NaN (there is no preceding value to diff against). After that, every True becomes 1 if it follows a False and 0 if it follows another True.

So every run of True values, except one that starts at the first element, looks like this:

1, 0, 0, 0

Or, when a run of True values begins at the first element of the array:

NaN, 0, 0, 0

(Nearly the same logic applies to runs of False values, where the first element becomes -1 instead of 1, but we normalize those at the end with the d.loc[~(s)] = 0 statement.)

ne(0) normalizes NaN, 1 and -1 to True, because all of them are != 0, and maps 0 to False.

cumsum() then gives the first True of each run a value one greater than that of the run before it, and that value carries forward to the other True values in the same run, since their diff value of 0 adds nothing to the running total.

So now we have what we want: every run of True values in the series is mapped to a unique integer.

Then the call d.loc[~(s)] = 0 assigns all the False values to group 0.
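To make the steps above concrete, here is a small sketch that prints each intermediate result for the mask from the question. The step_* variable names are made up for this illustration, and where(s, 0) is used as a non-mutating stand-in for the d.loc[~(s)] = 0 assignment.

```python
import pandas as pd

s = pd.Series([False, False, True, False, True, True, False, True, True])

step_diff = s.astype(int).diff()       # change relative to the previous element
step_ne = step_diff.ne(0)              # True wherever a new run starts
step_cumsum = step_ne.cumsum()         # running count of run starts = run id
step_zeroed = step_cumsum.where(s, 0)  # keep ids for True positions, 0 elsewhere

print(step_diff.tolist())    # [nan, 0.0, 1.0, -1.0, 1.0, 0.0, -1.0, 1.0, 0.0]
print(step_ne.tolist())      # [True, False, True, True, True, False, True, True, False]
print(step_cumsum.tolist())  # [1, 1, 2, 3, 4, 4, 5, 6, 6]
print(step_zeroed.tolist())  # [0, 0, 2, 0, 4, 4, 0, 6, 6]
```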

We could stop here, without calling pd.factorize. For the [False, False, True, False, True, True, False, True, True] mask I posited in the question, that would output [0, 0, 2, 0, 4, 4, 0, 6, 6]. To make the output match what I stipulated in the question, you need to call pd.factorize.

A word of caution about calling pd.factorize:

If the first value in the input array is True rather than False, the False values will not map to the zero group. In my use case, that was a reason to skip the call.
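The caveat can be seen on a short mask that starts with True. This is a minimal sketch using a made-up four-element input, not data from the question.

```python
import pandas as pd

s = pd.Series([True, False, True, True])
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~s] = 0

print(d.tolist())                   # [1, 0, 3, 3]
print(pd.factorize(d)[0].tolist())  # [0, 1, 2, 2]
```

Before factorize, the False position is correctly in group 0. After it, the first True run takes code 0 because factorize numbers values in order of first appearance, and the False position ends up with code 1.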

Answer 3

If using the skimage library is not an issue, you can do this:

import numpy as np
from skimage import measure

l = [False, False, True, False, True, True, False, True, True]
labels = measure.label(np.array(l))

out: array([0, 0, 1, 0, 2, 2, 0, 3, 3], dtype=int32)

Answer 1: score 2, accepted, 2021-02-10 22:37:37
Answer 2: score 1, 2021-02-11 00:57:59
Answer 3: score 0, 2022-09-03 20:00:45
