
How to assign a unique grouping value to each sequence of consecutive True values in a pandas boolean mask

I am trying to generate appropriate group labels for a pandas groupby.

Say I have a boolean mask like so: [False, False, True, False, True, True, False, True, True]

I would like the groupings to be like so: [0, 0, 1, 0, 2, 2, 0, 3, 3]

I can certainly create this array via a loop through the mask, but if possible I would like to use the pandas or numpy builtins for ease of use and perhaps vectorization.

(If no builtin exists, I would appreciate a more pythonic way of doing this than a straight loop with a state flag and rank counter.)
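For reference, the loop the question alludes to might look like the sketch below. The function name label_true_runs is hypothetical and not from the original post. It is mainly useful as a baseline to check the vectorized answers against.

```python
def label_true_runs(mask):
    """Label each run of consecutive True values 1, 2, 3, ...; False gets 0.

    Hypothetical helper, written only to illustrate the loop-based baseline.
    """
    labels = []
    run_id = 0       # rank counter: number of True runs seen so far
    in_run = False   # state flag: are we currently inside a True run?
    for value in mask:
        if value and not in_run:
            run_id += 1  # a new run of True values starts here
        in_run = bool(value)
        labels.append(run_id if value else 0)
    return labels


mask = [False, False, True, False, True, True, False, True, True]
print(label_true_runs(mask))  # [0, 0, 1, 0, 2, 2, 0, 3, 3]
```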

Original Answer:

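import pandas as pd
l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
# Number every run of equal values: a new run starts wherever the value changes
d = s.astype(int).diff().ne(0).cumsum()
# Send every False position to group 0
d.loc[~s] = 0
# Renumber the remaining labels consecutively: [0, 0, 1, 0, 2, 2, 0, 3, 3]
pd.factorize(d)[0].tolist()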

Slightly Modified Original Answer (works if first item is True):

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~s] = 0
# Sort so the 0 label (False) is seen first and factorize assigns it code 0
dsort = d.sort_values()
dindex = dsort.index
# Restore the original order after factorizing
pd.Series(pd.factorize(dsort)[0], index=dindex).sort_index().tolist()
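Since the question also asks about numpy builtins, here is a minimal numpy-only sketch. It is not part of the original answer and uses a different technique: mark the position where each True run starts, then take a cumulative sum of those marks. The function name label_runs_np is hypothetical. Because it never calls pd.factorize, a mask that starts with True needs no special handling.

```python
import numpy as np


def label_runs_np(mask):
    """Hypothetical helper: label runs of True as 1, 2, 3, ...; False gets 0."""
    mask = np.asarray(mask, dtype=bool)
    # previous[i] is mask[i - 1]; the first element has no predecessor, so use False
    previous = np.concatenate(([False], mask[:-1]))
    # A run starts wherever the mask is True and the previous element is not
    run_starts = mask & ~previous
    # Count the starts seen so far, then zero out the False positions
    return np.cumsum(run_starts) * mask


labels = label_runs_np([False, False, True, False, True, True, False, True, True])
print(labels.tolist())  # [0, 0, 1, 0, 2, 2, 0, 3, 3]
```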

Alternative way:

Build the list and put it into a Series.

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)

Find runs of consecutive equal values.

d = s.astype(int).diff().ne(0).cumsum().reset_index()

Locate the first True in each group.

d.loc[s].groupby(0)['index'].first().rename_axis(None)

Factorize the new grouping and put it into a Series.

f = pd.factorize(d.loc[s].groupby(0)['index'].first().rename_axis(None))
s2 = pd.Series(f[0]+1,index = f[1])

Reindex and forward fill all the gaps, fill any remaining NaN values with 0, and lastly reset every position that was False to 0.

# Series.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
s2 = s2.reindex(s.index).ffill().fillna(0).astype(int)
s2.loc[~s] = 0
s2.tolist()

OP here, see the accepted answer.

I just spent some time with it and wanted to say a word about how it works, so that it hopefully comes a bit easier to the next reader who wants to work it out.

Here is the code for reference:

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~(s)] = 0
pd.factorize(d)[0].tolist()

The key pieces are diff, followed by ne(0), followed by cumsum().

Here are the key insights.

diff: After the diff, the Series takes on the following values:

The first value is NaN (because there is no preceding value to diff it against). Thereafter every True is 1 if it follows a False, or 0 if it follows another True.

So every sequence of True values that does not start at the first element looks like this:

1, 0, 0, 0

Or, when a sequence of True values begins at the first element of the array:

NaN, 0, 0, 0

(Nearly the same logic applies to the sequences of False values, where a False that follows a True becomes -1 rather than 1, but we normalize those at the end with the d.loc[~(s)] = 0 statement.)

ne(0) normalizes NaN, 1 and -1 to True because they are all != 0.

cumsum() assigns a value (one greater than the previous run's) to the first True in each sequence of True values, and that value carries forward to all the other True values in the same sequence group (since their value from the diff call is 0).

So now we have what we want: every sequence of True values in the Series is mapped to a unique integer.
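To make the three steps concrete, the following sketch prints each intermediate result for the mask from the question. The variable names diffed, changed and run_ids are introduced here for illustration only.

```python
import pandas as pd

s = pd.Series([False, False, True, False, True, True, False, True, True])

diffed = s.astype(int).diff()  # NaN first, then +1 / -1 at each change, 0 otherwise
changed = diffed.ne(0)         # True at the start of every run (NaN != 0 as well)
run_ids = changed.cumsum()     # running count of run starts

print(diffed.tolist())   # [nan, 0.0, 1.0, -1.0, 1.0, 0.0, -1.0, 1.0, 0.0]
print(changed.tolist())  # [True, False, True, True, True, False, True, True, False]
print(run_ids.tolist())  # [1, 1, 2, 3, 4, 4, 5, 6, 6]
```

Note that at this stage the runs of False values carry their own ids (1, 3 and 5) as well, which is why the True runs end up numbered 2, 4 and 6.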

Then we call d.loc[~(s)] = 0, which assigns all the False values to group 0.

We could stop here (without making the pd.factorize call), which would output [0, 0, 2, 0, 4, 4, 0, 6, 6] for the [False, False, True, False, True, True, False, True, True] I posited in the question. To get the output to match the one I stipulated in the question, you need to call pd.factorize.
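Group keys only need to be distinct, so the unfactorized labels are already usable for the groupby the question was after. A minimal sketch follows, assuming a made-up DataFrame: the column names mask and value and the numbers in them are hypothetical.

```python
import pandas as pd

# Hypothetical data: the 'value' column exists only to have something to aggregate
df = pd.DataFrame({
    "mask": [False, False, True, False, True, True, False, True, True],
    "value": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

s = df["mask"]
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~s] = 0  # all False rows share label 0

# Aggregate only the True rows, grouped by their (unfactorized) run label
sums = df.loc[s, "value"].groupby(d[s]).sum()

print(sums.index.tolist())  # [2, 4, 6]
print(sums.tolist())        # [30, 110, 170]
```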

A word of caution about calling pd.factorize:

If the first value in the input array is True rather than False, the False values will not map to the zero group. In my use case that was reason enough to skip the call.
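The sketch below reproduces the problem with a mask that starts with True. It also shows one possible workaround: passing sort=True to pd.factorize, so that codes follow the sorted order of the labels rather than their order of first appearance. This is a suggestion, not part of the original answer.

```python
import pandas as pd

s = pd.Series([True, True, False, True])  # mask that starts with True
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~s] = 0  # d is now [1, 1, 0, 3]

plain_codes, _ = pd.factorize(d)              # codes follow order of first appearance
sorted_codes, _ = pd.factorize(d, sort=True)  # codes follow sorted label order

print(plain_codes.tolist())   # [0, 0, 1, 2]  (the False lands in group 1)
print(sorted_codes.tolist())  # [1, 1, 0, 2]  (the False is back in group 0)
```

This only helps when the mask contains at least one False. With an all-True mask there is no 0 label, so the single run still receives code 0.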

If using the skimage library is not an issue, you can do this:

import numpy as np
from skimage import measure

l = [False, False, True, False, True, True, False, True, True]
labels = measure.label(np.array(l))
# out: array([0, 0, 1, 0, 2, 2, 0, 3, 3], dtype=int32)


