R group_by() + rleid() equivalent in Python

Question

I have the following data frame in Python:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                         'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]), 
                                                np.repeat([1, 2, 3], [2, 2, 2])]),
                         'obj': list('AB' * 6),
                         'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})

First, within each group defined by obj, I'd like to assign an id to each unique run of the measurement_id and var columns. If any value in those columns changes, a new run starts and should be assigned a new id. So the expected run ids are:

df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 3]

Then, for each group defined by rleid_output, I'd like to check how many minutes (the min column) the run lasted, which gives me the expected_output column:

df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 3, 2, 3, 1, 3]

If it were R, I'd proceed as follows:

library(dplyr)

df <- data.frame(measurement_id = rep(1:2, each = 6),
           min = rep(rep(1:3, each = 2), 2),
           object = rep(LETTERS[1:2], 6),
           var = c(1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1))
df %>% 
  group_by(object) %>% 
  mutate(rleid = data.table::rleid(measurement_id, var)) %>% 
  group_by(object, rleid) %>% 
  mutate(expected_output = last(min) - first(min) + 1)

So the main thing I need is an equivalent of R's data.table::rleid that works with a Python pd.DataFrame.groupby clause. Any ideas on how to solve this?

Edit: a new, updated example of the data frame:

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                         'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]), 
                                                np.repeat([1, 2, 3], [2, 2, 2])]),
                         'obj': list('AB' * 6),
                         'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3]
df['expected_output'] = [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]

Answer 1

Updated answer

The problem is that, within each group of measurement_id, obj, var, the min values have to stay consecutive. We can check this by grouping on measurement_id, obj, var and testing whether the difference between consecutive min values is exactly 1 (the first row of each group is treated as 1). The number of rows in a group that pass this check becomes the duration in expected_output:

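# GroupBy.diff keeps the original index, so the result aligns with df on
# assignment (groupby.apply can prepend the group keys to the index in
# recent pandas versions).
df['grouper'] = (df.groupby(['measurement_id', 'obj', 'var'])['min']
                 .diff().fillna(1).eq(1)
                )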

df['expected_output'] = (
    df.groupby(['measurement_id', 'obj', 'var'])['grouper'].transform('sum').astype(int)
)

df = df.drop(columns='grouper')

    measurement_id  min obj  var  expected_output
0                1    1   A    1                1
1                1    1   B    2                2
2                1    2   A    2                1
3                1    2   B    2                2
4                1    3   A    1                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
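To make the grouper step easier to follow, here is a minimal sketch on a single toy group whose min values are 1 and 3, as for measurement_id 1, obj A, var 1 in the table above. The names mins, steps, flags and duration are only illustrative and are not part of the answer's code.

```python
import pandas as pd

# Toy 'min' values for one (measurement_id, obj, var) group, with a gap between 1 and 3.
mins = pd.Series([1, 3])

steps = mins.diff()            # first row has no predecessor, so it is NaN
flags = steps.fillna(1).eq(1)  # the first row counts as a step of 1; a gap of 2 fails the check
duration = int(flags.sum())    # number of rows that passed the check

print(flags.tolist(), duration)
```

Only the first row passes the check here, which is why both rows of that group get 1 in expected_output.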

Old answer, following OP's logic

We can achieve this by using GroupBy.diff to get your rleid_output, essentially an identifier that changes each time var changes within each measurement_id and obj pair.

After that, use GroupBy.nunique to count the number of minutes:

rleid_output = df.groupby(['measurement_id', 'obj'])['var'].diff().abs().bfill()
df['expected_output'] = (df.groupby(['measurement_id', 'obj', rleid_output])['min']
                         .transform('nunique'))

    measurement_id  min obj  var  expected_output
0                1    1   A    1                2
1                1    1   B    2                2
2                1    2   A    1                2
3                1    2   B    2                2
4                1    3   A    2                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
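One caveat that seems worth illustrating: the diff-based key records the size of a change rather than a running id, so two separate runs can end up with the same key. The sketch below uses a toy series (not the data from the question) and compares it with a cumulative-sum run id; diff_key and run_id are illustrative names.

```python
import pandas as pd

# Toy 'var' values inside one group: three runs of length one.
var = pd.Series([1, 2, 1])

# Key built as in the code above: absolute difference, back-filled.
diff_key = var.diff().abs().bfill()

# A run id in the spirit of rleid: count the positions where the value changes.
run_id = var.ne(var.shift()).cumsum()

print(diff_key.tolist())
print(run_id.tolist())
```

With this input the diff-based key is identical for all three rows, while the cumulative sum gives each run its own id.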

Answer 2

To mimic the behaviour of R's rleid function, one can first create an artificial column that flags whether the current value differs from the previous one. In this case, we should do this on the grouped var Series:

var_grpd = df.groupby(['measurement_id', 'obj'])['var']
df['tmp'] = (var_grpd.shift(0) != var_grpd.shift(1))

Then, we can use this artificial tmp column to obtain rleid_output2. After that, the tmp column is no longer needed.

df['rleid_output2'] = df.groupby('obj')['tmp'].cumsum().astype(int)
df.drop('tmp', axis = 1, inplace = True)
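The two steps above can also be wrapped in a small reusable function. The following is only a sketch: grouped_rleid is a hypothetical helper name, not a pandas function, and it assumes that a new run starts whenever any of the run columns differs from the previous row of the same group. It is checked against the updated example from the question.

```python
import pandas as pd

def grouped_rleid(frame, group_cols, run_cols):
    """Hypothetical helper: run-length ids of run_cols within each group of group_cols."""
    previous = frame.groupby(group_cols)[run_cols].shift()
    # A row starts a new run if any run column differs from the previous row
    # of its group; the first row of a group compares against NaN, so it always starts one.
    changed = (frame[run_cols] != previous).any(axis=1).astype(int)
    return changed.groupby([frame[c] for c in group_cols]).cumsum()

# Updated example data from the question, written out as plain lists.
df = pd.DataFrame({'measurement_id': [1] * 6 + [2] * 6,
                   'min': [1, 1, 2, 2, 3, 3] * 2,
                   'obj': list('AB' * 6),
                   'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})

df['rleid_check'] = grouped_rleid(df, ['obj'], ['measurement_id', 'var'])
print(df['rleid_check'].tolist())
```

Passing both measurement_id and var as run columns mirrors the R call data.table::rleid(measurement_id, var) inside group_by(object).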

Finally, to check how many minutes a var value lasts, we can calculate the difference between the last and the first minute within each group.

df['expected_output2'] = df.groupby(['obj', 'rleid_output2'])['min'] \
                           .transform(lambda x: x.iat[-1] - x.iat[0] + 1)

.iat is similar to .iloc, but it is meant for accessing a single value in a DataFrame or Series by integer position.
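As a small illustration of the duration step, the sketch below applies the same .iat expression to a toy frame (the run and min columns here are made up for the example) and compares it with a possible alternative that uses the built-in 'first' and 'last' aggregations instead of a lambda.

```python
import pandas as pd

# Toy data: run 1 covers minutes 1-2, run 2 covers minutes 1-3.
toy = pd.DataFrame({'run': [1, 1, 2, 2, 2],
                    'min': [1, 2, 1, 2, 3]})
mins = toy.groupby('run')['min']

# Same expression as above: last minute minus first minute, plus one.
via_iat = mins.transform(lambda x: x.iat[-1] - x.iat[0] + 1)

# Alternative sketch using built-in aggregations broadcast back to each row.
via_builtin = mins.transform('last') - mins.transform('first') + 1

print(via_iat.tolist())
print(via_builtin.tolist())
```

Both variants give the same durations on this toy input; the built-in version avoids calling a Python function once per group.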
