R group_by() + rleid() equivalent in Python

I have the following data frame in Python:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                         'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]), 
                                                np.repeat([1, 2, 3], [2, 2, 2])]),
                         'obj': list('AB' * 6),
                         'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})

First, within each group defined by obj, I'd like to assign an id to each run of identical values in the measurement_id and var columns. Whenever the value of either column changes, a new run starts and should get a new id. So the expected result is:

df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 3]

Then, for each group defined by obj and rleid_output, I'd like to check how many minutes (the min column) the run lasted, which gives me the expected_output column:

df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 3, 2, 3, 1, 3]

If it was R, I'd proceed as follows:

library(dplyr)

df <- data.frame(measurement_id = rep(1:2, each = 6),
           min = rep(rep(1:3, each = 2), 2),
           object = rep(LETTERS[1:2], 6),
           var = c(1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1))
df %>% 
  group_by(object) %>% 
  mutate(rleid = data.table::rleid(measurement_id, var)) %>% 
  group_by(object, rleid) %>% 
  mutate(expected_output = last(min) - first(min) + 1) 

So the main thing I need is an equivalent of R's data.table::rleid that works together with pandas' DataFrame.groupby. Any ideas how to solve this?

Edit: a new, updated example data frame:

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                         'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]), 
                                                np.repeat([1, 2, 3], [2, 2, 2])]),
                         'obj': list('AB' * 6),
                         'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3]
df['expected_output'] = [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]

Updated answer

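The catch is that, within each group of measurement_id, obj and var, the values in the min column must be consecutive for the rows to count as one continuous run. We can check this by grouping on measurement_id, obj and var and testing whether each min value is exactly 1 greater than the previous one in its group (the first row of a group counts as a step of 1). The number of rows that pass this test in a group is then used as that group's expected_output: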

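# group_keys=False keeps the original row index, so the result can be assigned back
df['grouper'] = (df.groupby(['measurement_id', 'obj', 'var'], group_keys=False)['min']
                 .apply(lambda x: x.diff().fillna(1).eq(1))
                )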

df['expected_output'] = (
    df.groupby(['measurement_id', 'obj', 'var'])['grouper'].transform('sum').astype(int)
)

df = df.drop(columns='grouper')

    measurement_id  min obj  var  expected_output
0                1    1   A    1                1
1                1    1   B    2                2
2                1    2   A    2                1
3                1    2   B    2                2
4                1    3   A    1                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
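
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same computation on the updated example from the question. It is a variant, not the exact code above: it assumes GroupBy.diff can stand in for the apply call, and it skips the temporary grouper column. The names keys, step_ok and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Updated example data frame from the question.
df = pd.DataFrame({
    'measurement_id': np.repeat([1, 2], 6),
    'min': np.tile(np.repeat([1, 2, 3], 2), 2),
    'obj': list('AB' * 6),
    'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1],
})

keys = ['measurement_id', 'obj', 'var']

# 1 where `min` is exactly one more than the previous row of the same group;
# the first row of each group has no predecessor and is counted as 1 too.
step_ok = df.groupby(keys)['min'].diff().fillna(1).eq(1).astype(int)

# Number of such rows per group, broadcast back to every row of the group.
result = step_ok.groupby([df[k] for k in keys]).transform('sum')

print(result.tolist())
# [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]
```

The printed list agrees with the expected_output column given in the question's edit. Note that the count is taken per (measurement_id, obj, var) group as a whole, so this sketch inherits that assumption from the answer.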

Old answer, following OP's logic

We can get your rleid_output with GroupBy.diff: it yields a flag that marks where var changes within each measurement_id & obj group.

After that, we use GroupBy.nunique to count the number of distinct minutes in each run:

rleid_output = df.groupby(['measurement_id', 'obj'])['var'].diff().abs().bfill()
df['expected_output'] = (df.groupby(['measurement_id', 'obj', rleid_output])['min']
                         .transform('nunique'))

    measurement_id  min obj  var  expected_output
0                1    1   A    1                2
1                1    1   B    2                2
2                1    2   A    1                2
3                1    2   B    2                2
4                1    3   A    2                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
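
To make the intermediate step visible, here is a small self-contained sketch on the original example from the question that prints the grouping key before it is used. The variable names change_flag and minutes are only illustrative.

```python
import numpy as np
import pandas as pd

# Original example data frame from the question.
df = pd.DataFrame({
    'measurement_id': np.repeat([1, 2], 6),
    'min': np.tile(np.repeat([1, 2, 3], 2), 2),
    'obj': list('AB' * 6),
    'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1],
})

# Absolute change of `var` relative to the previous row of the same
# (measurement_id, obj) group; the NaN on each group's first row is
# back-filled from the rows that follow.
change_flag = df.groupby(['measurement_id', 'obj'])['var'].diff().abs().bfill()
print(change_flag.tolist())
# [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

# Distinct minutes per (measurement_id, obj, change_flag) group.
minutes = (df.groupby(['measurement_id', 'obj', change_flag])['min']
             .transform('nunique'))
print(minutes.tolist())
# [2, 2, 2, 2, 1, 1, 2, 3, 2, 3, 1, 3]
```

As the first printout shows, the key records whether var changed rather than numbering the runs. For this data that is enough to separate them, but a group in which var changes more than once (as in the updated example) would need a real run id.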

To mimic the behaviour of R's rleid function, first create a helper column that flags whether the current value differs from the previous one. Here we do this on the var Series, grouped by measurement_id and obj:

var_grpd = df.groupby(['measurement_id', 'obj'])['var']
df['tmp'] = (var_grpd.shift(0) != var_grpd.shift(1))

Then we take the cumulative sum of this helper tmp column within each obj group to obtain rleid_output2. After that, the tmp column is no longer needed.

df['rleid_output2'] = df.groupby('obj')['tmp'].cumsum().astype(int)
df.drop('tmp', axis = 1, inplace = True)

Finally, to check how many minutes each run of var lasts, we calculate the difference between the last and first minute within each group and add 1.

df['expected_output2'] = df.groupby(['obj', 'rleid_output2'])['min'] \
                           .transform(lambda x: x.iat[-1] - x.iat[0] + 1)

.iat is similar to .iloc, but it accesses a single value of a DataFrame or Series by integer position.
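
The steps above can also be wrapped into one reusable function. The following is a minimal sketch, not a pandas API: rleid_within and make_df are hypothetical helper names, and the sketch assumes the run columns contain no NaN (a NaN never equals its predecessor, so it would always start a new run). It compares all run columns at once, which is closer to how data.table::rleid(measurement_id, var) is called in the R code.

```python
import numpy as np
import pandas as pd


def rleid_within(df, group_col, run_cols):
    """Run id over `run_cols`, restarting at 1 inside each `group_col` group.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Previous row's values within the same group (NaN on each group's first row).
    previous = df.groupby(group_col)[run_cols].shift()
    # A run starts wherever any run column differs from the previous row.
    starts_run = df[run_cols].ne(previous).any(axis=1)
    # Counting the run starts cumulatively per group yields the run id.
    return starts_run.astype(int).groupby(df[group_col]).cumsum()


def make_df(var):
    """Build the question's data frame for a given `var` column."""
    return pd.DataFrame({
        'measurement_id': np.repeat([1, 2], 6),
        'min': np.tile(np.repeat([1, 2, 3], 2), 2),
        'obj': list('AB' * 6),
        'var': var,
    })


# Updated example from the question.
df = make_df([1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1])
df['rleid'] = rleid_within(df, 'obj', ['measurement_id', 'var'])

# Duration of each run: last minute - first minute + 1.
run_minutes = df.groupby(['obj', 'rleid'])['min']
df['duration'] = run_minutes.transform('last') - run_minutes.transform('first') + 1

print(df['rleid'].tolist())     # [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3]
print(df['duration'].tolist())  # [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]
```

Both printed lists agree with the rleid_output and expected_output columns from the question's edit. The duration assumes the rows are already sorted by min within each obj, as they are in the example.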
