R group_by() + rleid() equivalent in Python
I have the following data frame in Python:
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                             'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]),
                                                    np.repeat([1, 2, 3], [2, 2, 2])]),
                             'obj': list('AB' * 6),
                             'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})
First, within each group defined by obj, I'd like to assign an id to each unique run of the measurement_id and var columns. If any value in those columns changes, a new run starts and should be assigned a new id. So the expected ids are:
df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 3]
Then, for each group defined by rleid_output, I'd like to check how many minutes (the min column) the run lasted, which gives the expected_output column:
df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 3, 2, 3, 1, 3]
In R, I would proceed as follows:
library(dplyr)

df <- data.frame(measurement_id = rep(1:2, each = 6),
                 min = rep(rep(1:3, each = 2), 2),
                 object = rep(LETTERS[1:2], 6),
                 var = c(1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1))
df %>%
  group_by(object) %>%
  mutate(rleid = data.table::rleid(measurement_id, var)) %>%
  group_by(object, rleid) %>%
  mutate(expected_output = last(min) - first(min) + 1)
So the main thing I need is an equivalent of R's data.table::rleid that works with a pandas pd.DataFrame.groupby clause. Any ideas how to solve this?
@Edit: a new, updated example of the data frame:
df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                             'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]),
                                                    np.repeat([1, 2, 3], [2, 2, 2])]),
                             'obj': list('AB' * 6),
                             'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3]
df['expected_output'] = [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]
Updated answer
The problem is that the min column must stay consecutive within each group of measurement_id, obj, var. We can check this by grouping on measurement_id, obj, var and then testing whether the difference between successive min values equals 1; a difference greater than 1 means the run was interrupted. Rows that pass the test are counted as the duration in expected_output:
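# Flag rows whose min follows the previous min of the same group by exactly 1.
# transform keeps the original index, so the result aligns with df.
df['grouper'] = (df.groupby(['measurement_id', 'obj', 'var'])['min']
                   .transform(lambda x: x.diff().fillna(1).eq(1))
                )
# Count the flagged rows per group.
df['expected_output'] = (
    df.groupby(['measurement_id', 'obj', 'var'])['grouper'].transform('sum').astype(int)
)
df = df.drop(columns='grouper')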
    measurement_id  min obj  var  expected_output
0                1    1   A    1                1
1                1    1   B    2                2
2                1    2   A    2                1
3                1    2   B    2                2
4                1    3   A    1                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
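To see what the consecutive-minute test does, here is a minimal sketch on a single hypothetical group whose min values are 1 and 3 (as for measurement_id 1, obj A, var 1 above). The variable names are illustrative only and not part of the answer.

```python
import pandas as pd

# Hypothetical single group: minutes 1 and 3 (minute 2 is missing).
minutes = pd.Series([1, 3])

# diff() is NaN for the first row, then the gap to the previous row.
gaps = minutes.diff()

# Treat the first row as consecutive, then flag gaps of exactly 1.
is_consecutive = gaps.fillna(1).eq(1)

# Summing the boolean flags counts the rows that pass the test.
duration = int(is_consecutive.sum())
print(is_consecutive.tolist(), duration)
```

Only the first row is flagged here, so the group is assigned a duration of 1, which agrees with rows 0 and 4 of the table.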
Old answer, following OP's logic
We can achieve this by using GroupBy.diff to get your rleid_output, which is basically an identifier that changes each time var changes within each measurement_id & obj. After that, we use GroupBy.nunique to measure the number of minutes:
rleid_output = df.groupby(['measurement_id', 'obj'])['var'].diff().abs().bfill()
df['expected_output'] = (df.groupby(['measurement_id', 'obj', rleid_output])['min']
                           .transform('nunique'))
    measurement_id  min obj  var  expected_output
0                1    1   A    1                2
1                1    1   B    2                2
2                1    2   A    1                2
3                1    2   B    2                2
4                1    3   A    2                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
To mimic the behaviour of R's rleid function, one can first create a helper column that checks whether the current value has changed compared to the previous one. In this case, we should do this on the grouped var Series:
var_grpd = df.groupby(['measurement_id', 'obj'])['var']
df['tmp'] = (var_grpd.shift(0) != var_grpd.shift(1))
Then, we can use this helper tmp column to obtain rleid_output2 by taking its cumulative sum within each obj. After that, the tmp column is no longer needed.
df['rleid_output2'] = df.groupby('obj')['tmp'].cumsum().astype(int)
df.drop('tmp', axis = 1, inplace = True)
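The two steps above can be wrapped into one reusable function. The following is only a sketch: the name rleid_within and its parameters are hypothetical, and it assumes the run columns contain no missing values (NaN always compares as "changed"). It is checked against the rleid_output column of the updated example in the question.

```python
import pandas as pd

def rleid_within(df, group_cols, run_cols):
    """Hypothetical helper: run-length ids of run_cols, restarted per group."""
    # Previous row's values within the same group (NaN for each group's first row).
    prev = df.groupby(group_cols)[run_cols].shift()
    # A row starts a new run if any run column differs from the previous row;
    # the first row of each group compares against NaN and always starts a run.
    changed = (df[run_cols] != prev).any(axis=1)
    # Counting run starts cumulatively within each group gives the run id.
    return changed.astype(int).groupby([df[c] for c in group_cols]).cumsum()

# Updated example data from the question, written out as plain lists.
df = pd.DataFrame({
    'measurement_id': [1] * 6 + [2] * 6,
    'min': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    'obj': list('AB' * 6),
    'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1],
})

df['rleid_output2'] = rleid_within(df, ['obj'], ['measurement_id', 'var'])
print(df['rleid_output2'].tolist())
```

Comparing all run columns at once means a change in either measurement_id or var starts a new run, which is how rleid(measurement_id, var) is used in the R code.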
Finally, to check how many minutes each run of var values lasts, we can calculate the difference between the last and first minute within each group and add 1.
df['expected_output2'] = df.groupby(['obj', 'rleid_output2'])['min'] \
                           .transform(lambda x: x.iat[-1] - x.iat[0] + 1)
.iat is similar to .iloc, but it accesses a single value in a DataFrame or Series by integer position.
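As a possible alternative to the lambda, the same duration can be computed with the built-in 'first' and 'last' group aggregations. This is a sketch rather than part of the original answer; it reuses the rleid_output values from the updated example in the question, and the duration column name is an assumption made for illustration.

```python
import pandas as pd

# min, obj and rleid_output from the updated example in the question.
df = pd.DataFrame({
    'min': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    'obj': list('AB' * 6),
    'rleid_output': [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3],
})

runs = df.groupby(['obj', 'rleid_output'])['min']

# Last minute minus first minute of each run, plus 1 to count both ends.
df['duration'] = runs.transform('last') - runs.transform('first') + 1
print(df['duration'].tolist())
```

The result matches the expected_output column given in the question for the updated data.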