Python pandas equivalent to R's group_by, mutate, and ifelse
Probably a duplicate, but I have spent too much time googling this without any luck.
Assume I have a data frame:
import pandas as pd
data = {"letters": ["a", "a", "a", "b", "b", "b"],
"boolean": [True, True, True, True, True, False],
"numbers": [1, 2, 3, 1, 2, 3]}
df = pd.DataFrame(data)
df
I want to 1) group by letters, and 2) take the mean of numbers if all values in boolean have the same value.
In R I would write:
library(dplyr)
df %>%
group_by(letters) %>%
mutate(
condition = n_distinct(boolean) == 1,
numbers = ifelse(condition, mean(numbers), numbers)
) %>%
select(-condition)
This would result in the following output:
# A tibble: 6 x 3
# Groups: letters [2]
letters boolean numbers
<chr> <lgl> <dbl>
1 a TRUE 2
2 a TRUE 2
3 a TRUE 2
4 b TRUE 1
5 b TRUE 2
6 b FALSE 3
How would you do it using Python pandas?
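We can use a lazy groupby and transform. Note that transform('all') checks that every boolean in the group is True, which is enough for this data (a group that is all False would not be averaged):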
g = df.groupby('letters')
df.loc[g['boolean'].transform('all'), 'numbers'] = g['numbers'].transform('mean')
Output:
letters boolean numbers
0 a True 2
1 a True 2
2 a True 2
3 b True 1
4 b True 2
5 b False 3
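To make the difference between transform('all') and the R condition n_distinct(boolean) == 1 concrete, here is a minimal sketch. The extra group "c" (all False) is hypothetical data that is not part of the question, and the numbers column is created as float so that assigning means does not change its dtype.

```python
import pandas as pd

# Hypothetical data: group "c" is all False, a case the question's example does not cover.
df = pd.DataFrame({
    "letters": ["a", "a", "b", "b", "c", "c"],
    "boolean": [True, True, True, False, False, False],
    "numbers": [1.0, 3.0, 1.0, 3.0, 1.0, 3.0],
})

g = df.groupby("letters")

# True on every row of a group only when all of its values are True.
all_true = g["boolean"].transform("all")

# True on every row of a group that holds exactly one distinct value.
one_value = g["boolean"].transform("nunique") == 1

# Replace numbers with the group mean only where the group is uniform.
result = df.copy()
result.loc[one_value, "numbers"] = g["numbers"].transform("mean")
print(result)
```

With this data the two masks disagree on group "c": all_true leaves it untouched, while one_value averages it, which is what the dplyr code would do.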
Another way would be to use np.where: where a group has exactly one unique value, take the mean; where it doesn't, keep the numbers.
Code below:
import numpy as np
df['numbers'] = np.where(df.groupby('letters')['boolean'].transform('nunique') == 1, df.groupby('letters')['numbers'].transform('mean'), df['numbers'])
letters boolean numbers
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
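The same np.where idea can be wrapped in a small helper so the groupby is built only once and the original frame is left unmodified. This is a sketch only: the function name mean_if_uniform and its parameters are made up for illustration and are not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd


def mean_if_uniform(df, key, flag, value):
    """Return a copy of df where `value` is replaced by its group mean
    in every `key` group whose `flag` column has one distinct value."""
    g = df.groupby(key)
    uniform = g[flag].transform("nunique") == 1
    group_mean = g[value].transform("mean")
    # np.where picks the group mean where uniform is True, else the original value.
    return df.assign(**{value: np.where(uniform, group_mean, df[value])})


df = pd.DataFrame({
    "letters": ["a", "a", "a", "b", "b", "b"],
    "boolean": [True, True, True, True, True, False],
    "numbers": [1, 2, 3, 1, 2, 3],
})

out = mean_if_uniform(df, "letters", "boolean", "numbers")
print(out)
```

Returning a new frame via assign is closer to how mutate behaves in dplyr, since df itself keeps its original integer column.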
Alternatively, mask out the rows where the condition does not apply as you compute the mean.
m=df.groupby('letters')['boolean'].transform('nunique')==1
df.loc[m, 'numbers']=df[m].groupby('letters')['numbers'].transform('mean')
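A related variant, offered as a sketch rather than as part of the answer above, uses Series.mask instead of an in-place .loc assignment. Series.mask replaces values where the condition is True with the aligned values from another Series. The cast to float is an assumption made here so that the result dtype is explicit.

```python
import pandas as pd

df = pd.DataFrame({
    "letters": ["a", "a", "a", "b", "b", "b"],
    "boolean": [True, True, True, True, True, False],
    "numbers": [1, 2, 3, 1, 2, 3],
})

g = df.groupby("letters")

# True on rows whose group has a single distinct boolean value.
m = g["boolean"].transform("nunique") == 1

# Where m is True, take the group mean; elsewhere keep the original number.
averaged = df["numbers"].astype(float).mask(m, g["numbers"].transform("mean"))
out = df.assign(numbers=averaged)
print(out)
```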
Since you are comparing directly to R, I would prefer to use siuba rather than pandas:
from siuba import mutate, if_else, _, select, group_by, ungroup
df1 = df >>\
group_by(_.letters) >> \
mutate( condition = _.boolean.unique().size == 1,
numbers = if_else(_.condition, _.numbers.mean(), _.numbers)
) >>\
ungroup() >> select(-_.condition)
print(df1)
letters boolean numbers
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
Note that >> is the pipe. I added \ in order to continue onto the next line. Also note that to refer to the variables you use _.variable.
datar is another solution for you:
>>> import pandas as pd
>>> data = {"letters": ["a", "a", "a", "b", "b", "b"],
... "boolean": [True, True, True, True, True, False],
... "numbers": [1, 2, 3, 1, 2, 3]}
>>> df = pd.DataFrame(data)
>>>
>>> from datar.all import f, group_by, mutate, n_distinct, if_else, mean, select
>>> df >> group_by(f.letters) \
... >> mutate(
... condition=n_distinct(f.boolean) == 1,
... numbers = if_else(f.condition, mean(f.numbers), f.numbers)
... ) \
... >> select(~f.condition)
letters boolean numbers
<object> <bool> <float64>
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
[Groups: letters (n=2)]