Aggregate multiple columns of qualitative data using pandas?
I want to go from this:
| | name | pet |
|---|---|---|
| 1 | Rashida | dog |
| 2 | Rashida | cat |
| 3 | Jim | dog |
| 4 | Jim | dog |
to this:
| | name | num_dogs | num_cats |
|---|---|---|---|
| 1 | Jim | 2 | 0 |
| 2 | Rashida | 1 | 1 |
In R I would do
df %>%
group_by(name) %>%
summarize(num_dogs = length(which(pet == "dog")),
num_cats = length(which(pet == "cat")))
How would I do this using pandas?
There are lots of different ways to do this.
If you are filtering on the value of a single column, you can use .agg with a custom lambda function.
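# Named aggregation: each keyword becomes an output column.
# (x == "dog").sum() counts the matches without needing numpy.
(df.groupby("name")
   .agg(
       num_dogs=("pet", lambda x: (x == "dog").sum()),
       num_cats=("pet", lambda x: (x == "cat").sum()),
   )
)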
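As a self-contained check of the .agg approach, here is a minimal sketch that rebuilds the question's data (assuming the names are spelled consistently) and confirms the counts. The variable name result is only for illustration.

```python
import pandas as pd

# Sample data mirroring the question's input table.
df = pd.DataFrame({
    "name": ["Rashida", "Rashida", "Jim", "Jim"],
    "pet": ["dog", "cat", "dog", "dog"],
})

result = (
    df.groupby("name")
    .agg(
        # Count the rows in each group whose pet matches.
        num_dogs=("pet", lambda s: (s == "dog").sum()),
        num_cats=("pet", lambda s: (s == "cat").sum()),
    )
    .reset_index()  # turn the group key back into a regular column
)
print(result)
```

groupby sorts the group keys by default, so Jim comes before Rashida, as in the desired output.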
Or
(df
.groupby(["name", "pet"])
.size()
.unstack("pet", fill_value=0)
.add_prefix("num_").add_suffix("s")
)
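The unstacked result keeps name as the index and leaves a "pet" label on the column axis. If you want a flat frame shaped like the target table, a sketch along these lines may help; the final column selection is just one way to put num_dogs first.

```python
import pandas as pd

# Sample data mirroring the question's input table.
df = pd.DataFrame({
    "name": ["Rashida", "Rashida", "Jim", "Jim"],
    "pet": ["dog", "cat", "dog", "dog"],
})

counts = (
    df.groupby(["name", "pet"])
    .size()                          # rows per (name, pet) pair
    .unstack("pet", fill_value=0)    # one column per pet, 0 where absent
    .add_prefix("num_")
    .add_suffix("s")
    .rename_axis(None, axis=1)       # drop the leftover "pet" axis label
    .reset_index()                   # make "name" a regular column
    [["name", "num_dogs", "num_cats"]]
)
print(counts)
```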
You can also use a pivot table.
df.reset_index().pivot_table(index="name", columns="pet", values="index", aggfunc="count", fill_value=0)
But if you need to filter based on two columns, the single-column .agg approach will not work. For example, suppose you need to know how many old dogs each person has.
df = pd.DataFrame({'name': ["Rashida", "Rashida", "Joe", "Joe"],
'pet': ['dog', 'cat', 'dog', 'dog'],
'age': ["old", "old", "old", "young"]})
You can use the pivot table.
df.reset_index().pivot_table(index="name", columns=["pet", "age"], values="index", aggfunc="count", fill_value=0)
Or a crosstab.
pd.crosstab(df["name"], [df["pet"], df["age"]], dropna=False).unstack().reset_index()
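If only one specific combination matters (say, old dogs), another option is to build a boolean mask from both columns and sum it per group. This is a sketch rather than part of the answers above; the names is_old_dog and num_old_dogs are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Rashida", "Rashida", "Joe", "Joe"],
    "pet": ["dog", "cat", "dog", "dog"],
    "age": ["old", "old", "old", "young"],
})

# True only where both conditions hold.
is_old_dog = (df["pet"] == "dog") & (df["age"] == "old")

num_old_dogs = (
    df.assign(is_old_dog=is_old_dog)
    .groupby("name")["is_old_dog"]
    .sum()                      # summing booleans counts the True values
    .rename("num_old_dogs")
    .reset_index()
)
print(num_old_dogs)
```

Unlike filtering the frame before grouping, this keeps people who have no old dogs in the result with a count of 0.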
Or you can use siuba, a port of dplyr, to mimic the original R syntax, but I haven't used it enough to know how to use it well.
from siuba import group_by, summarize, _
You can use datar, which is backed by pandas:
>>> from datar.all import f, tribble, length, group_by, which, summarise
>>>
>>> df = tribble(
... f.name, f.pet,
... "Rashida", "dog",
... "Rashida", "cat",
... "Jim", "dog",
... "Jim", "dog",
... )
>>>
>>> df >> group_by(f.name) >> summarise(
... num_dogs = length(which(f.pet == "dog")),
... num_cats = length(which(f.pet == "cat"))
... )
      name  num_dogs  num_cats
  <object>   <int64>   <int64>
0      Jim         2         0
1  Rashida         1         1