Percentage of total by group
Say I start with:
In [1]: import polars as pl
In [2]: df = pl.DataFrame({
    'group1': ['a', 'a', 'b', 'c', 'a', 'b'],
    'group2': [0, 1, 1, 0, 1, 1]
})
In [3]: df
Out[3]:
shape: (6, 2)
┌────────┬────────┐
│ group1 ┆ group2 │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪════════╡
│ a ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└────────┴────────┘
I'd like to get, for each group1, the distribution of group2.
My desired outcome is:
shape: (4, 4)
┌────────┬────────┬───────┬────────────┐
│ group1 ┆ group2 ┆ count ┆ percentage │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 ┆ f64 │
╞════════╪════════╪═══════╪════════════╡
│ a ┆ 0 ┆ 1 ┆ 0.333333 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 1 ┆ 2 ┆ 0.666667 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 ┆ 1.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 0 ┆ 1 ┆ 1.0 │
└────────┴────────┴───────┴────────────┘
Here's one way I've found to do it. Is there a more idiomatic way in polars?
counts = df.groupby(['group1', 'group2']).count()
counts.with_column(
    (
        counts['count']
        / counts.select(pl.col('count').sum().over('group1'))['count']
    ).alias('percentage')
).sort(['group1', 'group2'])
You are on the right path, but it is better to use expressions all the way and avoid constructing or accessing intermediate dataframes.
(df.groupby(["group1", "group2"])
   .agg([
       pl.count()
   ])
).select([
    pl.all().exclude("count"),
    (pl.col("count") / pl.sum("count").over("group1")).alias("percentage")
])
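To make the arithmetic behind the window expression explicit, here is a minimal plain-Python sketch of the same computation. It is not polars code and not part of the original answer: the `rows` list and the `distribution` helper are hypothetical names used only for illustration. Each (group1, group2) count is divided by the total number of rows in its group1, which is what dividing `count` by the sum of `count` over `group1` does. Unlike the `select` above, which excludes `count`, this sketch keeps the count so the rows match the desired outcome in the question.

```python
from collections import Counter

# Same rows as the example DataFrame, written as (group1, group2) pairs.
rows = [("a", 0), ("a", 1), ("b", 1), ("c", 0), ("a", 1), ("b", 1)]


def distribution(pairs):
    """Return sorted (group1, group2, count, percentage) tuples.

    Hypothetical helper for illustration only; it mirrors the
    group-by count followed by a per-group1 normalisation.
    """
    pair_counts = Counter(pairs)                   # rows per (group1, group2)
    group_totals = Counter(g1 for g1, _ in pairs)  # rows per group1
    return [
        (g1, g2, n, n / group_totals[g1])
        for (g1, g2), n in sorted(pair_counts.items())
    ]


result = distribution(rows)
for row in result:
    print(row)
```

This can serve as a quick cross-check of the polars result on small inputs: within each group1 the percentages should sum to 1.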