Pandas: join on grouping keys after aggregation
I have a pandas DataFrame like this:
import pandas as pd

df1 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '31-05-2017', '31-05-2017', '01-06-2017', '01-06-2017'],
    'tag': ['A', 'B', 'B', 'B', 'A', 'A'],
    'metric1': [0, 0, 0, 1, 1, 1],
    'metric2': [0, 1, 1, 0, 1, 0]
})
df2 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '01-06-2017'],
    'tag': ['A', 'B', 'A'],
    'metric3': [25, 3, 7]
})
1) I want to sum metric1 and metric2 for each combination of date and tag,
2) compute the percentage of entries in metric2 that are 1, and
3) merge the grouped df1 with df2 so that I have metric3 for each date and tag.
The desired output:
date       | tag | metric1_sum | metric2_sum | metric2_percentage | metric3
-----------|-----|-------------|-------------|--------------------|--------
31-05-2017 | A   | 0           | 0           | 0                  | 25
31-05-2017 | B   | 1           | 2           | 0.667              | 3
01-06-2017 | A   | 2           | 1           | 0.5                | 7
>>> g = df1.groupby(['date', 'tag']).agg('sum')
>>> g
                metric1  metric2
date       tag
01-06-2017 A          2        1
31-05-2017 A          0        0
           B          1        2
I used the method posted here to calculate the percentage.
>>> g2 = df1.groupby(['date', 'tag']).agg({'metric2': 'sum'})
>>> g2.groupby(level=0).apply(lambda x: x/float(x.sum()))
                metric2
date       tag
01-06-2017 A        1.0
31-05-2017 A        0.0
           B        1.0
But how can I now assign this grouped metric2 to a column metric2_percentage in my groups g or in my df1?
Merging with the group apparently does not work:
>>> pd.merge(g, df2, how='left', on=['date', 'tag'])
KeyError: 'date'
How can I then reduce df1 to one row per group so that I can merge it with df2?
g has date and tag as its index, while merge expects columns, so you need to call reset_index on g:
pd.merge(g.reset_index(), df2, how='left', on=['date', 'tag'])
Or specify left_index=True:
pd.merge(g, df2, how='left', left_index=True, right_on=['date', 'tag'])
Both give the following result (the column order differs slightly):
# date tag metric1 metric2 metric3
#0 01-06-2017 A 2 1 7
#1 31-05-2017 A 0 0 25
#2 31-05-2017 B 1 2 3
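As a possible variation on the two merges above, the grouping keys can be kept as ordinary columns from the start by passing as_index=False to groupby, so that no reset_index is needed. This is only a sketch built on the question's sample frames; the names flat and merged are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '31-05-2017', '31-05-2017', '01-06-2017', '01-06-2017'],
    'tag': ['A', 'B', 'B', 'B', 'A', 'A'],
    'metric1': [0, 0, 0, 1, 1, 1],
    'metric2': [0, 1, 1, 0, 1, 0],
})
df2 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '01-06-2017'],
    'tag': ['A', 'B', 'A'],
    'metric3': [25, 3, 7],
})

# After a plain groupby the keys live in the index, not in the columns,
# which is what the KeyError in the question points to.
g = df1.groupby(['date', 'tag']).sum()
print(list(g.index.names))  # the grouping keys
print(list(g.columns))      # only the metric columns

# as_index=False keeps 'date' and 'tag' as regular columns (hypothetical names below).
flat = df1.groupby(['date', 'tag'], as_index=False).sum()
merged = flat.merge(df2, how='left', on=['date', 'tag'])
print(merged)
```

Whether this reads better than reset_index is largely a matter of taste; both should end up with one row per (date, tag) pair before the merge.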
Here is an alternative that does the job with one fewer join:
(df1.groupby(['date', 'tag']).apply(
    lambda g: pd.Series({'metric1_sum': g.metric1.sum(),
                         'metric2_sum': g.metric2.sum(),
                         # mean equals the fraction of ones only if metric2 holds
                         # just 0 and 1; otherwise use your own lambda function
                         'metric2_percentage': g.metric2.mean()})
).reset_index().merge(df2, how='left', on=['date', 'tag']))
# date tag metric1_sum metric2_percentage metric2_sum metric3
#0 01-06-2017 A 2.0 0.500000 1.0 7
#1 31-05-2017 A 0.0 0.000000 0.0 25
#2 31-05-2017 B 1.0 0.666667 2.0 3
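The comment in the code above assumes metric2 contains only 0 and 1. If that might not hold, one option is to flag the rows equal to 1 first and take the mean of that flag per group. The following is a rough sketch on the question's df1; the column name metric2_is_one and the variable names are invented for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '31-05-2017', '31-05-2017', '01-06-2017', '01-06-2017'],
    'tag': ['A', 'B', 'B', 'B', 'A', 'A'],
    'metric1': [0, 0, 0, 1, 1, 1],
    'metric2': [0, 1, 1, 0, 1, 0],
})

# Boolean flag: True where metric2 is exactly 1 (hypothetical helper column).
flagged = df1.assign(metric2_is_one=df1['metric2'].eq(1))

# The mean of a boolean column per group is the fraction of True values.
pct = (flagged.groupby(['date', 'tag'])['metric2_is_one']
       .mean()
       .rename('metric2_percentage')
       .reset_index())
print(pct)
```

For data that really is only 0 and 1, this should give the same numbers as g.metric2.mean().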
Use agg. The mean of a column of ones and zeros is the same as the fraction of ones.
cols = ['date', 'tag']
d1 = df1.groupby(cols).agg(
    dict(metric1='sum', metric2=['sum', 'mean'])
)
# flatten the two-level column names, e.g. ('metric2', 'mean') -> 'metric2_mean'
d1.columns = d1.columns.map('_'.join)
d1.join(df2.set_index(cols)).reset_index()
date tag metric1_sum metric2_sum metric2_mean metric3
0 01-06-2017 A 2 1 0.500000 7
1 31-05-2017 A 0 0 0.000000 25
2 31-05-2017 B 1 2 0.666667 3
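If the installed pandas supports named aggregation (available from pandas 0.25, as far as I know), the output columns can be named directly in agg, which avoids flattening the column MultiIndex afterwards. This is a sketch under that version assumption, reusing the question's frames; result is just an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '31-05-2017', '31-05-2017', '01-06-2017', '01-06-2017'],
    'tag': ['A', 'B', 'B', 'B', 'A', 'A'],
    'metric1': [0, 0, 0, 1, 1, 1],
    'metric2': [0, 1, 1, 0, 1, 0],
})
df2 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '01-06-2017'],
    'tag': ['A', 'B', 'A'],
    'metric3': [25, 3, 7],
})

cols = ['date', 'tag']

# Named aggregation: new_column=(source_column, aggregation function).
d1 = df1.groupby(cols).agg(
    metric1_sum=('metric1', 'sum'),
    metric2_sum=('metric2', 'sum'),
    metric2_percentage=('metric2', 'mean'),  # assumes metric2 holds only 0 and 1
)

# Join on the shared (date, tag) index, then move the keys back into columns.
result = d1.join(df2.set_index(cols)).reset_index()
print(result)
```

This also lets the percentage column carry the name asked for in the question (metric2_percentage) instead of metric2_mean.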
Over-engineering for the sake of a one-liner
from collections import OrderedDict

df1.groupby(['date', 'tag']).agg(
    dict(metric1='sum', metric2=['sum', 'mean'])
).pipe(
    # rebuild the frame with flattened column names such as 'metric2_mean'
    lambda d: pd.DataFrame(OrderedDict({'_'.join(k): v for k, v in d.items()}))
).join(df2.set_index(['date', 'tag'])).reset_index()
date tag metric1_sum metric2_sum metric2_mean metric3
0 01-06-2017 A 2 1 0.500000 7
1 31-05-2017 A 0 0 0.000000 25
2 31-05-2017 B 1 2 0.666667 3