reduce a pandas dataframe by groups
I've been searching extensively but can't get my head around this issue:
I have a dataframe in pandas that looks like this:
date    ticker  Name  NoShares  SharePrice  Volume  Relation
2/1/10  aaa     zzz   1         1           1       d
2/1/10  aaa     yyy   1         2           5       o
2/1/10  aaa     zzz   2         5           2       d
2/5/10  bbb     xxx   5         5           1       do
2/5/10  ccc     www   5         5           1       d
2/5/10  ccc     www   5         5           1       d
2/5/10  ddd     vvv   5         5           1       o
2/6/10  aaa     zzz   1         1           3       d
Requirements
So my output would look like this:
date    ticker  Name  NoShares  SharePrice  Volume  Relation
2/1/10  aaa     zzz   3         3.6         1       d
2/1/10  aaa     yyy   1         2           5       o
2/5/10  bbb     xxx   5         5           1       do
2/5/10  ccc     www   10        5           1       d
2/5/10  ddd     vvv   5         5           1       o
2/6/10  aaa     zzz   1         1           3       d
I tried the documentation and other answers on Stack Overflow but don't seem to be able to get it right. Appreciate the help. Cheers.
Here's my solution:
import numpy as np

grpby = df.groupby(['date', 'Name'])
# Share-weighted average price per group.
a = grpby.apply(lambda x: np.average(x['SharePrice'], weights=x['NoShares'])).to_frame(name='SharePrice')
b = grpby.agg({'NoShares': 'sum', 'Volume': 'mean', 'Relation': 'max'})
print(b.join(a))
              Volume Relation  NoShares  SharePrice
date   Name
2/1/10 yyy    5.0000        o         1      2.0000
       zzz    1.5000        d         3      3.6667
2/5/10 vvv    1.0000        o         5      5.0000
       www    1.0000        d        10      5.0000
       xxx    1.0000       do         5      5.0000
2/6/10 zzz    3.0000        d         1      1.0000
Just reset_index() afterwards.
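Putting the pieces together, a minimal self-contained version of this answer might look like the following (the DataFrame is rebuilt inline so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['2/1/10', 'aaa', 'zzz', 1, 1, 1, 'd'],
     ['2/1/10', 'aaa', 'yyy', 1, 2, 5, 'o'],
     ['2/1/10', 'aaa', 'zzz', 2, 5, 2, 'd'],
     ['2/5/10', 'bbb', 'xxx', 5, 5, 1, 'do'],
     ['2/5/10', 'ccc', 'www', 5, 5, 1, 'd'],
     ['2/5/10', 'ccc', 'www', 5, 5, 1, 'd'],
     ['2/5/10', 'ddd', 'vvv', 5, 5, 1, 'o'],
     ['2/6/10', 'aaa', 'zzz', 1, 1, 3, 'd']],
    columns=['date', 'ticker', 'Name', 'NoShares',
             'SharePrice', 'Volume', 'Relation'])

grpby = df.groupby(['date', 'Name'])

# Share-weighted average price per (date, Name) group.
a = grpby.apply(lambda x: np.average(x['SharePrice'],
                                     weights=x['NoShares'])).to_frame(name='SharePrice')

# Sum the shares, average the volume, and keep one Relation per group.
b = grpby.agg({'NoShares': 'sum', 'Volume': 'mean', 'Relation': 'max'})

result = b.join(a).reset_index()
```

`result` then has one row per (date, Name) pair, with the grouping keys back as ordinary columns.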
I made an assumption here. When you said to group by date and Name and to keep Relation, I am assuming that ticker and Relation will also be unique within those groups. So for simplicity I am grouping by all four.
import pandas as pd

df = pd.DataFrame([
    ['2/1/10', 'aaa', 'zzz', 1, 1, 1, 'd'],
    ['2/1/10', 'aaa', 'yyy', 1, 2, 5, 'o'],
    ['2/1/10', 'aaa', 'zzz', 2, 5, 2, 'd'],
    ['2/5/10', 'bbb', 'xxx', 5, 5, 1, 'do'],
    ['2/5/10', 'ccc', 'www', 5, 5, 1, 'd'],
    ['2/5/10', 'ccc', 'www', 5, 5, 1, 'd'],
    ['2/5/10', 'ddd', 'vvv', 5, 5, 1, 'o'],
    ['2/6/10', 'aaa', 'zzz', 1, 1, 3, 'd']],
    columns=['date', 'ticker', 'Name', 'NoShares',
             'SharePrice', 'Volume', 'Relation'])

def process_date(dg):
    # Sum the shares, take the share-weighted average price,
    # and average the volume within each group.
    return pd.DataFrame([[
        dg['NoShares'].sum(),
        (dg['NoShares'] * dg['SharePrice']).sum() / dg['NoShares'].sum(),
        dg['Volume'].mean(),
    ]], columns=['NoShares', 'SharePrice', 'Volume'])

df.groupby(['date', 'ticker', 'Name', 'Relation']).apply(process_date).reset_index(4, drop=True).reset_index(drop=False)
Results:
date ticker Name Relation NoShares SharePrice Volume
0 2/1/10 aaa yyy o 1 2.000000 5.0
1 2/1/10 aaa zzz d 3 3.666667 1.5
2 2/5/10 bbb xxx do 5 5.000000 1.0
3 2/5/10 ccc www d 10 5.000000 1.0
4 2/5/10 ddd vvv o 5 5.000000 1.0
5 2/6/10 aaa zzz d 1 1.000000 3.0
Both Dickster's and Leo's answers work well, but just be aware that .groupby has dropna=True set by default. So if you have a dataset and perform a groupby on multiple columns where some of those columns contain NaN's, pandas will drop those groups, and the final DataFrame will have fewer rows. The same SQL query on SQL Server does not drop rows with NULL values in the columns of the GROUP BY clause. I don't know whether that's true for other RDBMSs, but bear in mind that pandas by default treats GROUP BY in a different way.
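To make the dropna behaviour concrete, here is a small sketch (the dropna= keyword on groupby requires pandas 1.1 or newer; the column names are invented for the illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', np.nan, 'b'],
                   'val': [1, 2, 3, 4]})

# Default behaviour: the row whose key is NaN vanishes from the result.
dropped = df.groupby('key')['val'].sum()             # groups: a, b

# dropna=False keeps NaN as its own group, closer to SQL's GROUP BY.
kept = df.groupby('key', dropna=False)['val'].sum()  # groups: a, b, NaN
```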