Reduce a pandas DataFrame by groups

I've been searching extensively but can't get my head around this issue:

I have a dataframe in pandas that looks like this:

date    ticker Name NoShares SharePrice Volume Relation
2/1/10  aaa    zzz  1        1          1      d 
2/1/10  aaa    yyy  1        2          5      o
2/1/10  aaa    zzz  2        5          2      d  
2/5/10  bbb    xxx  5        5          1      do
2/5/10  ccc    www  5        5          1      d
2/5/10  ccc    www  5        5          1      d
2/5/10  ddd    vvv  5        5          1      o
2/6/10  aaa    zzz  1        1          3      d

Requirements

I want to group by date and Name and:

  1. have the number of shares summed up
  2. have a weighted-mean column for the share price (the weights are the NoShares; see the worked example below)
  3. average the volume and have it as a column
  4. leave Relation as it is
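
For example, for zzz on 2/1/10 the weighted share price is (1×1 + 2×5) / (1 + 2) = 11/3 ≈ 3.67, and the mean volume is (1 + 2) / 2 = 1.5.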

so my output would look like this:

date    ticker Name NoShares SharePrice Volume Relation
2/1/10  aaa    zzz  3        3.67       1.5    d
2/1/10  aaa    yyy  1        2          5      o
2/5/10  bbb    xxx  5        5          1      do
2/5/10  ccc    www  10       5          1      d
2/5/10  ddd    vvv  5        5          1      o
2/6/10  aaa    zzz  1        1          3      d

I tried the documentation and other answers on Stack Overflow but don't seem to be able to get it right. Appreciate the help. Cheers.

Here's my solution:

import numpy as np
import pandas as pd

grpby = df.groupby(['date', 'Name'])
# weighted mean of SharePrice, with NoShares as the weights
a = grpby.apply(lambda x: np.average(x['SharePrice'], weights=x['NoShares'])).to_frame(name='SharePrice')
# Relation is constant within each group, so 'max' just carries it through
b = grpby.agg({'NoShares': 'sum', 'Volume': 'mean', 'Relation': 'max'})
print(b.join(a))

             Volume Relation  NoShares  SharePrice
date   Name                                       
2/1/10 yyy   5.0000        o         1      2.0000
       zzz   1.5000        d         3      3.6667
2/5/10 vvv   1.0000        o         5      5.0000
       www   1.0000        d        10      5.0000
       xxx   1.0000       do         5      5.0000
2/6/10 zzz   3.0000        d         1      1.0000

Just reset_index() afterwards.
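
Putting it together, a minimal sketch (the column reordering at the end is only cosmetic):

result = b.join(a).reset_index()
# restore the original column order
result = result[['date', 'Name', 'NoShares', 'SharePrice', 'Volume', 'Relation']]
print(result)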

I made an assumption here. When you said to group by date and Name and keep Relation as it is, I assumed that ticker and Relation are also unique within those groups. So for simplicity I group by all four.

import pandas as pd

df = pd.DataFrame([
                ['2/1/10', 'aaa', 'zzz', 1, 1, 1, 'd'],
                ['2/1/10', 'aaa', 'yyy', 1, 2, 5, 'o'],
                ['2/1/10', 'aaa', 'zzz', 2, 5, 2, 'd'],
                ['2/5/10', 'bbb', 'xxx', 5, 5, 1, 'do'],
                ['2/5/10', 'ccc', 'www', 5, 5, 1, 'd'],
                ['2/5/10', 'ccc', 'www', 5, 5, 1, 'd'],
                ['2/5/10', 'ddd', 'vvv', 5, 5, 1, 'o'],
                ['2/6/10', 'aaa', 'zzz', 1, 1, 3, 'd']],
             columns=['date', 'ticker', 'Name', 'NoShares',
                      'SharePrice', 'Volume', 'Relation'])

def process_date(dg):
    # one row per group: total shares, share-weighted mean price, mean volume
    return pd.DataFrame([[
                        dg['NoShares'].sum(),
                        (dg['NoShares'] * dg['SharePrice']).sum() / dg['NoShares'].sum(),
                        dg['Volume'].mean(),
                        ]], columns=['NoShares', 'SharePrice', 'Volume'])

# reset_index(4, drop=True) discards the extra inner level added by apply;
# the final reset_index lifts the group keys back into columns
df.groupby(['date', 'ticker', 'Name', 'Relation']).apply(process_date).reset_index(4, drop=True).reset_index(drop=False)

Results:

     date ticker Name Relation  NoShares  SharePrice  Volume
0  2/1/10    aaa  yyy        o         1    2.000000     5.0
1  2/1/10    aaa  zzz        d         3    3.666667     1.5
2  2/5/10    bbb  xxx       do         5    5.000000     1.0
3  2/5/10    ccc  www        d        10    5.000000     1.0
4  2/5/10    ddd  vvv        o         5    5.000000     1.0
5  2/6/10    aaa  zzz        d         1    1.000000     3.0
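
For reference, the same reduction can be written in one pass with named aggregation (a sketch assuming pandas >= 0.25; Notional is a helper column introduced here only for the weighted mean):

# precompute NoShares * SharePrice so the weighted mean becomes a ratio of sums
tmp = df.assign(Notional=df['NoShares'] * df['SharePrice'])
out = (tmp.groupby(['date', 'ticker', 'Name', 'Relation'], as_index=False)
          .agg(NoShares=('NoShares', 'sum'),
               Notional=('Notional', 'sum'),
               Volume=('Volume', 'mean')))
out['SharePrice'] = out['Notional'] / out['NoShares']
out = out.drop(columns='Notional')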

Both Dickster's and Leo's answers work well, but just be aware that .groupby has dropna=True set by default. So if you perform a groupby on multiple columns where some of those columns contain NaNs, pandas will drop those groups, and the final DataFrame will have fewer rows.

The same query on SQL Server doesn't drop rows with NULL values in the columns of the GROUP BY clause. I don't know whether that's true for other RDBMSs, but bear in mind that pandas treats group-by differently by default.
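
A minimal sketch of the difference (assumes pandas >= 1.1, where the dropna keyword was added; the frame is made up for illustration):

import numpy as np
import pandas as pd

# toy frame with a NaN in the grouping column (illustration only)
d = pd.DataFrame({'key': ['a', np.nan, 'a'], 'val': [1, 2, 3]})

print(d.groupby('key').sum())                # NaN group dropped (default dropna=True)
print(d.groupby('key', dropna=False).sum())  # NaN group kept, as SQL GROUP BY would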
