
Find top N values within each group

I have a dataset similar to the following example:

| id | size   | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6  | small  | 3     | 0     | 21    | 0     |
| 6  | small  | 9     | 0     | 23    | 0     |
| 13 | medium | 3     | 0     | 12    | 0     |
| 13 | medium | 37    | 0     | 20    | 1     |
| 20 | medium | 30    | 0     | 5     | 6     |
| 20 | medium | 12    | 2     | 3     | 0     |
| 12 | small  | 7     | 0     | 2     | 0     |
| 10 | small  | 8     | 0     | 12    | 0     |
| 15 | small  | 19    | 0     | 3     | 0     |
| 15 | small  | 54    | 0     | 8     | 0     |
| 87 | medium | 6     | 0     | 9     | 0     |
| 90 | medium | 11    | 1     | 16    | 0     |
| 90 | medium | 25    | 0     | 4     | 0     |
| 90 | medium | 10    | 0     | 5     | 0     |
| 9  | large  | 8     | 1     | 23    | 0     |
| 9  | large  | 19    | 0     | 2     | 0     |
| 1  | large  | 1     | 0     | 0     | 0     |
| 50 | large  | 34    | 0     | 7     | 0     |

Here is the input for the table above:

import pandas as pd

data=[[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data= pd.DataFrame(data,columns=['id','size','old_a','old_b','new_a','new_b'])

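I want an output that groups the dataset by size and lists the top 2 ids within each size group, ranked by the values in the "new_a" column. Because some ids appear more than once, I want to sum the new_a values for each id first and then find the top 2 values. My final table should look like this: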

| size   | id | new_a |
|--------|----|-------|
| large  | 9  | 25    |
| large  | 50 | 7     |
| medium | 13 | 32    |
| medium | 90 | 25    |
| small  | 6  | 44    |
| small  | 10 | 12    |

I have tried the code below, but it does not show the top 2 values of new_a for each group in the "size" column.

nlargest = data.groupby(['size','id'])['new_a'].sum().nlargest(2).reset_index()
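
A likely reason the attempt falls short: after the sum, `nlargest(2)` runs on the whole Series, so it keeps the two largest totals overall rather than two per size. The sketch below illustrates this on the sample data, reduced to the three columns involved; the names `totals` and `overall_top` are only illustrative.

```python
import pandas as pd

# Sample rows from the question, reduced to the columns used here.
rows = [
    (6, 'small', 21), (6, 'small', 23), (13, 'medium', 12), (13, 'medium', 20),
    (20, 'medium', 5), (20, 'medium', 3), (12, 'small', 2), (10, 'small', 12),
    (15, 'small', 3), (15, 'small', 8), (87, 'medium', 9), (90, 'medium', 16),
    (90, 'medium', 4), (90, 'medium', 5), (9, 'large', 23), (9, 'large', 2),
    (1, 'large', 0), (50, 'large', 7),
]
data = pd.DataFrame(rows, columns=['id', 'size', 'new_a'])

# One total per (size, id) pair.
totals = data.groupby(['size', 'id'])['new_a'].sum()

# nlargest on the ungrouped Series ranks all pairs together,
# so only two rows survive in total: (small, 6, 44) and (medium, 13, 32).
overall_top = totals.nlargest(2).reset_index()
print(overall_top)
```

To get two rows per size, the ranking step has to run once per size group.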
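
Within each size group, you can sum new_a per id and keep the two largest totals:

print(
    data.groupby('size').apply(
        # Sum only new_a per id so the string column 'size' is not aggregated.
        lambda x: x.groupby('id')[['new_a']].sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)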

Prints:

     size  id  new_a
0   large   9     25
1   large  50      7
2  medium  13     32
3  medium  90     25
4   small   6     44
5   small  10     12
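
Since the title asks for the top N, the same nested idea can be wrapped in a small function. This is a minimal sketch, not part of the original answer: the function name `top_n_per_group` and its parameters are hypothetical, and the sample data is reduced to the three columns involved.

```python
import pandas as pd

def top_n_per_group(frame, group_col, id_col, value_col, n):
    """Sum value_col per id inside each group, then keep the n largest totals."""
    return (
        frame.groupby(group_col)[[id_col, value_col]]
        # The inner result is a one-column DataFrame indexed by id_col.
        .apply(lambda g: g.groupby(id_col)[[value_col]].sum().nlargest(n, columns=value_col))
        # The (group_col, id_col) index levels become ordinary columns.
        .reset_index()
    )

# Sample rows from the question, reduced to the columns used here.
rows = [
    (6, 'small', 21), (6, 'small', 23), (13, 'medium', 12), (13, 'medium', 20),
    (20, 'medium', 5), (20, 'medium', 3), (12, 'small', 2), (10, 'small', 12),
    (15, 'small', 3), (15, 'small', 8), (87, 'medium', 9), (90, 'medium', 16),
    (90, 'medium', 4), (90, 'medium', 5), (9, 'large', 23), (9, 'large', 2),
    (1, 'large', 0), (50, 'large', 7),
]
data = pd.DataFrame(rows, columns=['id', 'size', 'new_a'])

top2 = top_n_per_group(data, 'size', 'id', 'new_a', n=2)
print(top2)
```

Selecting `[[id_col, value_col]]` before `apply` keeps the grouping column out of each sub-frame, which sidesteps differences between pandas versions in how `apply` treats it.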

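You can set size and id as the index here to avoid a double groupby on columns, and sum over the id index level inside each size group (older pandas exposed this as the level parameter of Series.sum).

data.set_index(["size", "id"])["new_a"].groupby(level=0).apply(
    # Series.sum(level=1) was removed in pandas 2.0; groupby(level=1).sum() is the equivalent.
    lambda x: x.groupby(level=1).sum().nlargest(2)
).reset_index()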

     size  id  new_a
0   large   9     25
1   large  50      7
2  medium  13     32
3  medium  90     25
4   small   6     44
5   small  10     12

You can chain two groupby calls:

data.groupby(['id', 'size'])['new_a'].sum().groupby('size').nlargest(2)\
.droplevel(0).to_frame('new_a').reset_index()

Output:

   id    size  new_a
0   9   large     25
1  50   large      7
2  13  medium     32
3  90  medium     25
4   6   small     44
5  10   small     12
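
The chained version can be easier to follow when split into steps. The sketch below names each intermediate result (`totals`, `top2`, `result` are illustrative names) and uses the sample data reduced to the three columns involved. It passes `level='size'` to make explicit that the second groupby works on an index level.

```python
import pandas as pd

# Sample rows from the question, reduced to the columns used here.
rows = [
    (6, 'small', 21), (6, 'small', 23), (13, 'medium', 12), (13, 'medium', 20),
    (20, 'medium', 5), (20, 'medium', 3), (12, 'small', 2), (10, 'small', 12),
    (15, 'small', 3), (15, 'small', 8), (87, 'medium', 9), (90, 'medium', 16),
    (90, 'medium', 4), (90, 'medium', 5), (9, 'large', 23), (9, 'large', 2),
    (1, 'large', 0), (50, 'large', 7),
]
data = pd.DataFrame(rows, columns=['id', 'size', 'new_a'])

# Step 1: one total per (id, size) pair; the index levels are ('id', 'size').
totals = data.groupby(['id', 'size'])['new_a'].sum()

# Step 2: the two largest totals within each size. The group key is
# prepended to the existing index, so 'size' now appears as two levels.
top2 = totals.groupby(level='size').nlargest(2)

# Step 3: drop the duplicated outer 'size' level, then move the
# remaining ('id', 'size') levels into ordinary columns.
result = top2.droplevel(0).to_frame('new_a').reset_index()
print(result)
```

Without the `droplevel(0)` step, `reset_index()` would try to create two columns named size and fail, which is why the chain removes the outer level first.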

