
How to get back the index after groupby in pandas

I am trying to find the record with the maximum value among the first records of each group after a groupby, and then delete that record from the original dataframe.

import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first()  # lost track of the index
desired_row = t[t.cost == t.cost.max()]
# delete this row from df

         cost
item_id      
d           5

I need to keep track of desired_row, delete this row from df, and repeat the process.

What is the best way to find and delete the desired_row?

I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would work just as easily for the last). In fact, because of the general nature of split-apply-combine, I don't think this is easily achievable without doing it yourself.

gb = df.groupby('item_id', as_index=False)
>>> gb.groups  # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}

# Get the first index label from each group using a dictionary comprehension.
# (gb.groups maps each key to index labels, so select with .loc.)
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.loc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
   cost item_id
0     1       a
5     1       c
2     1       b
6     5       d

# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
   cost item_id
0     1       a
1     2       a
2     1       b
3     1       b
4     3       b
5     1       c
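A possibly shorter route, offered as a sketch rather than as the method used above: groupby(...).head(1) returns the first row of each group while keeping the original row labels, so idxmax can point straight at the row to drop. The names first_rows, target and remaining are illustrative only, and on ties idxmax returns the first occurrence.

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})

# head(1) keeps the original index labels, unlike first().
first_rows = df.groupby('item_id').head(1)

# Label of the first-in-group row with the highest cost (earliest on ties).
target = first_rows['cost'].idxmax()

# Drop that single row from the original frame.
remaining = df.drop(index=target)
```

Because the label is kept, no lookup through gb.groups is needed in this variant.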

Try this:

import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
t = df.drop_duplicates(subset=['item_id'], keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]

Out[186]: 
   cost item_id
0     1       a
1     2       a
2     1       b
3     1       b
4     3       b
5     1       c
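The question also asks to repeat the process. One way this might look, as a sketch under the assumption that "repeat" means continuing until no rows are left: recompute the first row of each group on every pass and drop one label at a time. The names work and removed are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})

removed = []      # index labels, in the order they were removed
work = df.copy()  # leave the original frame untouched
while not work.empty:
    # First remaining row of each item_id; original labels are preserved.
    firsts = work.drop_duplicates(subset=['item_id'], keep='first')
    # Label of the row with the largest cost (earliest on ties).
    label = firsts['cost'].idxmax()
    removed.append(label)
    work = work.drop(index=label)
```

Recomputing firsts on each pass is simple but not fast; for large frames a different strategy may be preferable.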

Or, using a negated isin:

Consider this df with a few more rows:

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})

df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]

    cost    item_id
0   1       a
1   2       a
2   1       b
3   1       b
4   3       b
5   1       c
7   1       d
8   7       d
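One thing to keep in mind with the value-based filter: it removes every row whose cost equals that maximum, not only the first row of the group. The sketch below uses a hypothetical frame (not the one from the question) in which group 'a' also contains a cost of 5, and contrasts the filter with dropping by index label.

```python
import pandas as pd

# Hypothetical data: 'a' has a cost of 5 as well as 'd'.
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'd', 'd'],
                   'cost': [1, 5, 1, 5, 7]})

# Largest cost among the first rows of each group, as a list: [5]
max_first = df.groupby('item_id').first().max().tolist()

# Value-based filter: drops every row whose cost is 5 (labels 1 and 3).
by_value = df[~df.cost.isin(max_first)]

# Label-based drop: removes only the first row of 'd' (label 3).
firsts = df.groupby('item_id').head(1)
by_label = df.drop(index=firsts['cost'].idxmax())
```

If costs are guaranteed to be unique, the two approaches should give the same result.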

Overview: create a dataframe from a dictionary. Group by item_id and find the max value. Enumerate over the grouped result and use the key, which is a numeric position, to look up the corresponding item_id in the index. Create a result_df dataframe if you desire.

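import pandas as pd

df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})

# Maximum cost per item_id; the result is a Series indexed by item_id.
grouped = df_temp.groupby(['item_id'])['cost'].max()

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list.
rows = []
for key, value in enumerate(grouped):
    index = grouped.index[key]  # map the position back to the item_id label
    rows.append({'item_id': index, 'cost': value})
result_df = pd.DataFrame(rows, columns=['item_id', 'cost'])

print(result_df.head(5))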
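The loop above can probably be replaced by reset_index(), which moves the group keys from the index back into an ordinary column. A minimal sketch on the same data; like the loop, it yields the per-group maximum rather than the original row labels, so an idxmax call is added for the case where those labels are what is wanted. The name max_labels is illustrative.

```python
import pandas as pd

df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})

# Maximum cost per item_id, with item_id restored as a regular column.
result_df = df_temp.groupby('item_id')['cost'].max().reset_index()

# If the original row labels are needed, idxmax returns them per group.
max_labels = df_temp.groupby('item_id')['cost'].idxmax()
```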
