Keep the record with the largest series for each id in Python

Question

I want to keep the one record that has the largest series for each id, so I need one row per id. I think I need something like

df_new = df.groupby('id')['series'].nlargest(1)

but that's definitely wrong.

This is what my dataset looks like:

id  series s1 s2 s3
1   2      4  9  1
1   8      6  2  2
1   3      9  1  3
2   9      4  1  5
2   2      2  5  5
2   5      1  7  8
3   6      7  2  3
3   2      4  4  1
3   1      3  9  9

This should be the result:

id  series s1 s2 s3
1   8      6  2  2
2   9      4  1  5
3   6      7  2  3

Answer 1

IIUC you want to group by the 'id' column, get the index label of the row where the 'series' value is largest using idxmax(), and use those labels to index back into the original df:

In [91]:
df.loc[df.groupby('id')['series'].idxmax()]

Out[91]:
   id  series  s1  s2  s3
1   1       8   6   2   2
3   2       9   4   1   5
6   3       6   7   2   3
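
To try this outside an IPython session, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same idxmax() lookup. The names `idx` and `result` are only illustrative. Note that idxmax() returns the first label at which the maximum occurs, so if two rows of one id tie for the largest series, only one of them is kept.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'id':     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'series': [2, 8, 3, 9, 2, 5, 6, 2, 1],
    's1':     [4, 6, 9, 4, 2, 1, 7, 4, 3],
    's2':     [9, 2, 1, 1, 5, 7, 2, 4, 9],
    's3':     [1, 2, 3, 5, 5, 8, 3, 1, 9],
})

# Index label of the row holding the largest 'series' value in each 'id' group.
idx = df.groupby('id')['series'].idxmax()

# Select those rows from the original frame; the original index labels are kept.
result = df.loc[idx]
print(result)

# Optional: renumber the rows from 0 if the original labels are not needed.
print(result.reset_index(drop=True))
```

The reset_index(drop=True) call is only needed when a fresh 0-based index is preferred over the original row labels.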

Answer 2

Another solution uses sort_values and then aggregates with first:

df = df.sort_values(by="series", ascending=False).groupby("id", as_index=False).first()
print(df)
   id  series  s1  s2  s3
0   1       8   6   2   2
1   2       9   4   1   5
2   3       6   7   2   3
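
One caveat worth checking on real data: GroupBy.first() takes the first non-null value of each column separately, so when the top row of a group contains NaN, the result can be stitched together from different rows. The sketch below uses made-up data (not from the question) to show the effect, and contrasts it with drop_duplicates, which is a different technique from the one in this answer and keeps whole rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the top 'series' row of id 1 has a missing value in 's1'.
df = pd.DataFrame({
    'id':     [1, 1, 2, 2],
    'series': [8, 3, 9, 5],
    's1':     [np.nan, 7.0, 4.0, 1.0],
})

ordered = df.sort_values(by='series', ascending=False)

# first() skips NaN per column, so id 1 takes s1 from its series == 3 row.
via_first = ordered.groupby('id', as_index=False).first()

# drop_duplicates keeps the first complete row per id, NaN included.
via_drop = (ordered.drop_duplicates('id')
                   .sort_values('id')
                   .reset_index(drop=True))

print(via_first)
print(via_drop)
```

If the data has no missing values, as in the question, both variants select the same rows.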

Answer 3

Here's one NumPy-based solution:

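import numpy as np

def grouby_max(df):
    # Assumes integer ids and non-negative integer series values.
    arr = df[['id','series']].values
    n = arr.shape[0]-1
    # Combine id and series into one key so argsort orders by id, then by series.
    idx = (arr[:,0]*(arr[:,1].max()+1) + arr[:,1]).argsort()
    # Last position of each id run in sorted order, i.e. that group's maximum.
    sidx = np.append(np.nonzero(arr[idx[1:],0] > arr[idx[:-1],0])[0],n)
    return df.iloc[idx[sidx]]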
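
The combined sort key in the function above relies on non-negative integer values and could overflow if ids or series values are very large. As a hedged alternative, the sketch below sorts on the two keys directly with np.lexsort. The function name `groupby_max_lexsort` is hypothetical, the code assumes a non-empty frame, and it is a variant for illustration rather than the answer's original method.

```python
import numpy as np
import pandas as pd

def groupby_max_lexsort(df):
    """Return the row with the largest 'series' per 'id' (hypothetical variant)."""
    ids = df['id'].to_numpy()
    series = df['series'].to_numpy()
    # lexsort treats the last key as the primary one: sort by id, then by series.
    order = np.lexsort((series, ids))
    sorted_ids = ids[order]
    # A group's maximum sits at the last position of its run of equal ids.
    is_last = np.append(sorted_ids[1:] != sorted_ids[:-1], True)
    return df.iloc[order[is_last]]

# Sample 'id' and 'series' columns copied from the question.
df = pd.DataFrame({
    'id':     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'series': [2, 8, 3, 9, 2, 5, 6, 2, 1],
})
print(groupby_max_lexsort(df))
```

Because the keys are never multiplied together, this version also works when the series column holds floats or negative numbers.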

Runtime test:

In [201]: # Setup input
     ...: N = 100 # Number of groups
     ...: data = np.random.randint(11,999999,(10000,5))
     ...: data[:,0] = np.sort(np.random.randint(1,N+1,(data.shape[0])))
     ...: df = pd.DataFrame(data, columns=['id','series','s1','s2','s3'])
     ...: 

In [202]: %timeit df.loc[df.groupby('id')['series'].idxmax()]
100 loops, best of 3: 15.8 ms per loop #@EdChum's soln

In [203]: %timeit df.sort_values(by="series", ascending=False).groupby("id", as_index=False).first()
100 loops, best of 3: 4.52 ms per loop #@jezrael's soln

In [204]: %timeit grouby_max(df)
100 loops, best of 3: 1.96 ms per loop
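
Timings like these depend on the machine and on the pandas and NumPy versions, so it is worth re-running them. The sketch below is one way to do that as a plain script with the standard timeit module instead of %timeit. The fixed seed, the `candidates` dictionary, and the repeat counts are arbitrary choices for illustration, and the frame is built with a flat list of column names. It first checks that all three approaches select the same (id, series) pairs.

```python
import timeit

import numpy as np
import pandas as pd

def grouby_max(df):
    # NumPy-based approach from Answer 3 (assumes non-negative integer columns).
    arr = df[['id', 'series']].values
    n = arr.shape[0] - 1
    idx = (arr[:, 0] * (arr[:, 1].max() + 1) + arr[:, 1]).argsort()
    sidx = np.append(np.nonzero(arr[idx[1:], 0] > arr[idx[:-1], 0])[0], n)
    return df.iloc[idx[sidx]]

# Same shape of input as the runtime test; the seed is an arbitrary choice.
rng = np.random.RandomState(0)
N = 100  # number of groups
data = rng.randint(11, 999999, (10000, 5))
data[:, 0] = np.sort(rng.randint(1, N + 1, data.shape[0]))
df = pd.DataFrame(data, columns=['id', 'series', 's1', 's2', 's3'])

candidates = {
    'idxmax': lambda: df.loc[df.groupby('id')['series'].idxmax()],
    'sort + first': lambda: (df.sort_values(by='series', ascending=False)
                               .groupby('id', as_index=False).first()),
    'grouby_max': lambda: grouby_max(df),
}

# Check that all approaches pick the same (id, series) pairs before timing them.
pairs = {}
for name, fn in candidates.items():
    out = fn()
    pairs[name] = sorted(zip(out['id'].tolist(), out['series'].tolist()))
assert pairs['idxmax'] == pairs['sort + first'] == pairs['grouby_max']

for name, fn in candidates.items():
    # Best of 3 repeats of 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{name}: {best * 1e3:.2f} ms per call')
```

Comparing (id, series) pairs rather than whole rows keeps the check valid even if two rows of one id happen to tie for the largest series.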

Answer 1: score 6, posted 2016-11-03 16:22:25

Answer 2: score 4, accepted, posted 2016-11-03 16:24:10

Answer 3: score 3, posted 2016-11-03 18:59:16