
Record the largest series for each id in Python

I want to keep the one record that has the largest series for each id, so for each id I need one row. I think I need something like

df_new = df.groupby('id')['series'].nlargest(1)

but that's definitely wrong.

This is what my dataset looks like:

id  series s1 s2 s3
1   2      4  9  1
1   8      6  2  2
1   3      9  1  3
2   9      4  1  5
2   2      2  5  5
2   5      1  7  8
3   6      7  2  3
3   2      4  4  1
3   1      3  9  9

This should be the result:

id  series s1 s2 s3
1   8      6  2  2
2   9      4  1  5
3   6      7  2  3

IIUC, you want to group by the 'id' column, get the index label of the row where the 'series' value is the largest using idxmax(), and use those labels to index back into the original df:

In [91]:
df.loc[df.groupby('id')['series'].idxmax()]

Out[91]:
   id  series  s1  s2  s3
1   1       8   6   2   2
3   2       9   4   1   5
6   3       6   7   2   3
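For anyone who wants to reproduce this outside an IPython session, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample data in the question; the variable names best_labels and result are just illustrative. Note that idxmax() returns the first matching label when several rows in a group share the maximum, so ties resolve to the earliest row.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "series": [2, 8, 3, 9, 2, 5, 6, 2, 1],
    "s1":     [4, 6, 9, 4, 2, 1, 7, 4, 3],
    "s2":     [9, 2, 1, 1, 5, 7, 2, 4, 9],
    "s3":     [1, 2, 3, 5, 5, 8, 3, 1, 9],
})

# For each id, idxmax() gives the index label of the row with the largest series
best_labels = df.groupby("id")["series"].idxmax()

# .loc then pulls the complete rows for those labels
result = df.loc[best_labels]
print(result)
```

The original index labels (1, 3, 6) are kept; append .reset_index(drop=True) if a fresh 0..n-1 index is preferred.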

Another solution, using sort_values and aggregating with first:

df = df.sort_values(by="series", ascending=False).groupby("id", as_index=False).first()
print (df)
   id  series  s1  s2  s3
0   1       8   6   2   2
1   2       9   4   1   5
2   3       6   7   2   3
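One caveat with this approach: GroupBy.first() takes the first non-null value in each column, so if the data contains NaN the returned row can mix values from different rows of a group. A possible variant, sketched here on the question's sample data, sorts the same way but keeps whole rows with drop_duplicates. This is an alternative suggestion, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "series": [2, 8, 3, 9, 2, 5, 6, 2, 1],
    "s1":     [4, 6, 9, 4, 2, 1, 7, 4, 3],
    "s2":     [9, 2, 1, 1, 5, 7, 2, 4, 9],
    "s3":     [1, 2, 3, 5, 5, 8, 3, 1, 9],
})

result = (
    df.sort_values("series", ascending=False)  # largest series first
      .drop_duplicates("id")                   # keep the first (largest) row per id
      .sort_values("id")                       # restore id order
)
print(result)
```

Unlike the groupby version, this keeps the original index labels of the selected rows.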

Here's a NumPy-based solution:

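import numpy as np

def grouby_max(df):
    # Work on the raw (id, series) pairs as a NumPy array
    arr = df[['id','series']].values
    n = arr.shape[0]-1
    # Sort on a combined key so rows are ordered by id, then by series
    idx = (arr[:,0]*(arr[:,1].max()+1) + arr[:,1]).argsort()
    # The last row of each id run holds that id's largest series
    sidx = np.append(np.nonzero(arr[idx[1:],0] > arr[idx[:-1],0])[0],n)
    return df.iloc[idx[sidx]]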
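The combined sort key above appears to rely on 'id' and 'series' being non-negative integers. As a hedged alternative, the same "sort, then take the last row of each id run" idea can be sketched with np.lexsort, which sorts on two keys directly. The function name groupby_max_lexsort and its key/value parameters are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def groupby_max_lexsort(df, key="id", value="series"):
    """Return the row with the largest `value` for each `key` (sketch)."""
    ids = df[key].to_numpy()
    vals = df[value].to_numpy()
    # np.lexsort uses the LAST key as the primary one: sort by id, then series
    order = np.lexsort((vals, ids))
    sorted_ids = ids[order]
    # A row is the last of its id run when the next sorted id differs;
    # the final row is always the last of its run
    is_last = np.ones(len(order), dtype=bool)
    is_last[:-1] = sorted_ids[1:] != sorted_ids[:-1]
    return df.iloc[order[is_last]]

# Sample data copied from the question
df = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "series": [2, 8, 3, 9, 2, 5, 6, 2, 1],
    "s1":     [4, 6, 9, 4, 2, 1, 7, 4, 3],
    "s2":     [9, 2, 1, 1, 5, 7, 2, 4, 9],
    "s3":     [1, 2, 3, 5, 5, 8, 3, 1, 9],
})
result = groupby_max_lexsort(df)
print(result)
```

No timing is claimed for this variant; it trades the arithmetic key for a two-key sort, so it would need its own benchmark.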

Runtime test:

In [201]: # Setup input
     ...: import numpy as np
     ...: import pandas as pd
     ...: N = 100 # Number of groups
     ...: data = np.random.randint(11,999999,(10000,5))
     ...: data[:,0] = np.sort(np.random.randint(1,N+1,(data.shape[0])))
     ...: df = pd.DataFrame(data, columns=['id','series','s1','s2','s3'])
     ...: 

In [202]: %timeit df.loc[df.groupby('id')['series'].idxmax()]
100 loops, best of 3: 15.8 ms per loop #@EdChum's soln

In [203]: %timeit df.sort_values(by="series", ascending=False).groupby("id", as_index=False).first()
100 loops, best of 3: 4.52 ms per loop #@jezrael's soln

In [204]: %timeit grouby_max(df)
100 loops, best of 3: 1.96 ms per loop
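Timings like these are only comparable if the approaches return the same answer. A small sanity check along those lines might look like the sketch below. It builds similar random data with a seeded generator (the seed and variable names are arbitrary choices for this example) and compares only the per-id maximum, since rows tied on 'series' may legitimately be picked differently by each method.

```python
import numpy as np
import pandas as pd

# Seeded random data shaped like the benchmark input (seed chosen arbitrarily)
rng = np.random.default_rng(0)
n_rows, n_groups = 10_000, 100
df = pd.DataFrame(
    rng.integers(11, 999_999, size=(n_rows, 5)),
    columns=["id", "series", "s1", "s2", "s3"],
)
df["id"] = np.sort(rng.integers(1, n_groups + 1, size=n_rows))

# The two pandas approaches from the answers above
by_idxmax = df.loc[df.groupby("id")["series"].idxmax()]
by_sort = (
    df.sort_values(by="series", ascending=False)
      .groupby("id", as_index=False)
      .first()
)

# Reference answer: the plain per-id maximum of series
expected = df.groupby("id")["series"].max()

agree = (
    by_idxmax["id"].tolist() == expected.index.tolist()
    and by_idxmax["series"].tolist() == expected.tolist()
    and by_sort["id"].tolist() == expected.index.tolist()
    and by_sort["series"].tolist() == expected.tolist()
)
print("approaches agree:", agree)
```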

