
pandas dataframe groupby and get nth row

I have a pandas DataFrame like the following.

import pandas as pd

df = pd.DataFrame([
    [1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4, 3.4, 2.6, 1.1, 1.1, 3.3],
    list('AAABBBBABCBDDD'),
    [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3, 4.5, 4.6, 4.7, 4.7, 4.8],
    ['x/y/z', 'x/y', 'x/y/z/n', 'x/u', 'x', 'x/u/v', 'x/y/z', 'x', 'x/u/v/b', '-', 'x/y', 'x/y/z', 'x', 'x/u/v/w'],
    ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1'],
    ['200', '400', '404', '200', '200', '404', '200', '404', '500', '200', '500', '200', '200', '400'],
]).T  # transpose so that each inner list becomes a column

df.columns = ['col1', 'col2', 'col3', 'col4', 'ID', 'col5']

I want to group this by "ID" and get the 2nd row of each group. Later I will also need the 3rd and 4th rows, but for now please just explain how to get only the 2nd row of each group.

I tried the following, which returns both the first and second rows.

df.groupby('ID').head(2)

Instead, I need only the second row of each group. IDs 4 and 6 have no second row, so they should be ignored. For reference, head(2) returns:

             col1 col2 col3     col4     ID    col5
ID                                           
1       0   1.1     A  1.1    x/y/z       1    200
        11  1.1     D  4.7    x/y/z       1    200
2       3   2.6     B  2.6      x/u       2    200
        5   3.4     B  3.8    x/u/v       2    404
3       1   1.1     A  1.7      x/y       3    400
        2   1.1     A  2.5  x/y/z/n       3    404
4       4   2.5     B  3.3        x       4    200
5       6   2.6     B    4    x/y/z       5    200
        10  2.6     B  4.6      x/y       5    500
6       8   3.4     B  4.3  x/u/v/b       6    500

I think the nth method is supposed to do just that:

In [10]: g = df.groupby('ID')
In [11]: g.nth(1).dropna()
Out[11]: 
    col1 col2  col3     col4 col5
ID                               
1    1.1    D   4.7    x/y/z  200
2    3.4    B   3.8    x/u/v  404
3    1.1    A   2.5  x/y/z/n  404
5    2.6    B   4.6      x/y  500
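A minimal sketch of the nth approach, using a trimmed copy of the question's data (keeping only the ID and col3 columns is a simplification made here). The shape of the result depends on the pandas version: from pandas 2.0 onward nth acts as a filter and keeps the original row labels, while older releases index the result by the group key, so the sketch relies only on the selected values.

```python
import pandas as pd

# Trimmed copy of the question's data: only ID and col3 are kept.
df = pd.DataFrame({
    'ID':   ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1'],
    'col3': [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3, 4.5, 4.6, 4.7, 4.7, 4.8],
})

g = df.groupby('ID')

# nth is 0-based, so nth(1) picks the second row of each group.
# Groups with only one row (IDs 4 and 6) have no such row to pick.
second = g.nth(1)

# A list of positions selects several rows per group in one call,
# here the 2nd, 3rd and 4th rows wherever they exist.
second_to_fourth = g.nth([1, 2, 3])

print(second)
print(second_to_fourth)
```

On recent pandas the trailing dropna() from the session above should not be needed, since groups without a second row simply produce no output row.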

In pandas 0.13 and later, another way to do this is to use cumcount, where n is the 1-based position of the row you want:

df[g.cumcount() == n - 1]

...which is significantly faster (roughly 40x on this data):

In [21]: %timeit g.nth(1).dropna()
100 loops, best of 3: 11.3 ms per loop

In [22]: %timeit df[g.cumcount() == 1]
1000 loops, best of 3: 286 µs per loop
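The cumcount idea also covers the follow-up need for the 3rd and 4th rows. The sketch below uses a trimmed copy of the question's data (only ID and col3, a simplification made here); the variable names pos, nth_rows and later_rows are illustrative.

```python
import pandas as pd

# Trimmed copy of the question's data: only ID and col3 are kept.
df = pd.DataFrame({
    'ID':   ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1'],
    'col3': [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3, 4.5, 4.6, 4.7, 4.7, 4.8],
})

# cumcount numbers the rows inside each group, starting at 0,
# and returns a Series aligned with df's index.
pos = df.groupby('ID').cumcount()

n = 2  # 1-based position of the wanted row
nth_rows = df[pos == n - 1]

# Several positions at once: 2nd, 3rd and 4th rows of each group.
later_rows = df[pos.isin([1, 2, 3])]

print(nth_rows)
print(later_rows)
```

Because this is a plain boolean mask over the original frame, groups that are too short simply contribute no rows, and the original index and row order are preserved.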

If you use apply on the groupby, the function you pass is called on each group, passed as a DataFrame. So you can do:

df.groupby('ID').apply(lambda t: t.iloc[1])

However, this will raise an error if any group has fewer than two rows. Excluding such groups is trickier: I'm not aware of a way to drop the result of apply for only certain groups. You could filter out the small groups first, or return a one-row NaN-filled DataFrame for them and call dropna on the result.
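One way the "filter out the small groups first" suggestion might look, as a sketch on a trimmed copy of the question's data (only ID and col3, a simplification made here). Selecting the columns explicitly before apply is a choice made in this sketch so that the grouping column is not passed to the function, which behaves differently across pandas versions.

```python
import pandas as pd

# Trimmed copy of the question's data: only ID and col3 are kept.
df = pd.DataFrame({
    'ID':   ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1'],
    'col3': [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3, 4.5, 4.6, 4.7, 4.7, 4.8],
})

# Step 1: keep only the rows whose group has at least two members.
big_enough = df.groupby('ID').filter(lambda t: len(t) >= 2)

# Step 2: every remaining group has a second row, so iloc[1] is safe.
second = big_enough.groupby('ID')[['col3']].apply(lambda t: t.iloc[1])

print(second)
```

This makes two passes over the data and calls a Python function per group, so it is likely to be slower than the cumcount mask, but it generalizes to per-group logic that a mask cannot express.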


 