
Keeping the N first occurrences of

The following code will (of course) keep only the first occurrence of each 'Item1' value in rows sorted by 'Date'. Any suggestions as to how I could get it to keep, say, the first 5 occurrences?

## Sort the dataframe by Date and keep only the earliest appearance of 'Item1'
## drop_duplicates considers the column 'Item1' and keeps only the first occurrence

coocdates = data.sort_values('Date').drop_duplicates(subset=['Item1'])

You want to use head, either on the DataFrame itself or on the groupby:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [1, 6], [2, 8]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  1  6
3  2  8

In [13]: df.head(2)  # the first two rows
Out[13]:
   A  B
0  1  2
1  1  4

In [14]: df.groupby('A').head(2)  # the first two rows in each group
Out[14]:
   A  B
0  1  2
1  1  4
3  2  8

Note: the behaviour of groupby's head changed in 0.14 (before that it did not act like a filter but modified the index), so you will have to reset the index if you are using an earlier version.
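Applied to the question, the same idea might look like the sketch below. The sample rows are made up for illustration; only the column names 'Date' and 'Item1' come from the question, and n is set to 2 to keep the output small (the question asks for 5).

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-01-05", "2014-01-01", "2014-01-03",
        "2014-01-02", "2014-01-04", "2014-01-06",
    ]),
    "Item1": ["a", "a", "b", "a", "b", "a"],
})

n = 2  # number of occurrences to keep per item (5 in the question)

# Sort by date first so that "first" means "earliest", then keep the
# first n rows of every Item1 group. head() keeps the original index labels.
first_n = data.sort_values("Date").groupby("Item1").head(n)
print(first_n)
```

The result keeps the date-sorted row order, so rows of different items stay interleaved by date rather than being grouped together.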

Use groupby() and nth():

According to the pandas docs, nth() will:

Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.

Therefore all you need is:

# reset_index(inplace=True) returns None, so assign the result instead
first_five = df.groupby('Date').nth([0, 1, 2, 3, 4]).reset_index(drop=False)
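One caveat: the index that nth() returns has differed between pandas releases (as far as I can tell, from pandas 2.0 onward it filters the original rows instead of indexing the result by the group key), so whether reset_index() is needed depends on the installed version. A sketch of an alternative that avoids the question is to number the rows inside each group with cumcount() and filter on that. The 'Value' column and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data, grouped by 'Date' as in the answer above.
df = pd.DataFrame({
    "Date": ["2014-01-01", "2014-01-01", "2014-01-01", "2014-01-02"],
    "Value": [2, 4, 6, 8],
})

n = 2  # rows to keep per group

# cumcount() numbers the rows within each group 0, 1, 2, ...
position = df.groupby("Date").cumcount()

# Keep the rows whose within-group position is below n.
first_n = df[position < n]
print(first_n)
```

This selects the same rows as df.groupby('Date').head(n) and leaves the original index untouched.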
