简体   繁体   中英

Removing values from dataframe based on groupby condition

I'm a bit stuck here trying to determine how to slice my dataframe.

data = {'Date' : ['08/20/10','08/20/10','08/20/10','08/21/10','08/22/10','08/24/10','08/25/10','08/26/10'] , 'Receipt' : [10001,10001,10002,10002,10003,10004,10004,10004],
   'Product' : ['xx1','xx2','yy1','fff4','gggg4','fsf4','gggh5','hhhg6']}

dfTest = pd.DataFrame(data)
dfTest

This will produce:

    Date    Product    Receipt
0   08/20/10    xx1    10001
1   08/20/10    xx2    10001
2   08/20/10    yy1    10002
3   08/21/10    fff4    10002
4   08/22/10    gggg4   10003
5   08/24/10    fsf4    10004
6   08/25/10    gggh5   10004
7   08/26/10    hhhg6   10004

I want to create a new dataframe that only contains unique receipts, meaning the receipt should only be used on 1 day only (but it can be shown multiple times in 1 day). If the receipt shows up in multiple days, it needs to be removed. The above data set should look like this:

    Date    Product    Receipt
0   08/20/10    xx1    10001
1   08/20/10    xx2    10001
2   08/22/10    gggg4   10003

What I have done so far is:

dfTest.groupby(['Receipt','Date']).count()

              Product
Receipt Date    
10001   08/20/10    2
10002   08/20/10    1
        08/21/10    1
10003   08/22/10    1
10004   08/24/10    1
        08/25/10    1
        08/26/10    1

I didn't know how to do a query for that date in that kind of structure, so I reset the index.

df1 = dfTest.groupby(['Receipt','Date']).count().reset_index()


Receipt Date    Product
0   10001   08/20/10    2
1   10002   08/20/10    1
2   10002   08/21/10    1
3   10003   08/22/10    1
4   10004   08/24/10    1
5   10004   08/25/10    1
6   10004   08/26/10    1

Now I'm not sure how to proceed. I hope someone out there can lend a helping hand. This might be easy, I'm just a bit confused or lacking in experience.

You can use SeriesGroupBy.nunique with boolean indexing where condition use Series.isin :

df1 = dfTest.groupby(['Receipt'])['Date'].nunique()
print (df1)
Receipt
10001    1
10002    2
10003    1
10004    3
Name: Date, dtype: int64

#get indexes of all rows where length is 1
print (df1[df1 == 1].index)
Int64Index([10001, 10003], dtype='int64', name='Receipt')

#get all rows where in column Receipt are indexes with length 1
print (dfTest[dfTest['Receipt'].isin(df1[df1 == 1].index)])
       Date Product  Receipt
0  08/20/10     xx1    10001
1  08/20/10     xx2    10001
4  08/22/10   gggg4    10003

Another solution where find indexes by condition and then select DataFrame by loc :

print (dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index)
Int64Index([0, 1, 4], dtype='int64')


df1 = dfTest.loc[dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index]
print (df1)
       Date Product  Receipt
0  08/20/10     xx1    10001
1  08/20/10     xx2    10001
4  08/22/10   gggg4    10003

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM