简体   繁体   中英

Filter dataframe based on groupby and pandas series

I have following dataframe:

dct ={'store':('A','A','A','A','A','B','B','B','C','C','C'),
     'station':('aisle','aisle','aisle','window','window','aisle','aisle','aisle','aisle','window','window'),
     'produce':('apple','apple','orange','orange','orange','apple','apple','orange','apple','apple','orange')}

df = pd.DataFrame(dct)


print(df)

store  station    produce
A      aisle      apple
A      aisle      apple
A      aisle      orange
A      window     orange
A      window     orange
B      aisle      apple     
B      aisle      apple
B      aisle      orange
C      aisle      apple
C      window     apple
C      window     orange

subset df based on: [Count of duplicate data based on store, station, and produce] is not same as [total count based on store, station, & produce]. In other words, if any store has only duplicate rows based on store, station, and produce then remove it however include rows even if one non-duplicate record found:

Expected dataframe walkthrough

store  station    produce
A      aisle      apple
A      aisle      apple
A      aisle      orange
A      window     orange  ->exclude because store, station and produce match
A      window     orange  ->exclude because store, station and produce match
B      aisle      apple     
B      aisle      apple
B      aisle      orange
C      aisle      apple
C      window     apple
C      window     orange

expected dataframe:

store  station    produce
A      aisle      apple
A      aisle      apple
A      aisle      orange
B      aisle      apple     
B      aisle      apple
B      aisle      orange
C      aisle      apple
C      window     apple
C      window     orange

From Store 'B' apple is included because 'orange' is also present for same store-station makes it exception. Conceptually, I understand what to do, but unable to translate it in the code.

s = (df.duplicated(subset = ['store','station','produce'], keep=False))
sample = df[df.groupby(['store','station'])['station_ID'].sum().eq(dupli_count)] --> something going wrong here

We can try groupby with transform nunique

df = df[df.groupby(['store', 'station'])['produce'].transform('nunique')!=1]
Out[43]: 
   store station produce
0      A   aisle   apple
1      A   aisle   apple
2      A   aisle  orange
5      B   aisle   apple
6      B   aisle   apple
7      B   aisle  orange
9      C  window   apple
10     C  window  orange

Update if we would like keep the one row only group

g = df.groupby(['store', 'station'])['produce']
df = df[(g.transform('nunique')!=1) | (g.transform('count')==1)]
df
Out[46]: 
   store station produce
0      A   aisle   apple
1      A   aisle   apple
2      A   aisle  orange
5      B   aisle   apple
6      B   aisle   apple
7      B   aisle  orange
8      C   aisle   apple
9      C  window   apple
10     C  window  orange

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM