Filter dataframe based on groupby and pandas series
I have the following dataframe:
import pandas as pd

dct = {'store':('A','A','A','A','A','B','B','B','C','C','C'),
'station':('aisle','aisle','aisle','window','window','aisle','aisle','aisle','aisle','window','window'),
'produce':('apple','apple','orange','orange','orange','apple','apple','orange','apple','apple','orange')}
df = pd.DataFrame(dct)
print(df)
store station produce
A aisle apple
A aisle apple
A aisle orange
A window orange
A window orange
B aisle apple
B aisle apple
B aisle orange
C aisle apple
C window apple
C window orange
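I want to subset df to the rows where [the count of duplicates based on store, station and produce] differs from [the total count based on store, station and produce]. In other words, if a store has only duplicate rows based on store, station and produce, remove them; but if even one non-duplicate record is found, keep the rows.
Walkthrough of the expected dataframe: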
store station produce
A aisle apple
A aisle apple
A aisle orange
A window orange ->exclude because store, station and produce match
A window orange ->exclude because store, station and produce match
B aisle apple
B aisle apple
B aisle orange
C aisle apple
C window apple
C window orange
Expected dataframe:
store station produce
A aisle apple
A aisle apple
A aisle orange
B aisle apple
B aisle apple
B aisle orange
C aisle apple
C window apple
C window orange
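The apples from store 'B' are included because 'orange' also exists at the same store and station, which makes it an exception. Conceptually I understand what to do, but I can't translate it into code.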
s = (df.duplicated(subset = ['store','station','produce'], keep=False))
sample = df[df.groupby(['store','station'])['station_ID'].sum().eq(dupli_count)]  # something is going wrong here
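We can try groupby with transform('nunique'). The result is not assigned back to df here, so the update further down still starts from the original dataframe:
df[df.groupby(['store', 'station'])['produce'].transform('nunique') != 1]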
Out[43]:
store station produce
0 A aisle apple
1 A aisle apple
2 A aisle orange
5 B aisle apple
6 B aisle apple
7 B aisle orange
9 C window apple
10 C window orange
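To see why this filter behaves as it does, it can help to look at the per-row values that transform('nunique') produces. The following is only a minimal sketch: it rebuilds the question's dataframe so that it runs on its own, and the names n_produce and kept are illustrative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'store':   ('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'),
    'station': ('aisle', 'aisle', 'aisle', 'window', 'window', 'aisle',
                'aisle', 'aisle', 'aisle', 'window', 'window'),
    'produce': ('apple', 'apple', 'orange', 'orange', 'orange', 'apple',
                'apple', 'orange', 'apple', 'apple', 'orange'),
})

# Number of distinct produce values within each (store, station) group,
# broadcast back onto every row of that group.
n_produce = df.groupby(['store', 'station'])['produce'].transform('nunique')
print(df.assign(n_produce=n_produce))

# Keep only rows whose group holds more than one distinct produce value.
kept = df[n_produce != 1]
print(kept.index.tolist())  # [0, 1, 2, 5, 6, 7, 9, 10]
```

Row 8 (C, aisle, apple) is dropped because its group contains a single row and therefore a single distinct produce value.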
Update: if we also want to keep a group that contains only one row:
g = df.groupby(['store', 'station'])['produce']
df = df[(g.transform('nunique')!=1) | (g.transform('count')==1)]
df
Out[46]:
store station produce
0 A aisle apple
1 A aisle apple
2 A aisle orange
5 B aisle apple
6 B aisle apple
7 B aisle orange
8 C aisle apple
9 C window apple
10 C window orange
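The idea described in the question, comparing the number of duplicated rows with the total number of rows in each (store, station) group, can also be written out directly. This is only a sketch of that alternative, not the answer's method; dup, dup_count, total and result are illustrative names, and the dataframe is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'store':   ('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'),
    'station': ('aisle', 'aisle', 'aisle', 'window', 'window', 'aisle',
                'aisle', 'aisle', 'aisle', 'window', 'window'),
    'produce': ('apple', 'apple', 'orange', 'orange', 'orange', 'apple',
                'apple', 'orange', 'apple', 'apple', 'orange'),
})

# True for every row whose (store, station, produce) combination occurs more than once.
dup = df.duplicated(subset=['store', 'station', 'produce'], keep=False)

# Group the 0/1 duplicate flags by (store, station).
grouped = dup.astype(int).groupby([df['store'], df['station']])
dup_count = grouped.transform('sum')   # duplicated rows per group
total = grouped.transform('count')     # all rows per group

# Drop groups made up entirely of duplicated rows.
result = df[dup_count != total]
print(result)
```

On this data the sketch selects the same rows as the update above. The two rules can diverge on other data: a group in which every produce value appears at least twice (for example apple, apple, orange, orange) would be dropped here but kept by the nunique-based filter.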