
Group and filter a pandas DataFrame

OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2

I want to remove every OID that has 'no' or 'no2' as a response.

I tried:

dfnew = df.groupby('OID').filter(lambda x: ((x['ResponseType']=='no') | x['ResponseType']=='no2')).any() )

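But I get SyntaxError: invalid syntax.

Another approach could be to build a set of all the OIDs to be filtered out and then use it to filter df. df has 5000000 rows!

Expected output: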

OID,TYPE,ResponseType
100,mod,ok
100,mod,ok

102,mod,ok
102,mod,ok2

You need to add one ( and a ~ to invert the boolean mask, but this approach is really slow:

dfnew = df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') | 
                                             (x['ResponseType']=='no2')).any() )
                                          #here

print (dfnew)
   OID TYPE ResponseType
0  100  mod           ok
1  100  mod           ok
7  102  mod           ok
8  102  mod          ok2
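
A related pitfall in the original attempt is operator precedence: in Python, | binds more tightly than ==, so each comparison needs its own parentheses. The sketch below sidesteps both issues by naming the intermediate masks inside a helper. The function name has_no_rejection is hypothetical, and the sample frame is rebuilt from the question's data. It uses not rather than ~ so that the result does not depend on whether any() returns a NumPy or a plain Python boolean.

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV = """OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2
"""
df = pd.read_csv(io.StringIO(CSV))


def has_no_rejection(group):
    """Return True when the group contains neither 'no' nor 'no2'."""
    is_no = group['ResponseType'] == 'no'
    is_no2 = group['ResponseType'] == 'no2'
    # Combine the two masks, then negate the group-level result.
    return not (is_no | is_no2).any()


# Keep only the groups for which the helper returns True.
dfnew = df.groupby('OID').filter(has_no_rejection)
print(dfnew)
```

This is still one Python call per group, so it is expected to be as slow as the lambda version on large frames.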

Another solution, using boolean indexing and a double isin, is faster:

oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID']
print (oids)
6     101
10    103
Name: OID, dtype: int64

dfnew = df[~df['OID'].isin(oids)]
print (dfnew)
   OID TYPE ResponseType
0  100  mod           ok
1  100  mod           ok
7  102  mod           ok
8  102  mod          ok2
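
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same two-step isin idea on the question's sample data. The helper name drop_oids_with_responses and its parameters are assumptions made for illustration, not part of pandas.

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV = """OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2
"""
df = pd.read_csv(io.StringIO(CSV))


def drop_oids_with_responses(frame, bad_responses):
    """Drop every row whose OID has at least one response in bad_responses."""
    # Step 1: rows that carry an unwanted response.
    bad_rows = frame['ResponseType'].isin(bad_responses)
    # Step 2: the OIDs those rows belong to (duplicates are harmless for isin).
    bad_oids = frame.loc[bad_rows, 'OID']
    # Step 3: keep only rows whose OID is not among them.
    return frame[~frame['OID'].isin(bad_oids)]


dfnew = drop_oids_with_responses(df, ['no', 'no2'])
print(dfnew)
```

The original index labels are preserved, so the surviving rows keep labels 0, 1, 7 and 8 as in the output above.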

A slightly slower solution with unique:

oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique()
print (oids)
[101 103]

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
df = pd.DataFrame({'ResponseType': np.random.choice(['ok','ok2','no2', 'no'], N),
                   'TYPE':['mod'] * N,
                   'OID':np.random.randint(100000, size=N)})
print (df)

In [285]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'])])
10 loops, best of 3: 67.2 ms per loop

In [286]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique())])
10 loops, best of 3: 69.5 ms per loop

#zipa solution
In [287]: %timeit (df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])])
10 loops, best of 3: 91.5 ms per loop

#groupby solution :(
In [288]: %timeit (df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') |  (x['ResponseType']=='no2')).any() ))
1 loop, best of 3: 1min 54s per loop
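
If a grouped formulation is still preferred, one possible alternative (not timed here, so treat any speed benefit as an assumption) is to compute the row-level mask once and broadcast a per-group any back to the rows with transform, which avoids calling a Python lambda for every OID. The small frame below is a stand-in for the question's data.

```python
import pandas as pd

# Small stand-in for the question's data.
df = pd.DataFrame({
    'OID': [100, 100, 101, 101, 101, 101, 101, 102, 102, 103, 103],
    'TYPE': ['mod'] * 11,
    'ResponseType': ['ok', 'ok', 'ok', 'ok', 'ok', 'ok', 'no',
                     'ok', 'ok2', 'ok', 'no2'],
})

# Row-level mask: True where the response is one of the unwanted values.
is_bad = df['ResponseType'].isin(['no', 'no2'])

# Broadcast "does this OID have any bad row?" back onto every row.
group_has_bad = is_bad.groupby(df['OID']).transform('any')

# Keep the rows whose OID has no bad response at all.
dfnew = df[~group_has_bad]
print(dfnew)
```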

You can do it like this:

df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])]

