Group and filter pandas dataframe
OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2
I want to drop every OID that has no or no2 as a response.
I tried:
dfnew = df.groupby('OID').filter(lambda x: ((x['ResponseType']=='no') | x['ResponseType']=='no2')).any() )
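But I get SyntaxError: invalid syntax
Another approach could be to build a set of all the OIDs to be filtered out and then use it to filter df. The df has 5000000 rows!
Expected output: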
OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
102,mod,ok
102,mod,ok2
You need to add one ( and a ~ to invert the boolean mask, but this is really slow:
dfnew = df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') |
(x['ResponseType']=='no2')).any() )
#here
print (dfnew)
OID TYPE ResponseType
0 100 mod ok
1 100 mod ok
7 102 mod ok
8 102 mod ok2
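To make the logic of the groupby filter easier to follow, here is a minimal self-contained sketch on the sample data from the question. It is a variant rather than the exact code above: it assumes the data is loaded with read_csv from an in-memory string, tests membership with isin instead of two == comparisons, and uses Python's not on the single boolean that .any() returns for each group.

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV = """OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2
"""

df = pd.read_csv(io.StringIO(CSV))

# filter() keeps a group only when the function returns True,
# so return True when the group contains no 'no'/'no2' response.
dfnew = df.groupby('OID').filter(
    lambda g: not g['ResponseType'].isin(['no', 'no2']).any()
)
print(dfnew)
```

The lambda is called once per OID, which is why this form gets slow when there are many distinct OIDs.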
Another solution, using boolean indexing with isin applied twice, is faster:
oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID']
print (oids)
6 101
10 103
Name: OID, dtype: int64
dfnew = df[~df['OID'].isin(oids)]
print (dfnew)
OID TYPE ResponseType
0 100 mod ok
1 100 mod ok
7 102 mod ok
8 102 mod ok2
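The two isin steps can be wrapped into a small reusable helper. This is only a sketch: the function name drop_oids_with_responses and its parameters are made up for illustration, and the sample data is assumed to be loaded from an in-memory CSV string.

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV = """OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2
"""


def drop_oids_with_responses(df, bad_responses):
    """Return the rows whose OID never appears with any of bad_responses.

    Hypothetical helper, not part of pandas.
    """
    # Step 1: collect the OIDs of every row with an unwanted response.
    bad_oids = df.loc[df['ResponseType'].isin(bad_responses), 'OID']
    # Step 2: keep only rows whose OID is not among them.
    return df[~df['OID'].isin(bad_oids)]


df = pd.read_csv(io.StringIO(CSV))
dfnew = drop_oids_with_responses(df, ['no', 'no2'])
print(dfnew)
```

Both steps are vectorized over the whole column, so no Python function is called per group.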
A slightly slower solution with unique:
oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique()
print (oids)
[101 103]
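If a groupby-style formulation is still preferred, one possible alternative (not from the answers in this thread) is to broadcast a per-group flag back to every row with transform('any') and then invert it. The variable names is_bad and group_has_bad are illustrative; a minimal sketch on the sample data:

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV = """OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2
"""

df = pd.read_csv(io.StringIO(CSV))

# Row-level flag: is this particular response unwanted?
is_bad = df['ResponseType'].isin(['no', 'no2'])

# Group-level flag, aligned to the rows: does this row's OID
# have at least one unwanted response anywhere?
group_has_bad = is_bad.groupby(df['OID']).transform('any')

dfnew = df[~group_has_bad]
print(dfnew)
```

This avoids calling a Python lambda per group, but it has not been timed here against the isin versions.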
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
df = pd.DataFrame({'ResponseType': np.random.choice(['ok','ok2','no2', 'no'], N),
'TYPE':['mod'] * N,
'OID':np.random.randint(100000, size=N)})
print (df)
In [285]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'])])
10 loops, best of 3: 67.2 ms per loop
In [286]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique())])
10 loops, best of 3: 69.5 ms per loop
#zipa solution
In [287]: %timeit (df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])])
10 loops, best of 3: 91.5 ms per loop
#groupby solution :(
In [288]: %timeit (df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') | (x['ResponseType']=='no2')).any() ))
1 loop, best of 3: 1min 54s per loop
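%timeit is an IPython magic, so the lines above only run in IPython or Jupyter. Outside of those, a similar comparison can be sketched with the standard-library timeit module. The function names with_isin and with_unique are made up for this example, and N is reduced to 100,000 rows to keep the run short, so the numbers will not match the ones above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100_000  # assumption: smaller than the 1,000,000 rows used above

df = pd.DataFrame({
    'ResponseType': np.random.choice(['ok', 'ok2', 'no2', 'no'], N),
    'TYPE': ['mod'] * N,
    'OID': np.random.randint(100000, size=N),
})


def with_isin(frame):
    # Double isin, passing the (possibly duplicated) OIDs directly.
    bad = frame.loc[frame['ResponseType'].isin(['no', 'no2']), 'OID']
    return frame[~frame['OID'].isin(bad)]


def with_unique(frame):
    # Same idea, but de-duplicate the OIDs first.
    bad = frame.loc[frame['ResponseType'].isin(['no', 'no2']), 'OID'].unique()
    return frame[~frame['OID'].isin(bad)]


for fn in (with_isin, with_unique):
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(lambda: fn(df), number=5, repeat=3)) / 5
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```

Results depend on the machine, the pandas version, and how many distinct OIDs the data contains, so it is worth rerunning on data shaped like the real 5000000-row frame.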
You can do it like this:
df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])]