What is the fastest way to check a pandas dataframe for elements?
I'm a bit confused about the best way to check a pandas dataframe column for items.
I am writing a program that raises an error if a certain column of the dataframe contains elements that are not allowed.
Here's an example:
import pandas as pd

raw_data = {'first_name': ['Jay', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [47, 42, 36, 24, 73],
            'preTestScore': [4, 4, 31, 2, 3],
            'postTestScore': [27, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)
which outputs
  first_name last_name  age  preTestScore  postTestScore
0        Jay     Jones   47             4             27
1      Jason    Miller   42             4             25
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70
If column last_name contains anything besides Jones, Miller, Ali, Milner, or Cooze, raise a warning.
One could possibly use pandas.DataFrame.isin, but it's not clear to me that this is the most efficient approach.
Something like:
# DataFrame.isin takes a dict mapping column names to the allowed values
if not df.isin({'last_name': ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']})['last_name'].all():
    raise ValueError("Column `last_name` includes ill-formed elements.")
I think you can use all to check whether every value matches:
if not df['last_name'].isin(['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']).all():
    raise ValueError("Column `last_name` includes ill-formed elements.")
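The isin/all check can also report which values failed. The following is a minimal sketch, not part of the original answer; the helper name validate_column and the extended error message are assumptions made for illustration.

```python
import pandas as pd

def validate_column(df, column, allowed):
    """Raise ValueError if df[column] holds values outside `allowed`.

    Hypothetical helper: the name and message format are illustrative.
    """
    # Boolean Series: True where the value is one of the allowed items.
    is_allowed = df[column].isin(allowed)
    if not is_allowed.all():
        # Collect the distinct offending values for the error message.
        bad = df.loc[~is_allowed, column].drop_duplicates().tolist()
        raise ValueError(
            f"Column `{column}` includes ill-formed elements: {bad}"
        )

allowed_names = {'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}
df_ok = pd.DataFrame({'last_name': ['Jay', 'Jones'][1:] + ['Ali']})
validate_column(df_ok, 'last_name', allowed_names)  # passes silently
```

Calling the helper on a frame that contains any other name raises ValueError and names the unexpected values.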
Another solution with issubset (the column's values must be a subset of the allowed names, not the other way round):
if not set(df['last_name']).issubset({'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}):
    raise ValueError("Column `last_name` includes ill-formed elements.")
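With issubset, the direction of the test decides what is being checked. Here is a small sketch with made-up data (the Series names and the value 'Smith' are assumptions for illustration) that contrasts both directions and uses a set difference to list the offending values.

```python
import pandas as pd

allowed = {'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}
names = pd.Series(['Jones', 'Ali', 'Smith'])  # 'Smith' is not allowed

# Values present in the column but missing from the allowed set.
unexpected = set(names) - allowed

# This direction asks: "is every value in the column allowed?"
column_is_valid = set(names).issubset(allowed)

# The reverse asks: "does every allowed name occur in the column?"
all_allowed_present = allowed.issubset(names)

print(unexpected, column_is_valid, all_allowed_present)
```

Only the first direction answers the question of whether the column contains disallowed values.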
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = list('abcdefghijklmno')
df = pd.DataFrame({'last_name': np.random.choice(L, N)})
print(df)
In [245]: %timeit df['last_name'].isin(L).all()
The slowest run took 4.73 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 421 µs per loop
In [247]: %timeit set(L).issubset(df['last_name'])
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop
In [248]: %timeit df.loc[~df['last_name'].isin(L), 'last_name'].any()
1000 loops, best of 3: 562 µs per loop
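Caveat:
Performance really depends on the data: the number of rows and the number of non-matching values. Note also that set(L).issubset(df['last_name']), as timed above, tests whether every name in L occurs in the column, which is the reverse of the isin check, so those two timings do not measure the same test.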
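Because the result depends on the data, it is worth timing the candidates on your own column. The sketch below uses the standard timeit module instead of IPython's %timeit; the function names are hypothetical, and the "unique first" variant is an extra idea not taken from the answers, which may or may not be faster for a given dataset.

```python
import timeit

import numpy as np
import pandas as pd

def check_isin(series, allowed):
    # Vectorised membership test over every row.
    return bool(series.isin(allowed).all())

def check_unique_first(series, allowed):
    # Assumption: reduce the column to its distinct values first,
    # then compare that (usually small) set with the allowed values.
    return set(series.unique()).issubset(allowed)

rng = np.random.default_rng(123)
allowed = list('abcdefghijklmno')
series = pd.Series(rng.choice(allowed, 10_000))

for fn in (check_isin, check_unique_first):
    seconds = timeit.timeit(lambda: fn(series, allowed), number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e6:.0f} µs per call")
```

Changing the number of rows, the number of distinct values, or the share of invalid entries in the generated series shows how much the ranking moves.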
You can use loc:
if df.loc[~df['last_name'].isin({'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}), 'last_name'].any():
    raise ValueError("Column `last_name` includes ill-formed elements.")
This checks whether last_name contains any values other than those specified.
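A small sketch of the same idea with the mask and the selection split into separate steps, on made-up data that includes an empty string. The variable names are assumptions for illustration. Testing the boolean mask itself (or the emptiness of the selection) avoids relying on the truthiness of the selected strings.

```python
import pandas as pd

allowed = {'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}
df = pd.DataFrame({'last_name': ['Jones', 'Smith', 'Ali', '']})

# Boolean mask: True where the value is not in the allowed set.
invalid_mask = ~df['last_name'].isin(allowed)

# Rows that violate the constraint, selected with loc.
invalid_values = df.loc[invalid_mask, 'last_name']

# An empty string is falsy, so calling .any() on the selected strings
# could miss it; the mask is boolean and has no such edge case.
has_invalid = bool(invalid_mask.any())

if has_invalid:
    print("Unexpected values:", invalid_values.tolist())
```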