What is the fastest way to check a pandas DataFrame for elements?

Question

I'm a bit confused about the best way to check a pandas DataFrame column for items.

I am writing a program that raises an error if a certain column of the DataFrame contains elements that are not allowed.

Here's an example:

import pandas as pd

raw_data = {'first_name': ['Jay', 'Jason', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze'], 
        'age': [47, 42, 36, 24, 73], 
        'preTestScore': [4, 4, 31, 2, 3],
        'postTestScore': [27, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)

which outputs

  first_name last_name  age  preTestScore  postTestScore
0        Jay     Jones   47             4             27
1      Jason    Miller   42             4             25
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70

If column last_name contains anything besides Jones, Miller, Ali, Milner, or Cooze, raise a warning.

One could possibly use pandas.DataFrame.isin, but it's not clear to me that this is the most efficient approach.

Something like:

if not df.isin({'last_name': ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']})['last_name'].all():
    raise ValueError("Column `last_name` includes ill-formed elements.")

Answer 1

I think you can use all to check whether every value matches:

if not df['last_name'].isin(['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']).all():
    raise ValueError("Column `last_name` includes ill-formed elements.")

Another solution with issubset, which tests that the column's values form a subset of the allowed names:

if not set(df['last_name']).issubset(['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']):
    raise ValueError("Column `last_name` includes ill-formed elements.")
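The set-based check can also report which values are unexpected. The following is a minimal sketch, not a definitive implementation: `check_allowed` is a hypothetical helper name (it is not part of pandas), and the `Smith` row is made-up sample data. The direction of the comparison matters: the column's distinct values must all fall inside the allowed set.

```python
import pandas as pd

ALLOWED = {'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}


def check_allowed(column, allowed):
    """Raise ValueError if `column` holds any value outside `allowed`.

    Hypothetical helper for illustration; not part of pandas.
    """
    # unique() reduces the column to its distinct values before the set difference
    unexpected = set(column.unique()) - set(allowed)
    if unexpected:
        raise ValueError(
            f"Column `{column.name}` includes ill-formed elements: "
            f"{sorted(unexpected)}"
        )


good = pd.DataFrame({'last_name': ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']})
check_allowed(good['last_name'], ALLOWED)  # passes silently

bad = pd.DataFrame({'last_name': ['Jones', 'Smith']})  # 'Smith' is made-up data
try:
    check_allowed(bad['last_name'], ALLOWED)
except ValueError as exc:
    print(exc)
```

Listing the unexpected values in the message tends to make the failure easier to debug than a bare "ill-formed elements" error.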

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = list('abcdefghijklmno')

# 10,000 rows drawn at random from the 15 allowed letters
df = pd.DataFrame({'last_name': np.random.choice(L, N)})
print(df)

In [245]: %timeit df['last_name'].isin(L).all()
The slowest run took 4.73 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 421 µs per loop

In [247]: %timeit set(L).issubset(df['last_name'])
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [248]: %timeit df.loc[~df['last_name'].isin(L), 'last_name'].any()
1000 loops, best of 3: 562 µs per loop

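Caveat:

Performance really depends on the data: the number of rows and the number of non-matching values. Also note that set(L).issubset(df['last_name']) tests whether every element of L occurs in the column, which is the reverse of the validation check, so its timing is not directly comparable with the other two.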
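Since the numbers depend on the data, it may be worth measuring on inputs that resemble your own. Here is a rough sketch using the standard-library `timeit` module outside IPython. The helper name `time_check`, the row counts, and the `'zz'` sentinel value are all assumptions chosen for illustration, and the printed timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
allowed = list('abcdefghijklmno')


def time_check(column, number=20):
    """Return average seconds per call of the isin(...).all() check on `column`."""
    total = timeit.timeit(lambda: column.isin(allowed).all(), number=number)
    return total / number


for n_rows in (1_000, 100_000):
    clean = pd.Series(rng.choice(allowed, n_rows), name='last_name')
    # Same size, but with one value that is not in `allowed`
    dirty = clean.copy()
    dirty.iloc[0] = 'zz'
    print(n_rows, 'clean:', time_check(clean), 'dirty:', time_check(dirty))
```

The same harness can be pointed at the other candidate expressions by swapping the lambda.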

Answer 2

You can use loc:

if df.loc[~df['last_name'].isin({'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}), 'last_name'].any():
    raise ValueError("Column `last_name` includes ill-formed elements.")

This checks whether last_name contains any values other than those specified.
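Because the question asks for a warning, the same mask can feed `warnings.warn` instead of raising. This is a sketch under stated assumptions: the three-row frame below (including `Sam Smith`) is made-up sample data, not the data from the question.

```python
import warnings

import pandas as pd

# Made-up sample data; 'Smith' is the one value outside the allowed list
df = pd.DataFrame({'first_name': ['Jay', 'Tina', 'Sam'],
                   'last_name': ['Jones', 'Ali', 'Smith']})
allowed = ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']

# Boolean mask marking rows whose last_name is not in the allowed list
invalid = ~df['last_name'].isin(allowed)

if invalid.any():
    # loc with the mask returns only the offending rows
    offending = df.loc[invalid, 'last_name']
    warnings.warn(
        f"Column `last_name` includes {len(offending)} ill-formed element(s): "
        f"{offending.tolist()}"
    )
```

Calling `.any()` on the boolean mask, rather than on the selected strings, keeps the test independent of whether the offending values themselves happen to be truthy.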

2 answers

Answer 1: score 2, accepted, 2017-12-20 10:13:36

Answer 2: score 2, 2017-12-20 10:23:10
