Optimizing string match in Pandas
Currently I have the following line of code, in which I do a string match on a column of my pandas DataFrame:
input_supplier = input_supplier[input_supplier['Category Level - 3'].str.contains(category, flags=re.IGNORECASE)]
However, this operation takes a long time. The shape of the DataFrame is (8098977, 16).
Is there any way to optimize this particular operation?
As Josh Friedlander said, adding a column and then filtering on it should be somewhat faster:
len(df3)
9599904
import time
# Creating a column, then filtering on it
start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
df3['search'] = df3['First'].str.contains('|'.join(search))
new_df = df3[df3['search'] == True]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')
Elapsed time was 6.525546073913574 seconds
Doing just a str.contains:
import re
start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
input_supplier = df3[df3['First'].str.contains('|'.join(search), flags=re.IGNORECASE)]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')
Elapsed time was 11.464462518692017 seconds
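In this run, creating a new column and filtering on it was roughly twice as fast as filtering directly on str.contains() (about 6.5 s versus 11.5 s). Note that the two runs are not like-for-like: only the second passes flags=re.IGNORECASE, so part of the gap may come from case-insensitive matching rather than from the extra column.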
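To separate the effect of the helper column from the effect of the flag, the pattern and flags can be held fixed. A minimal sketch on a small made-up frame (the `df` here is hypothetical, not the benchmark data) suggests that both forms select the same rows, so a fair timing comparison would vary only one of them:

```python
import re
import pandas as pd

# Hypothetical toy data standing in for the large frame in the benchmark.
df = pd.DataFrame({"First": ["Emma", "emma", "Walter", "RYAN", "Gus"]})
pattern = "|".join(["Emma", "Ryan"])

# Variant 1: store the boolean result in a helper column, then filter on it.
df["search"] = df["First"].str.contains(pattern, flags=re.IGNORECASE)
via_column = df[df["search"]]

# Variant 2: use the boolean mask directly, with the same pattern and flags.
mask = df["First"].str.contains(pattern, flags=re.IGNORECASE)
via_mask = df[mask]

print(via_column["First"].tolist())
print(via_mask["First"].tolist())
```

Wrapping each variant in the same timing code on the real data would show how much of the difference remains once the flags match.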
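Use numpy's fast `where` and `isin` functions after converting the search column values and the category list values to lowercase. If the column and/or the category list contain non-strings, convert them to strings first. If you want to see all columns of the original DataFrame for the matching rows, remove the column label on the last line. Note that `isin` matches whole values exactly, whereas `str.contains` matches substrings, so the two are interchangeable only when the column holds the category names themselves.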
import numpy as np
import pandas as pd
import re
names = np.array(['Walter', 'Emma', 'Gus', 'Ryan', 'Skylar', 'Gerald',
'Saul', 'Billy', 'Jesse', 'Helen'] * 1000000)
input_supplier = pd.DataFrame(names, columns=['Category Level - 3'])
len(input_supplier)
10000000
category = ['Emma', 'Ryan', 'Gerald', 'Billy', 'Helen']
Method 1 (note that this method does not ignore case)
%%timeit
input_supplier['search'] = \
input_supplier['Category Level - 3'].str.contains('|'.join(category))
df1 = input_supplier[input_supplier['search'] == True]
4.42 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Method 2
%%timeit
df2 = input_supplier[input_supplier['Category Level - 3'].str.contains(
'|'.join(category), flags=re.IGNORECASE)]
5.45 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
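Both methods build the pattern with '|'.join(category), which treats every entry as a regular expression. That works for plain names, but an entry containing characters such as `+` or `(` could raise an error or match something unintended. A hedged sketch, using made-up category values, that escapes each entry with `re.escape` before joining:

```python
import re
import pandas as pd

# Hypothetical values containing regex metacharacters.
s = pd.Series(["C++ tools", "c++ tools", "C tools", "R&D (internal)"])
category = ["C++", "(internal)"]

# Escape each entry so it is matched literally, then join with '|'.
pattern = "|".join(re.escape(c) for c in category)

mask = s.str.contains(pattern, flags=re.IGNORECASE)
print(s[mask].tolist())
```

For the names used in the benchmark, escaping changes nothing; it only matters once the category list comes from less controlled data.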
numpy method, ignoring case:
%%timeit
lcase_vals = [x.lower() for x in input_supplier['Category Level - 3']]
category_lcase = [x.lower() for x in category]
df3 = input_supplier.iloc[np.where(np.isin(lcase_vals, category_lcase))[0]]
2.02 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
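The lowercase-then-`isin` idea can also be written with pandas string methods instead of a list comprehension. The sketch below uses a small made-up Series (not the benchmark data) and puts the result next to a case-insensitive `str.contains`, to show where whole-value matching and substring matching diverge:

```python
import pandas as pd

# Hypothetical toy column; the real one has millions of rows.
col = pd.Series(["Emma", "EMMA", "Emma Jones", "Walter", "ryan"])
category = ["Emma", "Ryan"]
category_lcase = [c.lower() for c in category]

# Case-insensitive exact match: lowercase both sides, then test membership.
exact = col.str.lower().isin(category_lcase)

# Case-insensitive substring match, for comparison.
substring = col.str.contains("|".join(category), case=False)

print(col[exact].tolist())
print(col[substring].tolist())
```

Only the substring version keeps "Emma Jones", so which approach is appropriate depends on what the column actually contains.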
Case-sensitive matching method:
%%timeit
col_vals = input_supplier['Category Level - 3'].values
df4 = input_supplier.iloc[np.where(np.isin(col_vals, category))[0]]
623 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
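This next idea is not from the answers above, but it is a related sketch. When a column has few distinct values (as in this benchmark, where ten names repeat a million times each), the regex can be run once per distinct value and the result mapped back to the rows with `isin`. This keeps the substring and case-insensitive semantics of `str.contains`. The data below is made up and no timing is claimed; whether it helps depends on how many distinct values the real column has.

```python
import re
import pandas as pd

# Hypothetical column with many rows but few distinct values.
df = pd.DataFrame(
    {"Category Level - 3": ["Emma Jones", "Walter", "ryan", "Gus"] * 1000}
)
category = ["Emma", "Ryan"]
pattern = "|".join(re.escape(c) for c in category)

# Run the regex only on the distinct values of the column.
uniques = pd.Series(df["Category Level - 3"].unique())
matching = uniques[uniques.str.contains(pattern, flags=re.IGNORECASE, na=False)]

# Map the result back to every row with an exact membership test.
result = df[df["Category Level - 3"].isin(matching)]

print(len(result))
print(sorted(result["Category Level - 3"].unique()))
```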