
Optimizing string match in Pandas

Currently, I have the following line, where I try to do a string match on a column of my pandas DataFrame:

input_supplier = input_supplier[input_supplier['Category Level - 3'].str.contains(category, flags=re.IGNORECASE)]

However, this operation takes a lot of time. The size of the pandas DataFrame is (8098977, 16).

Is there any way to optimize this particular operation?

As Josh Friedlander said, it should be a little faster to add a column and then filter on it:

len(df3)

9599904

# Creating a column then filtering
import time

start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
df3['search'] = df3['First'].str.contains('|'.join(search))
new_df = df3[df3['search'] == True]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')

Elapsed time was 6.525546073913574 seconds

Compared with just doing a str.contains:

import re

start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
input_supplier = df3[df3['First'].str.contains('|'.join(search), flags=re.IGNORECASE)]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')

Elapsed time was 11.464462518692017 seconds

It is nearly twice as fast to create a new column and filter on it than to filter directly on str.contains().
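For the original question, the same pattern would look roughly like this (a sketch only, keeping the single regex pattern and the re.IGNORECASE flag from the question):

# Sketch: pre-compute the match column once, then filter on it
input_supplier['search'] = input_supplier['Category Level - 3'].str.contains(category, flags=re.IGNORECASE)
input_supplier = input_supplier[input_supplier['search'] == True]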

Use the fast numpy "where" and "isin" functions after converting the search column values and the category list values to lower case. If the column and/or the category list contain non-strings, convert them to strings first. Delete the column label in the last line if you want to see all columns of the original dataframe indexed to the search column results.
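The string conversion step is not shown in the benchmark below, but on the question's dataframe it could look roughly like this (a minimal sketch, assuming category is a list as in the benchmark):

# Sketch: make sure the search column and the category list hold strings
input_supplier['Category Level - 3'] = input_supplier['Category Level - 3'].astype(str)
category = [str(c) for c in category]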

import numpy as np
import pandas as pd
import re

names = np.array(['Walter', 'Emma', 'Gus', 'Ryan', 'Skylar', 'Gerald',
                  'Saul', 'Billy', 'Jesse', 'Helen'] * 1000000)
input_supplier = pd.DataFrame(names, columns=['Category Level - 3'])

len(input_supplier)
10000000

category = ['Emma', 'Ryan', 'Gerald', 'Billy', 'Helen']

Method 1 (note this method does not ignore case)

%%timeit
input_supplier['search'] = \
    input_supplier['Category Level - 3'].str.contains('|'.join(category))
df1 = input_supplier[input_supplier['search'] == True]

4.42 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Method 2

%%timeit
df2 = input_supplier[input_supplier['Category Level - 3'].str.contains(
    '|'.join(category), flags=re.IGNORECASE)]

5.45 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numpy method ignoring case:

%%timeit
lcase_vals = [x.lower() for x in input_supplier['Category Level - 3']]
category_lcase = [x.lower() for x in category]
df3 = input_supplier.iloc[np.where(np.isin(lcase_vals, category_lcase))[0]]

2.02 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numpy method if matching case:

%%timeit
col_vals = input_supplier['Category Level - 3'].values
df4 = input_supplier.iloc[np.where(np.isin(col_vals, category))[0]]

623 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
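Applied back to the question's dataframe, the case-insensitive numpy approach would look roughly like this (a sketch; note that np.isin tests exact equality against the category values, unlike the substring matching done by str.contains):

# Sketch: exact, case-insensitive match of the column against the category list
lcase_vals = input_supplier['Category Level - 3'].str.lower().values
category_lcase = [c.lower() for c in category]
input_supplier = input_supplier.iloc[np.where(np.isin(lcase_vals, category_lcase))[0]]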
