
Optimizing string match in Pandas

Currently, I have the following line, which does a string match on a column of my pandas DataFrame:

input_supplier = input_supplier[input_supplier['Category Level - 3'].str.contains(category, flags=re.IGNORECASE)]

However, this operation takes a long time. The shape of the DataFrame is (8098977, 16).

Is there any way to optimize this particular operation?

As Josh Friedlander said, it should be a little faster to add a column and then filter on it:

len(df3)

9599904

# Creating a column, then filtering on it
import time

start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
# Store the boolean match result as a new column (case-sensitive match)
df3['search'] = df3['First'].str.contains('|'.join(search))
new_df = df3[df3['search'] == True]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')

Elapsed time was 6.525546073913574 seconds

Filtering directly on str.contains:

import re
import time

start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
# Filter in one step, this time ignoring case
input_supplier = df3[df3['First'].str.contains('|'.join(search), flags=re.IGNORECASE)]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')

Elapsed time was 11.464462518692017 seconds

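In this run, creating a new column and filtering on it was a little under twice as fast as filtering directly on str.contains() (about 6.5 s versus 11.5 s). Note that only the second run passes flags=re.IGNORECASE, so part of the difference may come from case-insensitive matching rather than from the extra column.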
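
To compare the two variants on equal terms, both can be run with the same flags. The following is a minimal sketch of such a comparison on a small synthetic frame; the sample data, the frame size and the helper name time_filter are assumptions made for illustration and are not from the original post.

```python
import re
import time

import pandas as pd

# Hypothetical small sample; the frame in the answer had about 9.6 million rows.
df3 = pd.DataFrame({'First': ['Emma', 'Walter', 'ryan', 'Gus', 'Helen'] * 20000})
search = ['Emma', 'Ryan', 'Gerald', 'Billy', 'Helen']
pattern = '|'.join(search)


def time_filter(func):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def direct():
    # Filter directly on the boolean mask without storing it.
    return df3[df3['First'].str.contains(pattern, flags=re.IGNORECASE)]


def via_column():
    # Store the boolean mask as a column of df3, then filter on that column.
    df3['search'] = df3['First'].str.contains(pattern, flags=re.IGNORECASE)
    return df3[df3['search']]


# Run the direct variant first, because via_column adds a column to df3.
by_direct, t_direct = time_filter(direct)
by_column, t_column = time_filter(via_column)
print(f'direct: {t_direct:.3f} s, via column: {t_column:.3f} s')
```

Both variants select the same rows, so any remaining difference in the printed timings comes from how the mask is stored and applied. A single run is noisy; repeating the measurement (for example with %%timeit) gives a steadier picture.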

Use the fast NumPy functions "where" and "isin" after converting the search column values and the category list values to lower case. Note that "isin" tests for whole-value equality, whereas str.contains tests for a substring match. If the column and/or the category list contain non-strings, convert them to strings first. Delete the column label in the last line if you want to see all columns of the original DataFrame for the matching rows.

import numpy as np
import pandas as pd
import re

names = np.array(['Walter', 'Emma', 'Gus', 'Ryan', 'Skylar', 'Gerald',
                  'Saul', 'Billy', 'Jesse', 'Helen'] * 1000000)
input_supplier = pd.DataFrame(names, columns=['Category Level - 3'])

len(input_supplier)
10000000

category = ['Emma', 'Ryan', 'Gerald', 'Billy', 'Helen']

Method 1 (note this method does not ignore case)

%%timeit
input_supplier['search'] = \
    input_supplier['Category Level - 3'].str.contains('|'.join(category))
df1 = input_supplier[input_supplier['search'] == True]

4.42 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Method 2

%%timeit
df2 = input_supplier[input_supplier['Category Level - 3'].str.contains(
    '|'.join(category), flags=re.IGNORECASE)]

5.45 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

NumPy method, ignoring case:

%%timeit
lcase_vals = [x.lower() for x in input_supplier['Category Level - 3']]
category_lcase = [x.lower() for x in category]
df3 = input_supplier.iloc[np.where(np.isin(lcase_vals, category_lcase))[0]]

2.02 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
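
The case-insensitive lookup above can also be written with pandas' own vectorised string methods. The sketch below uses hypothetical sample values (not from the original post) to show how a whole-value match with isin differs from the substring match that str.contains performs.

```python
import pandas as pd

# Hypothetical sample data chosen to show the difference between the two matches.
input_supplier = pd.DataFrame(
    {'Category Level - 3': ['Emma', 'EMMA', 'Emmanuel', 'Gus', 'helen']})
category = ['Emma', 'Helen']

col = input_supplier['Category Level - 3']
category_lcase = [c.lower() for c in category]

# Whole-value, case-insensitive match: 'Emmanuel' is not selected.
exact = input_supplier[col.str.lower().isin(category_lcase)]

# Substring, case-insensitive match: 'Emmanuel' is selected.
substring = input_supplier[col.str.contains('|'.join(category), case=False)]

print(exact['Category Level - 3'].tolist())
print(substring['Category Level - 3'].tolist())
```

If the column holds complete category names, the isin form is enough; if the search term may appear inside a longer value, only the str.contains form returns those rows.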

NumPy method, matching case:

%%timeit
col_vals = input_supplier['Category Level - 3'].values
df4 = input_supplier.iloc[np.where(np.isin(col_vals, category))[0]]

623 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
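
When a substring match is still required, one possible way to combine the two ideas is to run the regular expression only on the distinct values of the column and then select rows with isin. This is a sketch of a different technique from the methods timed above, not one of them; it assumes the column has far fewer distinct values than rows, and the smaller data size is chosen only for illustration.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical data mirroring the setup above, at a smaller size.
names = np.array(['Walter', 'Emma', 'Gus', 'Ryan', 'Skylar', 'Gerald',
                  'Saul', 'Billy', 'Jesse', 'Helen'] * 1000)
input_supplier = pd.DataFrame(names, columns=['Category Level - 3'])
category = ['emma', 'ryan']

col = input_supplier['Category Level - 3']

# Run the regex once per distinct value instead of once per row.
uniques = pd.Series(col.unique())
hits = uniques[uniques.str.contains('|'.join(category), flags=re.IGNORECASE)]

# Map the matching distinct values back to rows with an exact lookup.
result = input_supplier[col.isin(hits)]
print(len(result))
```

The benefit depends on the ratio of distinct values to rows: with ten distinct names the regex runs ten times, while a column of mostly unique strings would gain little or nothing.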
