
How to increase the speed of fuzzy matching in a dataframe?

I want to use fuzzy matching to check whether a dataframe contains keywords.

However, using apply is very slow.

Are there any faster methods?

Can we use the pandas str methods or re instead?

import regex

# Compiling the pattern inside apply re-runs regex.compile for every row — slow.
result = df['sentence'].apply(lambda x: regex.compile('(keyword){e<4}').findall(x))

Thank you very much.

Why are you compiling inside the apply? That defeats the purpose of compiling. Also, the best way to speed up an apply call is to not use apply at all.

Without knowing what you're actually trying to match, I present to you:

p = regex.compile('(keyword){e<4}')  # compile once, up front
result = [p.findall(x) for x in df['sentence']]

My tests show that a list-comprehension-based regex match outperforms the pandas str methods. Take that with a grain of salt, though: it always depends on your data and what you're trying to match.
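If you want to verify this on your own data, here is a minimal timing sketch (the sample dataframe below is a made-up stand-in; substitute your real one):

import timeit

import pandas as pd
import regex

# Hypothetical sample data — replace with your actual dataframe.
df = pd.DataFrame({'sentence': ['some keyword here', 'nothing to see'] * 5000})

p = regex.compile('(keyword){e<4}')  # fuzzy: fewer than 4 errors allowed

# Compile once, then run a plain list comprehension.
t_listcomp = timeit.timeit(lambda: [p.findall(x) for x in df['sentence']], number=5)

# Original approach: regex.compile is re-invoked for every row inside apply.
t_apply = timeit.timeit(
    lambda: df['sentence'].apply(lambda x: regex.compile('(keyword){e<4}').findall(x)),
    number=5)

print(f'list comprehension: {t_listcomp:.3f}s  apply: {t_apply:.3f}s')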

You may want to consider using search instead of findall if you just need the first match per row (for more performance). Note that the fuzzy {e<4} syntax requires the third-party regex module; the standard library re does not support it, so use p.search on the precompiled pattern rather than re.search.
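For instance, if all you need is a per-row yes/no, a sketch reusing the precompiled p from above:

# True/False per row: does the sentence fuzzily contain the keyword?
result = [p.search(x) is not None for x in df['sentence']]

search stops at the first match instead of collecting all of them, which is why it can be cheaper than findall when one hit is enough.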
