Fuzzy regex match on million rows Pandas df

I am trying to check for a fuzzy match between a string column and a reference list. The string series contains over 1 million rows and the reference list contains over 10,000 entries.

For example:

df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows

ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k rows

### Output should look like:

df['MATCH'] = pd.Series([np.nan, 'XANDER', 'MANDER', 'PARIS', 'HARIS', np.nan, 'PARIS', np.nan])

A match should be generated if the reference word appears as a separate word in the string (and within that word, up to 1 character substitution is allowed).

For example, 'PARIS' can match against 'PARIS HILTON' and 'THE HARIS DOWNTOWN', but not against 'APARISIAN'.

Similarly, 'XANDER' matches against 'NOVA XANDER' and 'SALA MANDER' ('MANDER' being 1 character different from 'XANDER'), but does not generate a match against 'ALEXANDERS'.
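
To make the rule concrete, here is a minimal standalone check of the substitution part using the regex module's fuzzy-matching syntax (the "separate word" requirement is handled separately in the code below):

import regex

# Up to one character substitution is allowed within the word itself.
print(regex.fullmatch(r'(?:XANDER){s<=1}', 'MANDER'))      # match: one substitution (M for X)
print(regex.fullmatch(r'(?:XANDER){s<=1}', 'ALEXANDERS'))  # None: extra characters cannot be absorbed by substitutions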

As of now, we have written the logic for this (shown below), but the match takes over 4 hours to run. I need to get this down to under 30 minutes.

Current code -

import regex

tags_regex = ref_df['REF_NAMES'].tolist()
# One alternative per tag for each position: surrounded by spaces, at the start, or at the end
tags_ptn_regex = '|'.join([fr'\s+{tag}\s+|^{tag}\s+|\s+{tag}$' for tag in tags_regex])


def search_it(partyname):
    m = regex.search("(" + tags_ptn_regex + "){s<=1:[A-Z]}", partyname)
    if m is not None:
        return m.group()
    else:
        return None


df['MATCH'] = df['NAMES'].apply(search_it)

Also, will multiprocessing help with regex? Many thanks in advance!

Your pattern is rather inefficient, as it repeats each tag three times in the regex. You just need to build the pattern with so-called whitespace boundaries, (?<!\S) and (?!\S), and then a single occurrence of each tag is enough.
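
For example, with the two reference names from the question, the whole list becomes a single fuzzy alternation (shown inline here just for illustration; the full snippet below builds it from a trie):

import regex

# Each tag appears once, wrapped in whitespace boundaries; the whole group is fuzzy.
p = regex.compile(r'(?:(?<!\S)(?:XANDER|PARIS)(?!\S)){s<=1:[A-Z]}')
print(p.search('THE HARIS DOWNTOWN'))  # should find 'HARIS' (one substitution away from PARIS)
print(p.search('APARISIAN'))           # should be None: 'PARIS' is not a separate word here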

Next, if you have several thousand alternatives, even a regex with a single occurrence of each tag will be extremely slow, because many alternatives can attempt to match at the same location in the string, which leads to excessive backtracking.

To reduce this backtracking, you will need a regex trie.
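
To see what the trie buys you, compare a plain alternation with its prefix-compressed equivalent (the word list here is illustrative; the class below generates the compressed form automatically):

# Plain alternation: the engine retries every alternative at each position.
plain = r'PARIS|PARKS|PARTY|XANDER'

# Trie-compressed alternation: shared prefixes are factored out, so alternatives
# that start the same way no longer trigger repeated backtracking.
compressed = r'(?:PAR(?:IS|KS|TY)|XANDER)'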

Here is a working snippet:

import regex
import pandas as pd

## Class to build a regex trie, see https://stackoverflow.com/a/42789508/3832970
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return regex.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:  # recurse is None when this branch only terminates a word
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

## Start of main code
df = pd.DataFrame()
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df = pd.DataFrame()
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) # 10k rows

trie = Trie()
for word in ref_df['REF_NAMES'].tolist():
    trie.add(word)

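# The whole group, including the whitespace-boundary lookarounds, is matched fuzzily:
# up to one character substitution is allowed per match, restricted to [A-Z].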
tags_ptn_regex = regex.compile(r"(?:(?<!\S)(?:{})(?!\S)){{s<=1:[A-Z]}}".format(trie.pattern()), regex.IGNORECASE)

def search_it(partyname):
    m = tags_ptn_regex.search(partyname)
    if m is not None:
        return m.group()
    else:
        return None
    
df['MATCH'] = df['NAMES'].apply(search_it)
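
As a quick sanity check on the sample data above (values per the question's expected output), trie.pattern() should collapse the two reference names into something like (?:PARIS|XANDER):

print(trie.pattern())
print(df['MATCH'].tolist())
# Expected: [None, 'XANDER', 'MANDER', 'PARIS', 'HARIS', None, 'PARIS', None]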
