Apply fuzzy string matching on two columns in two Pandas DataFrames while keeping the similarity score and outputting a Pandas DataFrame

Question

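I have two data frames that I want to merge on company name, which acts as the primary key in one and the foreign key in the other. One dataset has about 50,000 unique company names and the other has about 5,000. Each list can contain duplicate company names.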

To do this, I tried to follow the first solution in Figure out if a business name is very similar to another - Python. Here is an MWE:

import pandas as pd

mwe1 = pd.DataFrame({'company_name': ['Deloitte', 
                                      'PriceWaterhouseCoopers', 
                                      'KPMG',
                                      'Ernst & Young',
                                      'intentionall typo company XYZ'
                                     ],
                    'revenue': [100, 200, 300, 250, 400]
                   }
                  )

mwe2 = pd.DataFrame({'salesforce_name': ['Deloite',
                                         'PriceWaterhouseCooper'
                                        ],
                     'CEO': ['John', 'Jane']
                    }
                   )

I am trying to get the following code from Figure out if a business name is very similar to another - Python to work:

# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    # token2frequency is a mapping (word -> count), so index it rather than call it
    return sum(1/token2frequency[t]**0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a.split())
    b_tokens = set(b.split())
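    # Pass the frequency table through, and intersect the token sets (not the raw strings)
    a_uniq = sequence_uniqueness(a_tokens, token2frequency)
    b_uniq = sequence_uniqueness(b_tokens, token2frequency)
    return sequence_uniqueness(a_tokens.intersection(b_tokens), token2frequency)/(a_uniq * b_uniq) ** 0.5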

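How can I apply these two functions to produce a similarity score for every possible combination of rows in mwe1 and mwe2, and then filter down to the most likely matches?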

For example, I am looking for something like this (I just made up the scores in the similarity_score column):

company_name                   revenue    salesforce_name         CEO     similarity_score
Deloitte                       100        Deloite                 John    98
PriceWaterhouseCoopers         200        Deloite                 John    0
KPMG                           300        Deloite                 John    15
Ernst & Young                  250        Deloite                 John    10
intentionall typo company XYZ  400        Deloite                 John    2
Deloitte                       100        PriceWaterhouseCooper   Jane    20
PriceWaterhouseCoopers         200        PriceWaterhouseCooper   Jane    97
KPMG                           300        PriceWaterhouseCooper   Jane    5
Ernst & Young                  250        PriceWaterhouseCooper   Jane    7
intentionall typo company XYZ  400        PriceWaterhouseCooper   Jane    3

I am also open to a better end state if you can think of one. I would then filter the table above to get something like this:

company_name                   revenue    salesforce_name         CEO     similarity_score
Deloitte                       100        Deloite                 John    98
PriceWaterhouseCoopers         200        PriceWaterhouseCooper   Jane    97

Here is what I have tried:

name_similarity(a = mwe1['company_name'], b = mwe2['salesforce_name'], token2frequency = 10)
AttributeError: 'Series' object has no attribute 'split'

I am familiar with lambda functions, but I am not sure how to make one work when iterating over two columns in two Pandas DataFrames.

Answer 1

Here is a class I wrote using difflib that should be close to what you need.

import difflib

import pandas as pd


class FuzzyMerge:
    """
    Works like pandas merge except merges on approximate matches.
    """
    def __init__(self, **kwargs):
        self.left = kwargs.get("left")
        self.right = kwargs.get("right")
        self.left_on = kwargs.get("left_on")
        self.right_on = kwargs.get("right_on")
        self.how = kwargs.get("how", "inner")
        self.cutoff = kwargs.get("cutoff", 0.8)

    def merge(self) -> pd.DataFrame:
        temp = self.right.copy()
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]

        df = self.left.merge(temp, on=self.left_on, how=self.how)
        df["similarity_percent"] = df.apply(lambda x: self.similarity_score(x[self.left_on], x[self.right_on]), axis=1)

        return df

    def get_closest_match(self, left: str, right: pd.Series) -> "str | None":
        # difflib returns the candidates sorted best-first; keep only the top one
        matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)

        return matches[0] if matches else None

    @staticmethod
    def similarity_score(left: str, right: str) -> int:
        # Scale before rounding so float error cannot truncate e.g. 0.58 * 100 to 57
        return int(round(difflib.SequenceMatcher(a=left, b=right).ratio() * 100))
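
For reference, here is a small sketch of the two difflib calls the class relies on, run on names from the question. The `candidates` list is only an illustration and is not part of the class.

```python
import difflib

# Illustrative candidate list taken from the question's company names
candidates = ["Deloitte", "PriceWaterhouseCoopers", "KPMG", "Ernst & Young"]

# get_close_matches returns up to n candidates scoring at least `cutoff`, best first
close = difflib.get_close_matches("Deloite", candidates, n=1, cutoff=0.8)

# SequenceMatcher.ratio() is 2 * M / T, where M is the number of matched
# characters and T is the combined length of both strings
ratio = difflib.SequenceMatcher(a="Deloite", b="Deloitte").ratio()

print(close)            # ['Deloitte']
print(round(ratio, 3))  # 0.933
```

A name with no candidate at or above the cutoff yields an empty list, which is why `get_closest_match` falls back to None.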

Call the class like this:

df = FuzzyMerge(left=df1, right=df2, left_on="column from df1", right_on="column from df2", how="inner", cutoff=0.8).merge()
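
If you want the full table of every combination with its score, as sketched in the question, one possible approach is a cross join followed by a per-pair score. This is a minimal sketch rather than part of the class above. It assumes pandas 1.2 or later (for `how="cross"`), uses difflib's ratio as the similarity measure instead of the token-frequency function from the question, and the helper name `score` is made up for illustration.

```python
import difflib

import pandas as pd

mwe1 = pd.DataFrame({
    "company_name": ["Deloitte", "PriceWaterhouseCoopers", "KPMG",
                     "Ernst & Young", "intentionall typo company XYZ"],
    "revenue": [100, 200, 300, 250, 400],
})
mwe2 = pd.DataFrame({
    "salesforce_name": ["Deloite", "PriceWaterhouseCooper"],
    "CEO": ["John", "Jane"],
})


def score(a: str, b: str) -> int:
    """Hypothetical helper: similarity as a whole-number percentage (0-100)."""
    return round(difflib.SequenceMatcher(a=a, b=b).ratio() * 100)


# Cross join: one row per (company_name, salesforce_name) combination
pairs = mwe1.merge(mwe2, how="cross")
pairs["similarity_score"] = [
    score(a, b) for a, b in zip(pairs["company_name"], pairs["salesforce_name"])
]

# Keep the highest-scoring company for each salesforce_name
best = pairs.loc[pairs.groupby("salesforce_name")["similarity_score"].idxmax()]
best = best.reset_index(drop=True)
print(best)
```

Keep in mind that a cross join of 50,000 by 5,000 names produces 250 million rows, so on the full data you would probably need to score in chunks or deduplicate the names first.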

Solution 1
0 2022-11-30 21:41:34
