Apply fuzzy string matching of two columns in two Pandas dataframes while preserving a similarity score and output a Pandas DataFrame
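I have two dataframes that I want to merge on company name, which acts as the primary key in one and the foreign key in the other. One dataset has about 50,000 unique company names and the other has about 5,000. Each list can contain duplicate company names.
To do this, I tried to follow the first solution in "Figure out if a business name is very similar to another - Python". Here is an MWE: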
import pandas as pd

mwe1 = pd.DataFrame({'company_name': ['Deloitte',
                                      'PriceWaterhouseCoopers',
                                      'KPMG',
                                      'Ernst & Young',
                                      'intentionall typo company XYZ'
                                      ],
                     'revenue': [100, 200, 300, 250, 400]
                     }
                    )
mwe2 = pd.DataFrame({'salesforce_name': ['Deloite',
                                         'PriceWaterhouseCooper'
                                         ],
                     'CEO': ['John', 'Jane']
                     }
                    )
I am trying to get the following code from "Figure out if a business name is very similar to another - Python" to work:
# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    return sum(1/token2frequency(t)**0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a.split())
    b_tokens = set(b.split())
    a_uniq = sequence_uniqueness(a_tokens)
    b_uniq = sequence_uniqueness(b_tokens)
    return sequence_uniqueness(a.intersection(b))/(a_uniq * b_uniq) ** 0.5
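A note on the snippet as quoted: it would not run as written, because sequence_uniqueness is called without its token2frequency argument, token2frequency(t) calls the counter instead of indexing it, and a.intersection(b) is applied to the strings rather than to the token sets. Below is a minimal corrected sketch. It assumes token2frequency is a collections.Counter built from every name in the dataset. The helper build_token_counts, the sample names, and the choice to treat unseen tokens as having count 1 are illustrative assumptions, not part of the linked answer.

```python
from collections import Counter


def build_token_counts(names):
    """Hypothetical helper: count each whitespace-separated token across all names."""
    return Counter(token for name in names for token in name.split())


def sequence_uniqueness(tokens, token2frequency):
    # Rare tokens weigh more than common ones: each contributes 1 / sqrt(count).
    # Assumption: a token missing from the counter is treated as count 1.
    return sum(1 / token2frequency.get(t, 1) ** 0.5 for t in tokens)


def name_similarity(a, b, token2frequency):
    a_tokens = set(a.split())
    b_tokens = set(b.split())
    a_uniq = sequence_uniqueness(a_tokens, token2frequency)
    b_uniq = sequence_uniqueness(b_tokens, token2frequency)
    if a_uniq == 0 or b_uniq == 0:
        return 0.0  # an empty name has no tokens to compare
    shared = sequence_uniqueness(a_tokens & b_tokens, token2frequency)
    return shared / (a_uniq * b_uniq) ** 0.5


names = ["Ernst & Young", "Ernst Young LLP", "KPMG LLP"]
counts = build_token_counts(names)
print(name_similarity("Ernst & Young", "Ernst Young LLP", counts))
print(name_similarity("Ernst & Young", "KPMG LLP", counts))
```

Because this score compares whole tokens, a misspelling such as Deloite against Deloitte shares no tokens and scores 0, so it may need to be combined with a character-level measure for typo-heavy data.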
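How can I apply these two functions to produce a similarity score between every possible combination of mwe1 and mwe2, and then filter down to the most probable matches?
For example, I am looking for something like this (I am just making up the scores in the similarity_score column):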
company_name                   revenue  salesforce_name        CEO   similarity_score
Deloitte                       100      Deloite                John  98
PriceWaterhouseCoopers         200      Deloite                John  0
KPMG                           300      Deloite                John  15
Ernst & Young                  250      Deloite                John  10
intentionall typo company XYZ  400      Deloite                John  2
Deloitte                       100      PriceWaterhouseCooper  Jane  20
PriceWaterhouseCoopers         200      PriceWaterhouseCooper  Jane  97
KPMG                           300      PriceWaterhouseCooper  Jane  5
Ernst & Young                  250      PriceWaterhouseCooper  Jane  7
intentionall typo company XYZ  400      PriceWaterhouseCooper  Jane  3
I am also open to a better end state if you can think of one. I would then filter the table above to get something like this:
company_name            revenue  salesforce_name        CEO   similarity_score
Deloitte                100      Deloite                John  98
PriceWaterhouseCoopers  200      PriceWaterhouseCooper  Jane  97
Here is what I have tried:
name_similarity(a = mwe1['company_name'], b = mwe2['salesforce_name'], token2frequency = 10)
AttributeError: 'Series' object has no attribute 'split'
I am familiar with lambda functions, but I am not sure how to make one work when iterating over two columns in two Pandas dataframes.
Here is a class I wrote using difflib that should be close to what you need.
import difflib
from typing import Optional

import pandas as pd


class FuzzyMerge:
    """
    Works like pandas merge except merges on approximate matches.
    """

    def __init__(self, **kwargs):
        self.left = kwargs.get("left")
        self.right = kwargs.get("right")
        self.left_on = kwargs.get("left_on")
        self.right_on = kwargs.get("right_on")
        self.how = kwargs.get("how", "inner")
        self.cutoff = kwargs.get("cutoff", 0.8)

    def merge(self) -> pd.DataFrame:
        temp = self.right.copy()
        # For each right-hand key, look up the closest left-hand key (None if nothing clears the cutoff).
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        df = self.left.merge(temp, on=self.left_on, how=self.how)
        df["similarity_percent"] = df.apply(lambda x: self.similarity_score(x[self.left_on], x[self.right_on]), axis=1)
        return df

    def get_closest_match(self, left: str, right: pd.Series) -> Optional[str]:
        matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)
        return matches[0] if matches else None

    @staticmethod
    def similarity_score(left: str, right: str) -> Optional[int]:
        # Rows left unmatched by a non-inner merge hold NaN/None, which difflib cannot compare.
        if not isinstance(left, str) or not isinstance(right, str):
            return None
        # Scale to a percentage before rounding so float error cannot truncate the result.
        return int(round(difflib.SequenceMatcher(a=left, b=right).ratio() * 100))
Call it like this:
df = FuzzyMerge(left=df1, right=df2, left_on="column from df1", right_on="column from df2", how="inner", cutoff=0.8).merge()
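For the all-pairs table described in the question, one possible approach is a cross join followed by a scalar scoring function applied pair by pair. This also avoids the AttributeError from the question, since each call receives two strings rather than two Series. The sketch below assumes pandas 1.2 or later (for how="cross") and swaps in difflib.SequenceMatcher as the scorer. The helper ratio_percent and the threshold of 80 are illustrative choices, not part of the class above.

```python
import difflib

import pandas as pd

mwe1 = pd.DataFrame({
    "company_name": ["Deloitte", "PriceWaterhouseCoopers", "KPMG",
                     "Ernst & Young", "intentionall typo company XYZ"],
    "revenue": [100, 200, 300, 250, 400],
})
mwe2 = pd.DataFrame({
    "salesforce_name": ["Deloite", "PriceWaterhouseCooper"],
    "CEO": ["John", "Jane"],
})


def ratio_percent(a: str, b: str) -> int:
    """Hypothetical helper: character-level similarity from difflib, scaled to 0-100."""
    return round(difflib.SequenceMatcher(a=a, b=b).ratio() * 100)


# Every (company_name, salesforce_name) pair: 5 x 2 = 10 rows.
pairs = mwe1.merge(mwe2, how="cross")
pairs["similarity_score"] = [
    ratio_percent(a, b)
    for a, b in zip(pairs["company_name"], pairs["salesforce_name"])
]

# Keep the best-scoring company for each salesforce_name, then apply a threshold.
best = pairs.loc[pairs.groupby("salesforce_name")["similarity_score"].idxmax()]
best = best[best["similarity_score"] >= 80].reset_index(drop=True)
print(best)
```

At the sizes mentioned in the question, a full cross join would produce roughly 50,000 x 5,000 = 250 million pairs, so in practice it may be worth deduplicating names first or scoring one salesforce_name at a time instead of materializing every pair.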