Is it possible to do a fuzzy match merge with python pandas?

Question

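I have two DataFrames that I want to merge based on a column. However, because of alternate spellings, differing numbers of spaces, and the absence or presence of diacritical marks, I would like to be able to merge as long as the values are similar to one another.

Any similarity algorithm will do (soundex, Levenshtein, difflib's).

Say one DataFrame has the following data: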

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

Then I want to get the resulting DataFrame

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

Answer 1

Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e


If these were columns, in the same vein you could apply this to the column and then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
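Since the question also mentions differing whitespace and diacritics, it may help to normalise the strings before calling get_close_matches. The following is only a sketch: the helper names normalize and closest_normalized are made up for illustration, and the choice of NFKD decomposition plus lowercasing is an assumption about what "similar" should mean for your data.

```python
import difflib
import unicodedata

def normalize(text):
    """Lowercase, strip diacritics, and collapse runs of whitespace."""
    # NFKD splits an accented character into a base character plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    no_marks = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(no_marks.lower().split())

def closest_normalized(x, candidates, cutoff=0.6):
    """Return the original candidate whose normalised form is closest to x, or None."""
    # If two candidates normalise to the same string, the later one wins here
    lookup = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize(x), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

print(closest_normalized('  São   Paulo ', ['Sao Paulo', 'Santiago']))
```

The returned value is the untouched candidate string, so it can be written back into the key column and merged on as in the snippets above.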

Answer 2

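Using `fuzzywuzzy`

Since there are no examples with the fuzzywuzzy package, here is a function I wrote which returns all matches above a threshold that you can set:

Example dataframes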

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

# df1
          Key
0       Apple
1      Banana
2      Orange
3  Strawberry

# df2
        Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry

Function for fuzzy matching

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m
    
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    
    return df_1

Using our function on the dataframes: #1

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry

Using our function on the dataframes: #2

df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})

fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)

        Col1  matches
0  Microsoft  Mcrsoft
1     Google    gogle
2     Amazon   Amason
3        IBM

Installation:

Pip

pip install fuzzywuzzy

Anaconda

conda install -c conda-forge fuzzywuzzy
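To see how the threshold and limit parameters of the function above interact, here is a rough standard-library sketch. It scores with difflib.SequenceMatcher scaled to 0-100, which is an assumption made for illustration and is not the scorer that process.extract uses by default, so the numbers will differ from fuzzywuzzy's (partial matches such as Straw/Berry in particular). The name top_matches is hypothetical.

```python
import difflib

def top_matches(query, choices, threshold=80, limit=2):
    """Return up to `limit` (choice, score) pairs scoring at least `threshold` (0-100)."""
    scored = [
        (choice, round(100 * difflib.SequenceMatcher(None, query, choice).ratio()))
        for choice in choices
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    # Keep the `limit` best candidates, then drop those below the threshold
    return [pair for pair in scored[:limit] if pair[1] >= threshold]

choices = ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']
print(top_matches('Apple', choices))
```

As in the answer's function, limit caps how many candidates are considered per key and threshold then filters them, so a key can end up with zero, one, or several matches.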

Answer 3

I have written a Python package which aims to solve this problem:

pip install fuzzymatcher

You can find the repo here and docs here.

Basic usage:

Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:

from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
link_table(df_left, df_right, left_on, right_on)

Or if you just want to link on the closest match:

fuzzy_left_join(df_left, df_right, left_on, right_on)

Answer 4

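I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

This is how I would do it with Jaro-Winkler from the jellyfish package: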

import jellyfish
import pandas

def get_closest_match(x, list_strings):

  best_match = None
  highest_jw = 0

  for current_string in list_strings:
    # Named jaro_winkler in older jellyfish releases
    current_score = jellyfish.jaro_winkler_similarity(x, current_string)

    if(current_score > highest_jw):
      highest_jw = current_score
      best_match = current_string

  return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)

Output:

    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e

Answer 5

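http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. Would be nice though...

I would just do a separate step: use difflib get_close_matches to create a new column in one of the 2 dataframes, then merge/join on the fuzzy matched column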
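A minimal sketch of that two-step approach, assuming the keys live in a column called name (the frames, the name_matched column and the closest helper are made up for illustration):

```python
import difflib
import pandas as pd

df1 = pd.DataFrame({'name': ['one', 'two', 'three', 'four', 'five'],
                    'number': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'name': ['one', 'too', 'three', 'fours', 'five'],
                    'letter': ['a', 'b', 'c', 'd', 'e']})

def closest(x, candidates):
    """Best difflib match for x among candidates, or None when nothing is close."""
    matches = difflib.get_close_matches(x, candidates, n=1)
    return matches[0] if matches else None

# Step 1: add a new column holding the fuzzy-matched key (the original column is kept)
candidates = df1['name'].tolist()
df2['name_matched'] = df2['name'].map(lambda x: closest(x, candidates))

# Step 2: an ordinary merge on the matched column
merged = df1.merge(df2, left_on='name', right_on='name_matched',
                   how='left', suffixes=('', '_df2'))
print(merged[['name', 'number', 'name_df2', 'letter']])
```

Keeping the original spelling in its own column makes it easy to eyeball which rows were matched fuzzily before trusting the merge.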

Answer 6

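For a general approach: `fuzzy_merge`

For a more general scenario in which we want to merge columns from two dataframes that contain slightly different strings, the following function uses difflib.get_close_matches along with merge in order to mimic the functionality of pandas' merge, but with fuzzy matching: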

import difflib 

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other= df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff) 
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None

Here are some use cases with two sample dataframes:

print(df1)

     key   number
0    one       1
1    two       2
2  three       3
3   four       4
4   five       5

print(df2)

                 key_close  letter
0                    three      c
1                      one      a
2                      too      b
3                    fours      d
4  a very different string      e

With the above example, we'd get:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d

We could do a left join with:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
4   five       5       NaN    NaN

For a right join, all non-matching keys from the left dataframe are set to None:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')

     key  number                key_close letter
0    one     1.0                      one      a
1    two     2.0                      too      b
2  three     3.0                    three      c
3   four     4.0                    fours      d
4   None     NaN  a very different string      e

Also note that difflib.get_close_matches will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in df2 to, say:

print(df2)

                          letter
one                          a
too                          b
three                        c
fours                        d
a very different string      e

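We'd get an index out of range error:

df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

IndexError: list index out of range

To solve this, the above function get_closest_match indexes the list returned by difflib.get_close_matches only if it actually contains a match, and returns None otherwise.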
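For anyone who wants to reproduce the outputs in this answer, the sample frames could be built as below. The constructor calls are my reconstruction from the printed tables, and the two functions are repeated from the answer so the snippet runs on its own.

```python
import difflib
import pandas as pd

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other = df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

# Sample frames reconstructed from the printed tables
df1 = pd.DataFrame({'key': ['one', 'two', 'three', 'four', 'five'],
                    'number': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'key_close': ['three', 'one', 'too', 'fours',
                                  'a very different string'],
                    'letter': ['c', 'a', 'b', 'd', 'e']})

inner = fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
left = fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
print(inner)
print(left)
```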

Answer 7

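I have used the fuzzymatcher package and it worked well for me. Visit this link for more details.

Install using the following command

pip install fuzzymatcher

Below is the sample code (already submitted by RobinL above)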

from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
link_table(df_left, df_right, left_on, right_on)

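Errors you may get

ZeroDivisionError: float division by zero ---> refer to this link to resolve it
OperationalError: No Such Module:fts4 --> download sqlite3.dll from here and replace the DLL file in your python or anaconda DLLs folder.

Pros:

Works faster. In my case, I compared one dataframe with 3000 rows against another dataframe with 170,000 records. It also uses SQLite3 full-text search, so it is faster than many alternatives.
Can check across multiple columns and 2 dataframes. In my case, I was looking for the closest match based on address and company name. Sometimes the company name might be the same, but the address is a good thing to check too.
Gives you a score for all the closest matches for the same record. You choose the cutoff score.

Cons:

The original package installation is buggy
Requires C++ and Visual Studio to be installed too
Won't work for 64-bit anaconda/Python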

Answer 8

There is a package called fuzzy_pandas that can use the levenshtein, jaro, metaphone and bilenko methods. There are some great examples here

import pandas as pd
import fuzzy_pandas as fpd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

results = fpd.fuzzy_merge(df1, df2,
            left_on='Key',
            right_on='Key',
            method='levenshtein',
            threshold=0.6)

results.head()

  Key    Key
0 Apple  Aple
1 Banana Bannanna
2 Orange Orag

Answer 9

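Heads up: this basically works, except if no match is found, or if you have NaNs in either column. Instead of applying get_close_matches directly, I found it easier to apply the following function. The choice of NaN replacements will depend a lot on your dataset.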

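import difflib

import numpy as np
import pandas as pd

def fuzzy_match(a, b):
    # Replace NaNs with placeholders that cannot match each other
    left = '1' if pd.isnull(a) else a
    right = b.fillna('2')
    out = difflib.get_close_matches(left, right)
    return out[0] if out else np.nan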
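A possible way to apply it, sketched with two made-up Series (names and queries are illustrative only): each value of one column is matched against the whole other column, and both missing inputs and inputs without a close match come back as NaN.

```python
import difflib

import numpy as np
import pandas as pd

def fuzzy_match(a, b):
    # Same function as above, repeated so the example runs on its own
    left = '1' if pd.isnull(a) else a
    right = b.fillna('2')
    out = difflib.get_close_matches(left, right)
    return out[0] if out else np.nan

names = pd.Series(['one', 'two', None, 'four'])      # column to match against
queries = pd.Series(['too', None, 'fours', 'xyz'])   # column to be matched

matched = queries.apply(lambda q: fuzzy_match(q, names))
print(matched)
```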

Answer 10

You can use d6tjoin

import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
       fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']

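   index  number index_right letter
0    one       1         one      a
1    two       2         too      b
2  three       3       three      c
3   four       4       fours      d
4   five       5        five      e

It has a variety of additional features such as:

check join quality, pre and post join
customize the similarity function, e.g. edit distance vs hamming distance
specify a max distance
multi-core compute

For more details see

MergeTop1 examples - best match join examples notebook
PreJoin examples - examples for diagnosing join problems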

Answer 11

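Using `thefuzz`

Using SeatGeek's great package thefuzz, which makes use of Levenshtein distance. This works with data held in columns. It adds matches as rows rather than columns, to preserve a tidy dataset, and allows additional columns to be easily pulled through to the output dataframe.

Sample data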

df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})

    col_a   col_b
0   one     1
1   two     2
2   three   3
3   four    4
4   five    5

df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})

    col_a   col_b
0   one     a
1   too     b
2   three   c
3   fours   d
4   five    e

Function used to do the matching

def fuzzy_match(
    df_left, df_right, column_left, column_right, threshold=90, limit=1
):
    # Create a series
    series_matches = df_left[column_left].apply(
        lambda x: process.extract(x, df_right[column_right], limit=limit)            # Creates a series with id from df_left and column name _column_left_, with _limit_ matches per item
    )

    # Convert matches to a tidy dataframe
    df_matches = series_matches.to_frame()
    df_matches = df_matches.explode(column_left)     # Convert list of matches to rows
    df_matches[
        ['match_string', 'match_score', 'df_right_id']
    ] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index)       # Convert match tuple to columns
    df_matches.drop(column_left, axis=1, inplace=True)      # Drop column of match tuples

    # Reset index, as in creating a tidy dataframe we've introduced multiple rows per id, so that no longer functions well as the index
    if df_matches.index.name:
        index_name = df_matches.index.name     # Stash index name
    else:
        index_name = 'index'        # Default used by pandas
    df_matches.reset_index(inplace=True)
    df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True)       # The previous index has now become a column: rename for ease of reference

    # Drop matches below threshold
    df_matches.drop(
        df_matches.loc[df_matches['match_score'] < threshold].index,
        inplace=True
    )

    return df_matches

Use the function and merge data

import pandas as pd
from thefuzz import process

df_matches = fuzzy_match(
    df1,
    df2,
    'col_a',
    'col_a',
    threshold=60,
    limit=1
)

df_output = df1.merge(
    df_matches,
    how='left',
    left_index=True,
    right_on='df_left_id'
).merge(
    df2,
    how='left',
    left_on='df_right_id',
    right_index=True,
    suffixes=['_df1', '_df2']
)

df_output.set_index('df_left_id', inplace=True)       # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table

df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']]      # Drop columns used in the matching
df_output.index.name = 'id'

id  col_a_df1   col_b_df1   col_b_df2
0   one         1           a
1   two         2           b
2   three       3           c
3   four        4           d
4   five        5           e

Tip: fuzzy matching using thefuzz is much quicker if you optionally install the python-Levenshtein package too.
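For readers unfamiliar with the Levenshtein distance mentioned above, here is a plain-Python sketch of the classic dynamic-programming definition. It only illustrates the underlying idea; it is not how thefuzz computes its 0-100 scores, and the function name is my own.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein('too', 'two'), levenshtein('fours', 'four'))
```

Both misspellings in the sample data are a single edit away from their intended keys, which is why even a fairly strict threshold still pairs them up.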

Answer 12

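I have used fuzzywuzzy in a very minimal way, whilst matching the existing behaviour and keywords of merge in pandas.

Just specify your accepted threshold for matching (between 0 and 100):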

from fuzzywuzzy import process

def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):
    
    def fuzzy_apply(x, df, column, threshold=threshold):
        if type(x)!=str:
            return None
        
        match, score, *_ = process.extract(x, df[column], limit=1)[0]
            
        if score >= threshold:
            return match

        else:
            return None
    
    if on is not None:
        left_on = on
        right_on = on

    # create temp column as the best fuzzy match (or None!)
    df2['tmp'] = df2[right_on].apply(
        fuzzy_apply, 
        df=df, 
        column=left_on, 
        threshold=threshold
    )

    merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')
    
    del merged_df['tmp']
    
    return merged_df

Try it out using the example data:

import pandas as pd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})

df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

fuzzy_merge(df1, df2, on='Key', threshold=80)

Answer 13

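For more complex use cases where rows have to be matched on many columns, you can use the recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas data frames, which helps to deduplicate your data when merging. I have written a detailed article about the package here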

Answer 14

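If the join axis is numeric this could also be used to match indexes with a specified tolerance:

import numpy as np
import pandas as pd

def fuzzy_left_join(df1, df2, tol=None):
    index1 = df1.index.values
    index2 = df2.index.values

    diff = np.abs(index1.reshape((-1, 1)) - index2)
    mask_j = np.argmin(diff, axis=1)  # for each df1 row, position of the nearest df2 index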
    mask_i = np.arange(mask_j.shape[0])

    df1_ = df1.iloc[mask_i]
    df2_ = df2.iloc[mask_j]

    if tol is not None:
        mask = np.abs(df2_.index.values - df1_.index.values) <= tol
        df1_ = df1_.loc[mask]
        df2_ = df2_.loc[mask]

    df2_.index = df1_.index

    out = pd.concat([df1_, df2_], axis=1)
    return out
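For comparison, pandas itself ships pd.merge_asof, which can do a nearest-key join with a tolerance when both frames are sorted on the key. This is a swapped-in alternative rather than the function above, and the frames and column names below are made up for illustration. Unlike the function above, rows without a key inside the tolerance are kept with NaN instead of being dropped.

```python
import pandas as pd

left = pd.DataFrame({'t': [1.0, 2.0, 3.0], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'t': [1.1, 2.6, 2.95], 'b': [10, 20, 30]})

# Both frames must be sorted on the key; direction='nearest' picks the closest
# right-hand key, and tolerance discards pairs that are further apart than 0.2
out = pd.merge_asof(left, right, on='t', direction='nearest', tolerance=0.2)
print(out)
```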

Answer 15

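TheFuzz is the new version of fuzzywuzzy

In order to fuzzy-join string elements in two big tables you can do this:

Use apply to go row by row
Use swifter to parallelize, speed up and visualize the default apply function (with a colored progress bar)
Use OrderedDict from collections to get rid of duplicates in the merge output and keep the initial order
Increase limit in thefuzz.process.extract to see more options for the merge (stored in a list of tuples with % of similarity)

'*' You can use thefuzz.process.extractOne instead of thefuzz.process.extract to return just one best-matched item (without specifying any limit). However, be aware that several results could have the same % of similarity and you will get only one of them.

'**' Somehow swifter takes a minute or two before starting the actual apply. If you need to process small tables you can skip this step and just use progress_apply instead

from thefuzz import process
from collections import OrderedDict
import swifter

def match(x):
    matches = process.extract(x, df1, limit=6)
    # De-duplicate the matches while keeping their original order
    matches = list(OrderedDict((x, True) for x in matches).keys())
    print(f'{x:20}: {matches}')
    return str(matches)

df1 = df['name'].values
df2['matches'] = df2['name'].swifter.apply(lambda x: match(x))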
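A small aside on the de-duplication step: since Python 3.7 a plain dict preserves insertion order, so dict.fromkeys gives the same result as the OrderedDict idiom. The match tuples below are invented sample data, not real thefuzz output.

```python
from collections import OrderedDict

# Made-up (match, score) tuples with duplicates, in the order a scorer returned them
matches = [('Apple', 90), ('Aple', 86), ('Apple', 90), ('Maple', 60), ('Aple', 86)]

via_ordered_dict = list(OrderedDict((m, True) for m in matches).keys())
via_dict = list(dict.fromkeys(matches))  # same idea with a plain dict

print(via_dict)
```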

15 solutions

Solution 1: 96, accepted, 2012-12-03 10:06:04
Solution 2: 53, 2019-05-26 16:42:53
Solution 3: 21, 2017-12-02 09:15:43
Solution 4: 13, 2016-05-29 01:54:05
Solution 5: 6, 2012-11-30 19:56:02
Solution 6: 5, 2020-03-28 23:56:11
Solution 7: 4, 2019-07-12 20:18:11
Solution 8: 3, 2020-03-11 10:43:18
Solution 9: 2, 2014-08-07 18:33:26
Solution 10: 2, 2018-08-15 13:00:17
Solution 11: 2, 2022-03-28 10:59:43
Solution 12: 1, 2021-03-27 15:48:56
Solution 13: 0, 2020-11-19 06:23:56
Solution 14: 0, 2021-02-27 18:11:10
Solution 15: 0, 2022-08-07 19:49:43
