Finding similarity between two DataFrames of different lengths

Question

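I have two Pandas DataFrames of different lengths. DF1 has about 1.2 million rows (and a single column), DF2 has about 300,000 rows (also a single column), and I am trying to find similar items across the two lists.

DF1 is about 75% company names and 25% people, while DF2 is the reverse, but both are alphanumeric. What I want is a function that highlights the most similar items from the two lists, ranked by a score (or percentage). For example,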

Apple -> Apple Inc. (0.95) 
Apple -> Applebees (0.68)
Banana Boat -> Banana Bread (0.25)

So far I have tried two approaches, and both have failed.

Approach 1: Find the Jaccard coefficient of the two lists.

import numpy as np
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(df_1, df_2)

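This does not work, probably because the two DataFrames have different lengths, and I get this error:

ValueError: Found arrays with inconsistent numbers of samples

Approach 2: Use a sequence matcher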

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

and then calling it on the DataFrames:

similar(df_1, df_2)

This results in the error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()

KeyError: 0

How can I solve this problem?

Answer 1

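Solution

I had to install the distance module because that was quicker than figuring out how to use jaccard_similarity_score in this context. I was unable to reproduce your numbers with that function.

Install `distance`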

pip install distance

Use `distance`

import distance

jd = lambda x, y: 1 - distance.jaccard(x, y)
df_1.head().iloc[:, 0].apply(lambda x: df_2.head().iloc[:, 0].apply(lambda y: jd(x, y)))

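The head() calls are there for your protection. I am pretty sure that removing them will blow up your computer, because the result would be a 1.2M x 0.3M matrix.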
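To put a number on that: 1.2M x 0.3M is 3.6 x 10^11 scores, or roughly 2.9 TB at 8 bytes per float. One way to avoid holding the full matrix is to keep only the best few candidates per item. The following is a minimal sketch of that idea, not a tested solution for the full data. It uses plain Python lists (for example from `df_1.iloc[:, 0].tolist()`), and it assumes that a character-set Jaccard similarity is an acceptable stand-in for `1 - distance.jaccard(x, y)`. The names `jaccard_similarity` and `top_matches` are hypothetical.

```python
import heapq


def jaccard_similarity(a, b):
    """Jaccard similarity of the character sets of two strings (assumed metric)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:  # both strings are empty
        return 1.0
    return len(set_a & set_b) / len(union)


def top_matches(left, right, k=3):
    """Yield (item, [(score, candidate), ...]) keeping only the k best candidates."""
    for item in left:
        # Generator expression: scores are produced one at a time, never stored in full.
        scored = ((jaccard_similarity(item, cand), cand) for cand in right)
        yield item, heapq.nlargest(k, scored)


# Tiny illustrative inputs taken from the question's examples.
left = ["Apple", "Banana Boat"]
right = ["Apple Inc.", "Applebees", "Banana Bread"]
results = dict(top_matches(left, right, k=2))

for item, matches in results.items():
    for score, candidate in matches:
        print(f"{item} -> {candidate} ({score:.2f})")
```

This bounds memory but not time: every pair is still compared, so on the full data it would remain very slow without some way of narrowing the candidates first.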

Try this. I am not quite sure what exactly you want. We can adjust as you clarify.
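On Approach 2 from the question: SequenceMatcher compares two sequences element by element, so it needs to be given individual strings rather than whole DataFrames. The KeyError: 0 most likely comes from the matcher indexing the DataFrame with 0, which pandas treats as a column label. Below is a minimal sketch of calling `similar` per pair of strings and ranking the results. The helper `rank_pairs` is a hypothetical name, and the inputs are small lists rather than the full data.

```python
from difflib import SequenceMatcher


def similar(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def rank_pairs(left, right):
    """Score every (left, right) pair of strings and sort by score, best first."""
    pairs = [(similar(a, b), a, b) for a in left for b in right]
    return sorted(pairs, reverse=True)


ranked = rank_pairs(["Apple", "Banana Boat"],
                    ["Apple Inc.", "Applebees", "Banana Bread"])

for score, a, b in ranked:
    print(f"{a} -> {b} ({score:.2f})")
```

The scores will not match the numbers in the question, since those were only illustrative, and the same caution about head() applies: this builds every pair, so it is only suitable for small samples.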

Answer 2

Or limit the comparison to items at the same element position.

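import distance
import pandas as pd

# Jaccard similarity = 1 - Jaccard distance
jd = lambda x, y: 1 - distance.jaccard(x, y)

# Align the first column of each frame by index. join='inner' keeps only the
# positions present in both, since the default outer join would pad the shorter
# column with NaN, which is not a string.
test_df = pd.concat([df.iloc[:, 0] for df in [df_1, df_2]], axis=1, keys=['one', 'two'], join='inner')
test_df.apply(lambda x: jd(x['one'], x['two']), axis=1)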
