Find the 2 lists with the most elements in common from a large set of Python lists - Python / Pandas

Question

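Problem:

I have many Python lists of varying lengths, all containing only integers.

How can I find the 2 lists that have the most elements in common?

Example input: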

list1 = [234, 982, 908, 207, 456, 284, 473]
list2 = [845, 345, 765, 678]
list3 = [120, 542, 764, 908, 217, 778, 999, 326, 456]

# thousands more lists ...

Example output:

# 2 most similar lists:

list400, list6734

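Note: I am not looking for the most common element across the lists, only for the 2 most similar lists, i.e. the pair with the most elements in common. I also don't care about similarity between individual elements: an element is either found in both lists or it isn't.

Context:

I have a dataset representing which users liked which posts on a platform. If 2 users both liked the same post, their common_like_score is 1. If 2 users liked 10 of the same posts, their common_like_score is 10. My goal is to find the 2 users with the highest common_like_score.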
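As a minimal sketch of this definition, assuming each user's likes are held in a plain list of post ids (the function name and the sample values below are illustrative only):

```python
def common_like_score(likes_a, likes_b):
    """Number of distinct posts that appear in both users' likes."""
    return len(set(likes_a) & set(likes_b))

# Hypothetical likes for two users
user_a_likes = [234, 789, 267, 908]
user_b_likes = [908, 734, 123, 234]

score = common_like_score(user_a_likes, user_b_likes)
print(score)  # 2 (posts 234 and 908)
```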

I have loaded the data (from a CSV) into a pandas dataframe, where each row represents a like between a user and a post:

index	user id	post id
0	201	234
1	892	908
2	300	825

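and so on for thousands of rows.

My approach was to group the data by user id, then concatenate each user's posts into a single row/column as a comma-separated string: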

df = df[df['user id'].duplicated(keep=False)]
df['post id'] = df['post id'].astype(str)
df['liked posts'] = df.groupby('user id')['post id'].transform(lambda x: ','.join(x))
df = df.drop(columns=['post id'])
df = df[['user id', 'liked posts']].drop_duplicates()

The resulting dataframe:

index	user id	liked posts
0	201	234,789,267, ...
1	892	908,734,123, ...
2	300	825,456,765, ...

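and so on, with one row per unique user id...

So my question is: I need to find which rows of the dataframe have the most numbers in common in the liked posts column, in order to return the 2 users with the highest common_like_score.

If there is a better solution to the overall problem, please share it as well.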

Answer 1

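Since these are variable-length lists, pandas (which prefers columns of similar size) may not be the best tool. In plain Python, you can convert the lists to sets and then use set-intersection counts to find the pair with the most items in common. Note that this counts distinct elements: if a list contains multiple copies of the same integer, the set will differ from the list.

The code I came up with is a bit complicated because it does the intermediate conversion to sets only once. But I think it's readable... (I hope).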

import itertools

list1 = [234, 982, 908, 207, 456, 284, 473]
list2 = [845, 345, 765, 678]
list3 = [120, 542, 764, 908, 217, 778, 999, 326, 456]
lists = [list1, list2, list3]

# intermediate sets for comparison
sets = [(set(l),i) for i,l in enumerate(lists)]

# list of in-common count for each combo of two "sets"
combos = [(len(a[0] & b[0]), a, b) for a,b in itertools.combinations(sets, 2)]

# most in-common "sets"; compare by count only, so ties never
# fall back to comparing the sets themselves
most = max(combos, key=lambda c: c[0])[1:]

# dereferenced
most1, most2 = lists[most[0][1]], lists[most[1][1]]

print(most1, most2)
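Comparing every pair of lists is quadratic in the number of lists. When most pairs share nothing, an inverted index can skip those pairs. The sketch below is an alternative to the code above, not a restatement of it; the name `most_similar_pair` is made up for illustration, and it returns list indices rather than the lists themselves.

```python
from collections import Counter, defaultdict
from itertools import combinations

def most_similar_pair(lists):
    """Return ((i, j), shared_count) for the two lists sharing the most distinct elements."""
    # Inverted index: element -> indices of the lists that contain it
    holders = defaultdict(list)
    for i, items in enumerate(lists):
        for item in set(items):              # set() so duplicates count once
            holders[item].append(i)

    # Each element adds one to the count of every pair of lists holding it
    shared = Counter()
    for idxs in holders.values():
        for pair in combinations(idxs, 2):   # idxs is ascending, so i < j
            shared[pair] += 1

    if not shared:
        return None                          # no two lists share anything
    return shared.most_common(1)[0]

list1 = [234, 982, 908, 207, 456, 284, 473]
list2 = [845, 345, 765, 678]
list3 = [120, 542, 764, 908, 217, 778, 999, 326, 456]

print(most_similar_pair([list1, list2, list3]))  # ((0, 2), 2)
```

One caveat: an element held by k lists contributes k(k-1)/2 pairs, so a few extremely popular elements can make this slower than the direct comparison.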

Answer 2

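This seems like a perfect application of graph theory. If we picture every user and every post as a node, and connect each user to the posts they liked, we can apply a transformation to the graph to obtain a matrix that shows the common_like_score for all pairs of users at once. At that point, getting the highest-scoring pair should be trivial.

"Finding the number of paths of length n between two nodes in a graph" is a well-known, solved problem in graph theory.

Here are some good reference links for learning more about the theory:

Basically, if you represent this data as an adjacency matrix, you can square that matrix to get the like score for every pair of users! Entry (i, j) of the squared matrix counts the paths of length 2 between nodes i and j, and every length-2 path between two users goes through a post they both liked.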
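The block structure makes this explicit. As a sketch, order the nodes with users first and posts second, and let B be the users-by-posts 0/1 matrix of likes (B is notation introduced here, not part of the original answer):

```latex
M = \begin{pmatrix} 0 & B \\ B^{\top} & 0 \end{pmatrix},
\qquad
M^{2} = \begin{pmatrix} B B^{\top} & 0 \\ 0 & B^{\top} B \end{pmatrix},
\qquad
(B B^{\top})_{ij} = \sum_{p} B_{ip} B_{jp}
```

The sum counts the posts p liked by both user i and user j, which is exactly common_like_score. The diagonal entry for user i is the number of posts that user liked, which is why the diagonal has to be zeroed before looking for the maximum.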

Here is my implementation:

import numpy as np

# Create an adjacency matrix: users take the first rows/columns, posts the rest
users = df["user id"].unique()
posts = df["post id"].unique()

# Separate index maps, so a user id that equals a post id cannot collide
user_idx = {u: i for i, u in enumerate(users)}
post_idx = {p: len(users) + i for i, p in enumerate(posts)}

n = len(users) + len(posts)
M = np.zeros((n, n), dtype=int)

for user, post in zip(df["user id"], df["post id"]):
    M[user_idx[user], post_idx[post]] = 1
    M[post_idx[post], user_idx[user]] = 1

# Square the matrix to get scores of all users
scores = M @ M

# Slice matrix to only include users
user_count = len(users)
scores = scores[:user_count, :user_count]

# Remove paths between same user
for i in range(user_count):
    scores[i, i] = 0

print(scores)

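Running it produces a matrix that looks like this (without the user labels):

	User 1	User 2	User 3	User 4
User 1	0	2	1	0
User 2	2	0	1	1
User 3	1	1	0	0
User 4	0	1	0	0

At this point, finding the top-scoring user pair should be trivial (here the highest score is 2, shared by User 1 and User 2). As an added bonus, you also get the score for every pair of users and can query them at will.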
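One way to do that last step is sketched below, using the example matrix from the table above. The `user_labels` list is illustrative; with the implementation above, the `users` array would play that role.

```python
import numpy as np

# Score matrix from the example table (rows/columns: User 1..User 4)
scores = np.array([[0, 2, 1, 0],
                   [2, 0, 1, 1],
                   [1, 1, 0, 0],
                   [0, 1, 0, 0]])
user_labels = ["User 1", "User 2", "User 3", "User 4"]  # illustrative labels

# argmax works on the flattened array; unravel_index maps it back to (row, col)
i, j = np.unravel_index(np.argmax(scores), scores.shape)
best_pair = (user_labels[i], user_labels[j])
best_score = int(scores[i, j])

print(best_pair, best_score)  # ('User 1', 'User 2') 2
```

`np.argmax` returns the first maximum it encounters, so if several pairs tie, only one of them is reported.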

Answer 3

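Since you only care about the number of common elements, you can create dummy (indicator) variables and then use a dot product to get the number of shared elements for every pair of users. Then we find the maximum.

Sample data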

import pandas as pd
import numpy as np

df = pd.DataFrame({'user id': list('ABCD'),
                   'liked_posts': ['12,14,141', '12,14,141,151',
                                   '1,14,151,15,1511', '2,4,1411,141']})

Code

# Dummies for each post
arr = df['liked_posts'].str.get_dummies(sep=',')
#   1  12  14  141  1411  15  151  1511  2  4
#0  0   1   1    1     0   0    0     0  0  0
#1  0   1   1    1     0   0    1     0  0  0
#2  1   0   1    0     0   1    1     1  0  0
#3  0   0   0    1     1   0    0     0  1  1 

# Shared counts. By definition symmetric
# diagonal -> 0 makes sure we don't consider same user with themself.
arr = np.dot(arr, arr.T)
np.fill_diagonal(arr, 0)

df1 = pd.DataFrame(index=df['user id'], columns=df['user id'], data=arr)
#user id  A  B  C  D
#user id            
#A        0  3  1  1
#B        3  0  2  1
#C        1  2  0  0
#D        1  1  0  0

# Find user pair with the most shared (will only return 1st pair in case of Ties)
df1.stack().idxmax()
#('A', 'B')
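If the data is still in the original one-row-per-like form, the comma-joined string step can probably be skipped. A minimal sketch, assuming the columns are named 'user id' and 'post id' as in the question (the sample values are made up), builds the indicator matrix directly with `pd.crosstab`:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format likes, one row per (user, post) like
likes = pd.DataFrame({'user id': [201, 201, 201, 892, 892, 300],
                      'post id': [234, 908, 825, 908, 234, 825]})

# Users-by-posts indicator matrix; clip guards against duplicated like rows
indicators = pd.crosstab(likes['user id'], likes['post id']).clip(upper=1)

m = indicators.to_numpy()
shared = m @ m.T                 # shared[i, j] = posts liked by both users
np.fill_diagonal(shared, 0)      # ignore each user's match with themself

# Locate the largest entry and map it back to user ids
i, j = np.unravel_index(np.argmax(shared), shared.shape)
best_pair = (indicators.index[i], indicators.index[j])
print(best_pair, shared[i, j])
```

This still materialises a dense users-by-posts matrix, so for a very large number of users or posts a sparse representation may be needed instead.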

Answer 1
4 2021-05-13 16:49:42

Answer 2
2 2021-05-13 18:01:42

Answer 3
1 2021-05-13 16:58:53