
From a large group of Python lists, find the 2 lists which have the most elements in common - Python / Pandas

Question:

I have many python lists of varying length, all containing only integers.

How can I find the 2 lists that have the most elements in common?

example input:

list1 = [234, 982, 908, 207, 456, 284, 473]
list2 = [845, 345, 765, 678]
list3 = [120, 542, 764, 908, 217, 778, 999, 326, 456]

# thousands more lists ...

example output:

# 2 most similar lists:

list400, list6734

Note: I am NOT looking to find the most common elements across the lists, only which 2 lists are the most similar, i.e. which 2 have the most elements in common. I am also not concerned with the similarity of individual elements. An element is either found within 2 lists, or it isn't.

Context:

I have a data set representing which users have liked certain posts on a platform. If 2 users both like the same post, they have a common_like_score of 1. If 2 users like 10 of the same posts, they have a common_like_score of 10. My goal is to find the 2 users with the highest common_like_score.

I have loaded the data (from CSV) into a pandas data frame, each row represents a like between the user and the post:

index   user id   post id
0       201       234
1       892       908
2       300       825
etc. thousands more rows.

My approach has been to group the data by user id and then concatenate the post ids for each user into a single row / column as a comma-separated string:

df = df[df['user id'].duplicated(keep=False)]   # keep only users with more than one like
df['post id'] = df['post id'].astype(str)       # post ids as strings so they can be joined
df['liked posts'] = df.groupby('user id')['post id'].transform(lambda x: ','.join(x))
df = df.drop(columns=['post id'])
df = df[['user id', 'liked posts']].drop_duplicates()   # one row per user

The resulting dataframe:

index   user id   liked posts
0       201       234,789,267, ...
1       892       908,734,123, ...
2       300       825,456,765, ...

etc., where each user id now appears in a single row...

Hence my question - I need to find which rows of the data frame have the most numbers in common in the liked posts column, to return the 2 users with the highest common_like_score.

If there is a better solution to the overall problem, please also share it.

Answer 1:

Since these are variable-length lists, pandas (which likes to work with like-sized columns) may not be the best tool. In straight Python, you could convert the lists to sets, then use the size of the set intersection to find the pair with the most items in common. Note that this counts distinct common elements, which can differ from the list-based count if a list contains multiple copies of the same integer.

The code I came up with is a bit complex because it does the intermediate conversion to sets only once. But I think it's readable... (I hope).

import itertools

list1 = [234, 982, 908, 207, 456, 284, 473]
list2 = [845, 345, 765, 678]
list3 = [120, 542, 764, 908, 217, 778, 999, 326, 456]
lists = [list1, list2, list3]

# intermediate (set, original index) pairs for comparison, built once
sets = [(set(l), i) for i, l in enumerate(lists)]

# in-common count for each pair of "sets"; sort by the count only
combos = [(len(a[0] & b[0]), a, b) for a, b in itertools.combinations(sets, 2)]
combos.sort(key=lambda c: c[0])

# most in-common "sets"
most = combos[-1][1:]

# dereferenced
most1, most2 = lists[most[0][1]], lists[most[1][1]]

print(most1, most2)
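
If only the single most-similar pair is needed, a `max` with a key avoids building and sorting the full list of combinations; a minimal variant of the same idea, reusing the `sets` and `lists` defined above:

# pick the pair of (set, index) tuples with the largest intersection
best = max(itertools.combinations(sets, 2), key=lambda pair: len(pair[0][0] & pair[1][0]))
most1, most2 = lists[best[0][1]], lists[best[1][1]]
print(most1, most2)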

Answer 2:

This seems to be a perfect application for Graph Theory. If we imagine each user and post as a node and connect each user to the posts they have liked, we can apply transformations on the graph to get a matrix that shows the common_like_score for all of the users at once. At that point, it should be trivial to get the pair with the highest score.

"Finding the number of paths of length n between two nodes in a graph" is a well-known, solved problem in Graph Theory.

Here are some good reference links to learn more about the theory:

  1. https://www.geeksforgeeks.org/find-the-number-of-paths-of-length-k-in-a-directed-graph/
  2. https://cp-algorithms.com/graph/fixed_length_paths.html

Basically, if you represent this data as an adjacency matrix, then you can square that matrix to get the common_like_score of every user pair at once!
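
Before the full implementation, a quick sanity check of the path-counting claim on a hypothetical 2-user, 2-post graph (both users liked post 1, only user 1 liked post 2):

import numpy as np

# Nodes ordered as [user1, user2, post1, post2]; 1 means "user liked post"
M = np.array([[0, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])

# Entry (0, 1) of M squared counts length-2 paths user1 -> post -> user2,
# i.e. the number of posts both users liked
print((M @ M)[0, 1])   # 1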

Here is my implementation:

import numpy as np

# Create an adjacency matrix over all nodes: users first, then posts
# (this assumes no value appears both as a user id and as a post id)
users = df["user id"].unique()
combined = np.concatenate((users, df["post id"].unique()), axis=0)
combined_dict = {node: i for i, node in enumerate(combined)}

n = len(combined)
M = np.zeros((n, n))

# itertuples renames columns whose names are not valid identifiers, so with just
# the 'user id' and 'post id' columns, _1 is the user id and _2 is the post id
for pair in df.itertuples():
    M[combined_dict[pair._1], combined_dict[pair._2]] = 1
    M[combined_dict[pair._2], combined_dict[pair._1]] = 1


# Square the matrix: entry (i, j) counts length-2 paths, i.e. shared likes
scores = M @ M

# Slice the matrix to only include user-user pairs
user_count = len(users)
scores = scores[:user_count, :user_count]

# Remove paths from a user back to the same user
for i in range(user_count):
    scores[i, i] = 0

print(scores)

Running this will produce a matrix that looks like this (without the user labels):

        User 1  User 2  User 3  User 4
User 1       0       2       1       0
User 2       2       0       1       1
User 3       1       1       0       0
User 4       0       1       0       0

At this point, finding the maximum user pair should be trivial (the maximum score here is 2, between User 1 and User 2). As an added bonus, you also have the scores for every user pair and can query them arbitrarily.
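
For instance, one way to pull out the top-scoring pair, assuming the `scores` matrix and `users` array built above:

# position of the largest entry in the symmetric user-by-user score matrix
i, j = np.unravel_index(np.argmax(scores), scores.shape)
print(users[i], users[j], scores[i, j])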

Answer 3:

Since you only care about the counts of common elements, you can create the dummies and then use a dot product to get the number of shared elements for every pair of users. Then we find the max.

Sample Data

import pandas as pd
import numpy as np

df = pd.DataFrame({'user id': list('ABCD'),
                   'liked_posts': ['12,14,141', '12,14,141,151',
                                   '1,14,151,15,1511', '2,4,1411,141']})

Code

# Dummies for each post
arr = df['liked_posts'].str.get_dummies(sep=',')
#   1  12  14  141  1411  15  151  1511  2  4
#0  0   1   1    1     0   0    0     0  0  0
#1  0   1   1    1     0   0    1     0  0  0
#2  1   0   1    0     0   1    1     1  0  0
#3  0   0   0    1     1   0    0     0  1  1 

# Shared counts. By definition symmetric
# diagonal -> 0 makes sure we don't consider same user with themself.
arr = np.dot(arr, arr.T)
np.fill_diagonal(arr, 0)

df1 = pd.DataFrame(index=df['user id'], columns=df['user id'], data=arr)
#user id  A  B  C  D
#user id            
#A        0  3  1  1
#B        3  0  2  1
#C        1  2  0  0
#D        1  1  0  0

# Find user pair with the most shared (will only return the 1st pair in case of ties)
df1.stack().idxmax()
#('A', 'B')
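
If several pairs tie for the maximum, `idxmax` only reports the first; one way to list every tied pair, reusing `arr` and `df` from above:

# all positions holding the maximum shared count (the matrix is symmetric)
rows, cols = np.where(arr == arr.max())
pairs = {tuple(sorted((df['user id'].iloc[r], df['user id'].iloc[c]))) for r, c in zip(rows, cols)}
print(pairs)
#{('A', 'B')}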
