
How to calculate user-similarity matrix in a more efficient manner?

I have a set of 10 users, each with their own folder/directory containing 25-30 images shared by them (on some social media platform, say). I want to calculate the similarity between the users based on the images they have shared.

For that, I load each image as a 224x224x3 array and run it through a feature extractor (VGG16's fc2 layer) to get a feature vector. I then loop through each user and each of the images in their folders to find the cosine similarity between each pair of images, and take the average of all those pairwise image similarities for each pair of users to get the user similarity. (Please let me know if there is a mistake in this logic, by the way.)
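Spelling this logic out as a formula, with f_1..f_m the feature vectors of user u1's images and g_1..g_n those of user u2's:

    sim(u1, u2) = (1 / (m * n)) * Σ_{i=1..m} Σ_{j=1..n} cos(f_i, g_j)

i.e. the mean cosine similarity over all m*n image pairs, which is the averaging the code below implements.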

My code to do all this is as follows:

from tensorflow.keras.applications.imagenet_utils import preprocess_input
from tensorflow.keras.applications import vgg16
from tensorflow.keras.preprocessing.image import load_img,img_to_array
from tensorflow.keras.models import Model

import os
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# load the model
vgg_model = vgg16.VGG16(weights='imagenet')

# remove the last layers in order to get features instead of predictions
feat_extractor = Model(inputs=vgg_model.input, outputs=vgg_model.get_layer("fc2").output)

def processed_image(image):
    original = load_img(image, target_size=(224, 224))
    numpy_image = img_to_array(original)
    image_batch = np.expand_dims(numpy_image, axis=0)
    processed_image = preprocess_input(image_batch.copy())
    img_features = feat_extractor.predict(processed_image)
    return img_features

def image_similarity(image1, image2):
    image1 = processed_image(image1)
    image2 = processed_image(image2)
    sim = cosine_similarity(image1, image2)
    return sim[0][0]

user_list = ['User '+str(i) for i in range(1,11)]
user_sim_df = pd.DataFrame(columns=user_list, index=user_list)
for user1 in user_list:
    for user2 in user_list:
        sum_img_sim = 0
        # paths must include the per-user folder, otherwise load_img() cannot find the files
        user1_files = ['All_Users/'+user1+'/'+x for x in os.listdir('All_Users/'+user1) if "jpg" in x]
        user2_files = ['All_Users/'+user2+'/'+x for x in os.listdir('All_Users/'+user2) if "jpg" in x]
        
        for image1 in user1_files:
            for image2 in user2_files:
                sum_img_sim += image_similarity(image1, image2)
        
        # average over all len(user1_files) * len(user2_files) image pairs
        user_sim_df.loc[user2, user1] = sum_img_sim/(len(user1_files)*len(user2_files))

Now, because there are 4 for loops involved in calculating the user-similarity matrix, the code takes a long time to run (as of typing this question, it has been running for more than 30 minutes for 10 users with 25-30 images each).

So, how do I rewrite the last portion of this to make the code run faster?

Nested for loops are particularly bad for Python, but some work can be done to improve things here.

First of all, you are doing the work twice in the comparisons: user_sim_df[user_i][user_j] has the same value as user_sim_df[user_j][user_i] for all pairs i, j, so you could reuse the already-calculated values instead of computing them again in later iterations. Besides this, is computing the values on the diagonal (user_sim_df[user_i][user_i]) necessary for your application?
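As a minimal sketch of that idea (assuming the image_similarity() helper, user_list and the All_Users folder layout from the question, and treating each user's self-similarity as 1.0 so the diagonal needs no computation):

import os
from itertools import combinations

import pandas as pd

def user_files(user):
    # collect this user's image paths once
    folder = 'All_Users/' + user
    return [os.path.join(folder, x) for x in os.listdir(folder) if "jpg" in x]

# start from 1.0 everywhere so the diagonal (self-similarity) is already filled
user_sim_df = pd.DataFrame(1.0, index=user_list, columns=user_list)

for user1, user2 in combinations(user_list, 2):  # each unordered pair exactly once
    files1, files2 = user_files(user1), user_files(user2)
    sum_img_sim = sum(image_similarity(i1, i2)
                      for i1 in files1 for i2 in files2)
    sim = sum_img_sim / (len(files1) * len(files2))  # mean over all image pairs
    user_sim_df.loc[user1, user2] = sim
    user_sim_df.loc[user2, user1] = sim              # symmetry: reuse the value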

These simple changes will roughly halve the execution time. Is that enough? Maybe not. Further lines of improvement:

  1. the img_to_array() preprocessing and feature extraction are applied to every image many times (every time you calculate its similarity with another image). Is it a bottleneck? In that case, performance could also improve if you first run one loop over all images and save each result to a file that numpy can read back later, for example with numpy.save()/numpy.load() - or simply cache the feature vectors output by the TensorFlow model currently being used (see the sketch after this list).
  2. if you're using the standard Python interpreter, changing to PyPy can help (in general). You could also try adapting the code to consist only of operations on numpy structures (e.g. adapt the pandas parts) and use Numba in a way similar to this SO link. Using Numba you can also benefit from parallelism. See some practical guidelines here.
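Building on point 1, here is a hedged sketch of the precompute-then-vectorize approach: run each image through the network exactly once, stack the resulting fc2 feature rows per user, and let a single cosine_similarity() call produce all pairwise image similarities for a user pair. processed_image(), user_list, user_sim_df and the folder layout are assumed from the question and the sketch above; the features.npy cache file name is just an illustration.

import os
from itertools import combinations

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Pass 1: extract every feature vector exactly once, caching per user.
user_features = {}
for user in user_list:
    folder = 'All_Users/' + user
    files = [os.path.join(folder, x) for x in os.listdir(folder) if "jpg" in x]
    # processed_image() returns one (1, 4096) fc2 feature row per image
    feats = np.vstack([processed_image(f) for f in files])
    np.save(os.path.join(folder, 'features.npy'), feats)  # optional on-disk cache
    user_features[user] = feats

# Pass 2: one vectorized call per user pair replaces the Python double loop.
for user1, user2 in combinations(user_list, 2):
    pairwise = cosine_similarity(user_features[user1], user_features[user2])
    sim = pairwise.mean()  # mean over all image pairs
    user_sim_df.loc[user1, user2] = sim
    user_sim_df.loc[user2, user1] = sim

With the features cached, each pair comparison reduces to one matrix product inside cosine_similarity(), so the original quadruple loop collapses to a single loop over user pairs.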
