Filter a list of images by a similarity relation

Question

I have a list of image names and a (thresholded) similarity matrix for them. The similarity relation is reflexive and symmetric but not necessarily transitive, i.e. if image_i is similar to image_j and to image_k, it does not necessarily follow that image_j and image_k are similar.

For example:

import numpy as np

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

The similarity matrix sm is interpreted as follows: if sm[i, j] == 1 then image_i and image_j are similar, otherwise they are not similar. Here we see that image_0 is similar to image_1 and image_2, but image_1 and image_2 are not similar (this is just one example of non-transitivity).

I want to keep the maximum number of unique images (that are all pairwise non-similar according to the given sm matrix). For this example it would be [image_2, image_3, image_4] or [image_1, image_2, image_3] (in general there are multiple such subsets, and I don't mind which one is kept as long as it is of maximum size). I am looking for an efficient way to do this since I have thousands of images.

Edit: My original solution was the following:
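For very small inputs, the requirement can be checked by brute force. The sketch below is only an illustration (the function name `largest_dissimilar_subsets` is made up here, and the search is exponential, so it is useless for thousands of images), but it makes the desired result for the example explicit:

```python
from itertools import combinations

def largest_dissimilar_subsets(sm):
    """Return every largest subset of indices that are pairwise non-similar."""
    n = len(sm)
    # Try the biggest subset sizes first and stop at the first size that works.
    for size in range(n, 0, -1):
        found = [
            subset
            for subset in combinations(range(n), size)
            if all(sm[i][j] == 0 for i, j in combinations(subset, 2))
        ]
        if found:
            return found
    return []

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
sm = [[1, 1, 1, 0, 1],
      [1, 1, 0, 0, 1],
      [1, 0, 1, 0, 0],
      [0, 0, 0, 1, 0],
      [1, 1, 0, 0, 1]]

for subset in largest_dissimilar_subsets(sm):
    print([images[i] for i in subset])
# ['image_1', 'image_2', 'image_3']
# ['image_2', 'image_3', 'image_4']
```

The function indexes with `sm[i][j]`, so it accepts a nested list as well as a NumPy array.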

np.array(images)[np.tril(sm).sum(0) == 1]

However, it is not guaranteed to return a maximum-length subset. Consider the following example:

sm = np.array([[1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1]])

This solution will return ['image_1', 'image_4'], whereas the desired result is ['image_0', 'image_2', 'image_4'] or ['image_1', 'image_2', 'image_4'].

Update: Please see my answer, which explains the problem in more detail using graph theory. I am still open to suggestions since I haven't found a reasonably fast way to achieve the result for a list of thousands of images.

Answer 1

After researching it a little more, I found that this is the so-called maximum independent set problem in graph theory, which is unfortunately NP-hard.

An independent set S of a graph G is a subset of vertices of G such that no two vertices in S are adjacent to one another. In our case, we are looking for a maximum independent set (MIS), i.e. an independent set with the largest possible number of vertices.

There are a couple of libraries for working with graphs and networks, such as igraph or NetworkX, which have functions for finding independent sets. I ended up using igraph.

For my problem, we can think of the images as vertices of a graph G and the "similarity matrix" as the adjacency matrix:
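To make the definition concrete, here is a small sketch of two checks (the helper names `is_independent` and `is_maximal` are illustrative, not part of any library). It assumes `adj` is an adjacency matrix with a zero diagonal. The second helper highlights a distinction that matters later: a maximal independent set cannot be extended by any further vertex, but it is not necessarily a maximum one.

```python
def is_independent(adj, vertices):
    """True if no two of the given vertices are adjacent."""
    vs = list(vertices)
    return all(adj[u][v] == 0 for i, u in enumerate(vs) for v in vs[i + 1:])

def is_maximal(adj, vertices):
    """True if the set is independent and no other vertex can be added to it."""
    chosen = set(vertices)
    if not is_independent(adj, chosen):
        return False
    # Every vertex outside the set must have at least one neighbour inside it.
    return all(
        any(adj[u][v] for v in chosen)
        for u in range(len(adj)) if u not in chosen
    )

# Adjacency matrix of the first example (similarity matrix with a zero diagonal)
adj = [[0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0]]

print(is_maximal(adj, {0, 3}))     # True: maximal, but only 2 vertices
print(is_maximal(adj, {1, 2, 3}))  # True: maximal and also maximum (3 vertices)
print(is_maximal(adj, {2, 3}))     # False: vertex 1 or 4 could still be added
```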

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

# Adjacency matrix
adj = sm.copy()
np.fill_diagonal(adj, 0)

# Create the graph
import igraph
g = igraph.Graph.Adjacency(adj.tolist(), mode='UNDIRECTED')

# Find the maximum independent sets
print(g.largest_independent_vertex_sets())
# [(1, 2, 3), (2, 3, 4)]

Unfortunately this is too slow for thousands of images (vertices). So I am still open to suggestions for a faster way to do it (perhaps instead of finding all the MIS, just find one).

Note: the solutions proposed by @Sergey (UPDATE#1) and @marke don't always return a MIS. They are greedy approximate algorithms that delete a vertex of maximum degree until no edge remains. To demonstrate this, consider the following example:
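The idea of finding just one MIS rather than all of them could look roughly like the branch-and-bound sketch below. This is not what igraph does internally; it is a plain-Python illustration with a made-up function name, it is still exponential in the worst case, and its recursion depth grows with the number of vertices, so it is only a starting point for sparse or small graphs.

```python
def one_maximum_independent_set(adj):
    """Return one maximum independent set of the graph given by adj."""
    n = len(adj)
    neighbours = [{v for v in range(n) if v != u and adj[u][v]} for u in range(n)]
    best = set()

    def search(candidates, chosen):
        nonlocal best
        # Prune: even taking every candidate cannot beat the best set so far.
        if len(chosen) + len(candidates) <= len(best):
            return
        if not candidates:
            best = set(chosen)
            return
        # Branch on the candidate with the most neighbours still in play.
        v = max(candidates, key=lambda u: len(neighbours[u] & candidates))
        if not (neighbours[v] & candidates):
            # No edges left among the candidates, so all of them can be taken.
            best = chosen | candidates
            return
        # Either v is in the set (then its neighbours are not) ...
        search(candidates - neighbours[v] - {v}, chosen | {v})
        # ... or v is left out.
        search(candidates - {v}, chosen)

    search(set(range(n)), set())
    return best

sm = [[1, 1, 1, 0, 1],
      [1, 1, 0, 0, 1],
      [1, 0, 1, 0, 0],
      [0, 0, 0, 1, 0],
      [1, 1, 0, 0, 1]]
print(len(one_maximum_independent_set(sm)))  # 3
```

The diagonal of the input is ignored, so the similarity matrix can be passed directly.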

sm = np.array([[1, 1, 0, 0, 0, 1],
               [1, 1, 0, 1, 0, 0],
               [0, 0, 1, 1, 1, 0],
               [0, 1, 1, 1, 0, 0],
               [0, 0, 1, 0, 1, 1],
               [1, 0, 0, 0, 1, 1]])

Both solutions return [3, 5], but for this example there are two maximum independent sets, [(0, 3, 4), (1, 2, 5)], as igraph correctly finds. These solutions fail to find a MIS because of the order in which vertices and edges are removed at each iteration, which is a side effect of np.argmax returning the first occurrence when the maximum value occurs more than once.

Sergey's solution (UPDATE#2) seems to work, but it is much slower than igraph's largest_independent_vertex_sets(). For a speed comparison you can use the following randomly generated 100 x 100 similarity matrix:
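The removal order can be traced with a short sketch. The function below is a plain-Python re-implementation of the "delete the first vertex of maximum degree" idea, written for illustration only (it is not the code of either answer), with `list.index` playing the role of `np.argmax` in breaking ties toward the first occurrence:

```python
def greedy_first_max(sm):
    """Repeatedly drop the first vertex of maximum degree until no edges remain.

    Returns the surviving vertices and the order in which vertices were removed.
    """
    alive = list(range(len(sm)))
    removed = []
    while alive:
        # Degree of each surviving vertex, counting only surviving neighbours.
        degrees = [sum(sm[u][v] for v in alive if v != u) for u in alive]
        if max(degrees) == 0:
            break
        # list.index returns the first maximum, mirroring np.argmax.
        worst = alive[degrees.index(max(degrees))]
        alive.remove(worst)
        removed.append(worst)
    return alive, removed

sm = [[1, 1, 0, 0, 0, 1],
      [1, 1, 0, 1, 0, 0],
      [0, 0, 1, 1, 1, 0],
      [0, 1, 1, 1, 0, 0],
      [0, 0, 1, 0, 1, 1],
      [1, 0, 0, 0, 1, 1]]

print(greedy_first_max(sm))
# ([3, 5], [0, 2, 1, 4])
```

The graph is a 6-cycle, so every vertex starts with degree 2. Removing vertex 0 first and then vertex 2 leaves only two isolated edges, and each of them costs one more vertex, which leaves 2 survivors instead of 3.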

a = np.random.randint(2, size=(100, 100))

# create a symmetric similarity matrix
sm = np.tril(a) + np.tril(a, -1).T  
np.fill_diagonal(sm, 1)  

# create adjacency matrix for igraph
adj = sm.copy()
np.fill_diagonal(adj, 0)

Update: it turns out that although I have thousands of images (vertices), the number of edges is relatively small (i.e. I have a sparse graph), so using igraph to find a MIS is acceptable in terms of speed. Alternatively, as a compromise, one could use a greedy approximate algorithm to find a large independent set (or a MIS, if lucky enough). Below is an algorithm which seems pretty fast:

import random

import numpy as np

def independent_set(adj):
    '''
    Given an adjacency matrix, returns an independent set
    of size >= np.sum(1/(1 + adj.sum(0)))
    '''
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    np.fill_diagonal(adj, 1)  # for the purposes of the algorithm

    indep_set = set(range(len(adj)))
    # Loop until no edges remain (each column then sums to at most 1)
    while adj.sum(0).max() > 1:
        degrees = adj.sum(0)
        # Randomly pick a vertex v of max degree
        v = random.choice(np.where(degrees == degrees.max())[0])
        # "Remove" the vertex v and the edges to its neighbours
        adj[v, :], adj[:, v] = 0, 0
        # Update the independent set
        indep_set.difference_update({v})
    return indep_set

Or even better, we can get a maximal independent set:

def maximal_independent_set(adj):  
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    degrees = adj.sum(0)
    V = set(range(len(adj)))  # vertices of the graph
    mis = set()  # maximal independent set
    while V:
        # Randomly pick a vertex of min degree
        v = random.choice(np.where(degrees == degrees.min())[0])
        # Add it to the mis and remove it and its neighbours from V
        mis.add(v)
        Nv_c = set(np.nonzero(adj[v])[0]).union({v})  # closed neighbourhood of v
        V.difference_update(Nv_c)
        degrees[list(Nv_c)] = len(adj) + 1
    return mis
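Since the graph is sparse, the same minimum-degree idea can also be sketched on adjacency sets instead of a dense matrix. The variant below is an assumption-laden illustration, not the code above: it breaks ties by vertex index rather than randomly, it updates degrees as vertices are removed, and the names `to_neighbours` and `greedy_min_degree` are invented for this example.

```python
import heapq

def to_neighbours(sm):
    """Convert a 0/1 similarity matrix into adjacency sets (self-loops dropped)."""
    n = len(sm)
    return {i: {j for j in range(n) if j != i and sm[i][j]} for i in range(n)}

def greedy_min_degree(neighbours):
    """Greedily build a maximal independent set, lowest current degree first."""
    degree = {v: len(ns) for v, ns in neighbours.items()}
    alive = set(neighbours)
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    chosen = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v not in alive or d != degree[v]:
            continue  # stale heap entry
        chosen.add(v)
        # Remove v together with its surviving neighbours.
        removed = (neighbours[v] & alive) | {v}
        alive -= removed
        # Survivors lose one degree for each removed neighbour.
        for u in removed:
            for w in neighbours[u]:
                if w in alive:
                    degree[w] -= 1
                    heapq.heappush(heap, (degree[w], w))
    return chosen

sm = [[1, 1, 0, 0, 0, 1],
      [1, 1, 0, 1, 0, 0],
      [0, 0, 1, 1, 1, 0],
      [0, 1, 1, 1, 0, 0],
      [0, 0, 1, 0, 1, 1],
      [1, 0, 0, 0, 1, 1]]
print(len(greedy_min_degree(to_neighbours(sm))))  # 3
```

The result is always a maximal independent set, but like any greedy heuristic it is not guaranteed to be a maximum one.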

Answer 2

As I understand it, unique images are those that are not similar to any others. If so, we can sum the rows (or columns) and select the elements of the result that equal 1. Then we take the corresponding elements from the list of images.

At the moment I don't know how to remove the loop in the second step.

[images[i] for i in np.where(sm.sum(0) == 1)[0]]

UPDATE#1

The discussion above leads to a new understanding of the problem.

A new idea is to delete images one at a time, each time choosing the one that has the largest number of similar images.

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

ix = list(range(len(images)))

while sm[ix].T[ix].sum() != len(ix): # exit if we got the identity matrix
  va = sm[ix].T[ix].sum(0)           # count similar images
  jx = np.argmax(va)                 # get the index of the worst image
  del ix[jx]                         # delete index of the worst image

print([images[i] for i in ix])

Output:

['image_2', 'image_3', 'image_4']

UPDATE#2

The same, but checking every branch that has the worst similarity value:

res = []

def get_wres(sm, ix):
  if sm[ix].T[ix].sum() == len(ix):
    res.append(list(ix))
    return
  va = sm[ix].T[ix].sum(0) # count similar images
  vx = np.max(va)          # get the value of the worst
  for i in range(len(ix)): # check every image
    if va[i] == vx:        # for the worst value
      ixn = list(ix)       # isolate one worst
      del ixn[i]           # image and
      get_wres(sm, ixn)    # try without it

get_wres(sm, ix)
print(res)

Output:

[[2, 3, 4], [1, 2, 3]]

Answer 3

Final edit: This solution is wrong; see the original poster's answer. I am leaving this post up because it was mentioned a couple of times.

Here it is with a for loop; I am not sure how to get it done without one:

results = [images[i] for i in range(len(images)) if sum(sm[i][i:]) == 1]

Edit:

Here is a corrected solution. It does essentially the same thing as @Sergey's solution, but in a different way:

def put_zeros_to_image_with_most_similarities(arr: np.ndarray):
    # Row with the most similarities (the first one on ties)
    index = np.sum(arr, axis=1).argmax()
    if np.sum(arr[index], axis=0) == 1:
        return  # only self-similarity remains, nothing to remove
    arr[index] = 0
    arr[:, index] = 0

for _ in sm:
    put_zeros_to_image_with_most_similarities(sm)  # modifies sm in place
results = [images[i] for i in range(len(images)) if sum(sm[i][i:]) == 1]

3 answers

Answer 1 (score 5, accepted), 2020-01-26 23:21:53
Answer 2 (score 3), 2020-01-25 09:15:16
Answer 3 (score 1), 2020-01-25 10:31:07
