Filter a list of images by a similarity relation

Question

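I have a list of image names and a (thresholded) similarity matrix. The similarity relation is reflexive and symmetric, but not necessarily transitive: if image_i is similar to both image_j and image_k, it does not follow that image_j and image_k are similar.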

For example:

import numpy as np

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

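The similarity matrix sm is read as follows: if sm[i, j] == 1 then image_i and image_j are similar, otherwise they are not. Here image_0 is similar to image_1 and image_2, but image_1 and image_2 are not similar to each other (this is just one example of non-transitivity).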

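I want to keep the largest possible number of unique images, i.e. images that are all pairwise non-similar according to the given sm matrix. For this example that would be [image_2, image_3, image_4] or [image_1, image_2, image_3] (in general there are several such subsets, and I don't mind which one is kept as long as it has maximum length). I'm looking for an efficient way to do this, since I have thousands of images.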
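To make the requirement concrete, here is a minimal sketch of a checker for "pairwise non-similar". The function name is hypothetical, and it only uses sm[i][j] indexing, so it accepts a nested list (for example sm.tolist()) as well as a NumPy array.

```python
from itertools import combinations


def is_pairwise_dissimilar(sm, subset):
    """Return True if no two distinct indices in `subset` are marked similar in `sm`."""
    return all(sm[i][j] == 0 for i, j in combinations(subset, 2))


sm = [[1, 1, 1, 0, 1],
      [1, 1, 0, 0, 1],
      [1, 0, 1, 0, 0],
      [0, 0, 0, 1, 0],
      [1, 1, 0, 0, 1]]

print(is_pairwise_dissimilar(sm, [2, 3, 4]))  # True
print(is_pairwise_dissimilar(sm, [1, 2, 3]))  # True
print(is_pairwise_dissimilar(sm, [1, 3, 4]))  # False: image_1 and image_4 are similar
```

This only validates a candidate subset; it says nothing about whether the subset is as large as possible.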

Edit: my original solution was the following

np.array(images)[np.tril(sm).sum(0) == 1]

but it is not guaranteed to return a subset of maximum length. Consider the following example:

sm = np.array([[1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1]])

This solution returns ['image_1', 'image_4'], whereas the desired result is ['image_0', 'image_2', 'image_4'] or ['image_1', 'image_2', 'image_4'].
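For small inputs like this one, the desired result can be confirmed by exhaustive search. The sketch below is not a practical solution (it takes exponential time) and the function name is made up for illustration; it simply tries subset sizes from largest to smallest and returns every valid subset of the first size that works.

```python
from itertools import combinations


def brute_force_largest_subsets(sm):
    """Return all largest subsets of indices that are pairwise non-similar.

    Exponential time: intended only for checking small examples.
    """
    n = len(sm)
    for size in range(n, 0, -1):
        found = [c for c in combinations(range(n), size)
                 if all(sm[i][j] == 0 for i, j in combinations(c, 2))]
        if found:
            return found
    return []


sm = [[1, 1, 0, 0, 0],
      [1, 1, 0, 0, 0],
      [0, 0, 1, 1, 0],
      [0, 0, 1, 1, 1],
      [0, 0, 0, 1, 1]]

print(brute_force_largest_subsets(sm))  # [(0, 2, 4), (1, 2, 4)]
```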

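Update: please see my answer, which explains the problem in more detail using graph theory. I'm still open to suggestions, since I haven't found a reasonably fast way to get the result for a list of thousands of images.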

Answer 1

After a bit of research I found that this is what graph theory calls the maximum independent set problem, which unfortunately is NP-hard.

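An independent set S of a graph G is a subset of the vertices of G such that no two vertices in S are adjacent. In our case we are looking for a maximum independent set (MIS), i.e. an independent set with the largest possible number of vertices.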

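There are a few libraries for working with graphs and networks, such as igraph or NetworkX, that have functions for finding maximum independent sets. I ended up using igraph.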

For my problem, we can treat the images as the vertices of a graph G and the "similarity matrix" as its adjacency matrix:

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

# Adjacency matrix
adj = sm.copy()
np.fill_diagonal(adj, 0)

# Create the graph
import igraph
g = igraph.Graph.Adjacency(adj.tolist(), mode='UNDIRECTED')

# Find the maximum independent sets
print(g.largest_independent_vertex_sets())
# [(1, 2, 3), (2, 3, 4)]

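Unfortunately this is too slow for thousands of images (vertices). So I'm still open to suggestions for a faster approach (perhaps finding just one MIS rather than all of them).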
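As an illustration of what "finding just one MIS" could look like, here is a minimal sketch of a textbook branching search: pick a vertex of highest remaining degree, then take the better of "keep it and drop its neighbours" and "drop it". This is not what igraph does internally, the function name is hypothetical, and the worst case is still exponential, so it is only a starting point for sparse or small graphs.

```python
def one_maximum_independent_set(adj):
    """Return one maximum independent set for a 0/1 adjacency (or similarity) matrix.

    The diagonal is ignored, so a similarity matrix can be passed directly.
    """
    n = len(adj)
    neighbours = [{j for j in range(n) if j != i and adj[i][j]} for i in range(n)]

    def solve(candidates):
        if not candidates:
            return set()
        # Pick the candidate with the most neighbours still in play
        v = max(candidates, key=lambda u: len(neighbours[u] & candidates))
        if not neighbours[v] & candidates:
            # No edges are left among the candidates, so all of them can be kept
            return set(candidates)
        # Branch 1: keep v, which rules out all of its neighbours
        with_v = {v} | solve(candidates - neighbours[v] - {v})
        # Branch 2: discard v
        without_v = solve(candidates - {v})
        return with_v if len(with_v) >= len(without_v) else without_v

    return solve(set(range(n)))


sm = [[1, 1, 1, 0, 1],
      [1, 1, 0, 0, 1],
      [1, 0, 1, 0, 0],
      [0, 0, 0, 1, 0],
      [1, 1, 0, 0, 1]]

best = one_maximum_independent_set(sm)
print(len(best))  # 3
```

Which of the equally large sets is returned depends on tie-breaking, so only the size is shown here.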

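Note: the solutions proposed by @Sergey (UPDATE#1) and @marke do not always return an MIS. They are greedy approximation algorithms that delete a vertex of maximum degree until no edges remain. To demonstrate this, consider the following example: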

sm = np.array([[1, 1, 0, 0, 0, 1],
               [1, 1, 0, 1, 0, 0],
               [0, 0, 1, 1, 1, 0],
               [0, 1, 1, 1, 0, 0],
               [0, 0, 1, 0, 1, 1],
               [1, 0, 0, 0, 1, 1]])

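Both solutions return [3, 5], but for this example there are two maximum independent sets, [(0, 3, 4), (1, 2, 5)], as igraph correctly finds. These solutions miss the MIS because of the order in which vertices and edges are removed at each iteration: np.argmax returns the first occurrence when the maximum value occurs several times, so ties are always broken toward the lowest index. (The original post illustrated this with an animated gif.)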
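The removal order can be reproduced in a few lines. The sketch below re-implements the greedy idea in plain Python rather than copying either answer's code; list.index stands in for np.argmax because both return the first maximum. It assumes a symmetric matrix with ones on the diagonal.

```python
def greedy_remove_max_degree(sm):
    """Repeatedly drop the first image that has the most similar images left."""
    ix = list(range(len(sm)))
    while ix:
        # For each remaining image, count the remaining images it is similar to (itself included)
        counts = [sum(sm[i][j] for j in ix) for i in ix]
        if max(counts) == 1:  # only self-similarity remains
            break
        # list.index returns the first maximum, like np.argmax
        del ix[counts.index(max(counts))]
    return ix


sm = [[1, 1, 0, 0, 0, 1],
      [1, 1, 0, 1, 0, 0],
      [0, 0, 1, 1, 1, 0],
      [0, 1, 1, 1, 0, 0],
      [0, 0, 1, 0, 1, 1],
      [1, 0, 0, 0, 1, 1]]

print(greedy_remove_max_degree(sm))  # [3, 5]
```

The graph here is a 6-cycle, so every vertex starts with the same degree. The vertices are removed in the order 0, 2, 1, 4, which leaves only two of the six.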

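Sergey's solution (UPDATE#2) seems to work, but it is much slower than igraph's largest_independent_vertex_sets(). For a speed comparison you can use the following randomly generated 100 x 100 similarity matrix: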

a = np.random.randint(2, size=(100, 100))

# create a symmetric similarity matrix
sm = np.tril(a) + np.tril(a, -1).T  
np.fill_diagonal(sm, 1)  

# create adjacency matrix for igraph
adj = sm.copy()
np.fill_diagonal(adj, 0)

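Update: it turns out that although I have thousands of images (vertices), the number of edges is relatively small (i.e. I have a sparse graph), so using igraph to find the MIS is acceptable in terms of speed. Alternatively, as a compromise, one can use a greedy approximation algorithm to find a large independent set (or an MIS, if lucky enough). Below is an algorithm that seems fairly fast: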

import random

import numpy as np


def independent_set(adj):
    '''
    Given adjacency matrix, returns an independent set
    of size >= np.sum(1/(1 + adj.sum(0)))
    '''
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    np.fill_diagonal(adj, 1)  # for the purposes of algorithm

    indep_set = set(range(len(adj)))
    # Loop until no edges remain
    while adj.sum(0).max() > 1:
        degrees = adj.sum(0)
        # Randomly pick a vertex v of max degree
        v = int(random.choice(np.where(degrees == degrees.max())[0]))
        # "Remove" the vertex v and the edges to its neighbours
        adj[v, :], adj[:, v] = 0, 0
        # Drop v from the candidate independent set
        indep_set.discard(v)
    return indep_set

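Or, even better, we can get a maximal independent set (one that cannot be extended by adding another vertex, though it is not necessarily a maximum one):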

def maximal_independent_set(adj):  
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    degrees = adj.sum(0)
    V = set(range(len(adj)))  # vertices of the graph
    mis = set()  # maximal independent set
    while V:
        # Randomly pick a vertex of min degree
        v = random.choice(np.where(degrees == degrees.min())[0])
        # Add it to the mis and remove it and its neighbours from V
        mis.add(v)
        Nv_c = set(np.nonzero(adj[v])[0]).union({v})  # closed neighbourhood of v
        V.difference_update(Nv_c)
        degrees[list(Nv_c)] = len(adj) + 1
    return mis
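Because both functions above pick vertices at random, one inexpensive way to improve the odds of hitting an MIS is to run the greedy pass several times and keep the largest result. The sketch below is a plain-Python variation, not the code above: it recomputes degrees within the remaining graph at every step, and the names, restart count, and seed are arbitrary choices for illustration.

```python
import random


def greedy_maximal_independent_set(adj, rng):
    """One randomized pass: repeatedly keep a minimum-degree vertex and drop its neighbours."""
    n = len(adj)
    neighbours = [{j for j in range(n) if j != i and adj[i][j]} for i in range(n)]
    remaining = set(range(n))
    chosen = set()
    while remaining:
        # Degrees are measured within the remaining graph only
        degree = {v: len(neighbours[v] & remaining) for v in remaining}
        lowest = min(degree.values())
        v = rng.choice(sorted(u for u in remaining if degree[u] == lowest))
        chosen.add(v)
        remaining -= neighbours[v] | {v}
    return chosen


def best_of_restarts(adj, restarts=20, seed=0):
    """Run the randomized pass several times and return the largest set found."""
    rng = random.Random(seed)
    best = set()
    for _ in range(restarts):
        candidate = greedy_maximal_independent_set(adj, rng)
        if len(candidate) > len(best):
            best = candidate
    return best


sm = [[1, 1, 0, 0, 0, 1],
      [1, 1, 0, 1, 0, 0],
      [0, 0, 1, 1, 1, 0],
      [0, 1, 1, 1, 0, 0],
      [0, 0, 1, 0, 1, 1],
      [1, 0, 0, 0, 1, 1]]

print(len(best_of_restarts(sm)))  # 3
```

Each pass is still a heuristic, so the result is guaranteed to be maximal but not maximum. Recomputing degrees this way is also slower per pass than the NumPy version.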

Answer 2

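As I understand it, unique images are those that are not similar to any other image. If that is the case, we can sum the rows (or columns) and select the elements of the result that equal 1. Then we need to take the same elements from the list of images.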

At the moment I don't know how to remove the loop in the second step.

[images[i] for i in np.where(sm.sum(0) == 1)[0]]
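For reference, here is what this first attempt keeps on the example from the question, written in plain Python so the column sums are visible. It is a sketch of the same idea, not a replacement for the one-liner.

```python
images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = [[1, 1, 1, 0, 1],
      [1, 1, 0, 0, 1],
      [1, 0, 1, 0, 0],
      [0, 0, 0, 1, 0],
      [1, 1, 0, 0, 1]]

# Column j sums to 1 only when image_j is similar to nothing but itself
column_sums = [sum(row[j] for row in sm) for j in range(len(sm))]
kept = [images[j] for j, total in enumerate(column_sums) if total == 1]

print(column_sums)  # [4, 3, 2, 1, 3]
print(kept)         # ['image_3']
```

Only the completely isolated image survives, which is smaller than the three-image subsets the question asks for.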

Update #1

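The discussion above gave us a new understanding of the problem.

A new idea is to delete images one at a time, each time choosing the one with the largest number of similar images.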

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

ix = list(range(len(images)))

while sm[ix].T[ix].sum() != len(ix): # exit if we got the identity matrix
  va = sm[ix].T[ix].sum(0)           # count similar images
  jx = np.argmax(va)                 # get the index of the worst image
  del ix[jx]                         # delete index of the worst image

print([images[i] for i in ix])

Output:

['image_2', 'image_3', 'image_4']

Update #2

The same, but branching on every image that ties for the worst similarity count:

res = []

def get_wres(sm, ix):
  if sm[ix].T[ix].sum() == len(ix):
    res.append(list(ix))
    return
  va = sm[ix].T[ix].sum(0) # count similar images
  vx = np.max(va)          # get the value of the worst
  for i in range(len(ix)): # check every image
    if va[i] == vx:        # for the worst value
      ixn = list(ix)       # isolate one worst
      del ixn[i]           # image and
      get_wres(sm, ixn)    # try without it

get_wres(sm, ix)
print(res)

Output:

[[2, 3, 4], [1, 2, 3]]

Answer 3

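Final edit: this solution is wrong; see the asker's answer. I'm leaving this post up because it has been referenced a few times.

Here is a for loop; I don't know how to do it without one: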

results = [images[i] for i in range(len(images)) if sum(sm[i][i:]) == 1]

Edit:

Here is a corrected solution. It is essentially the same as @Sergey's, implemented differently:

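def put_zeros_to_image_with_most_similarities(arr: np.ndarray):
    # Index of the image with the most similar images
    index = np.sum(arr, axis=1).argmax()
    # Nothing to do once every remaining image is similar only to itself
    if np.sum(arr[index], axis=0) == 1:
        return
    # Zero out its row and column (this modifies arr in place)
    arr[index] = 0
    arr[:, index] = 0


for _ in sm:
    put_zeros_to_image_with_most_similarities(sm)

results = [images[i] for i in range(len(images)) if sum(sm[i][i:]) == 1]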

Answer 1
5 Accepted 2020-01-26 23:21:53

Answer 2
3 2020-01-25 09:15:16

Answer 3
1 2020-01-25 10:31:07