
Filter a list of images by similarity relationship

I have a list of image names and a (thresholded) similarity matrix for them. The similarity relation is reflexive and symmetric but not necessarily transitive, i.e. if image_i is similar to image_j and to image_k, it doesn't necessarily mean that image_j and image_k are similar.

For example:

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

The similarity matrix sm is interpreted as follows: if sm[i, j] == 1 then image_i and image_j are similar, otherwise they are not similar. Here we see that image_0 is similar to image_1 and image_2 , but image_1 and image_2 are not similar (this is just one example of non-transitivity).

I want to keep the maximum number of unique images (that are all pairwise non-similar according to the given sm matrix). For this example it would be [image_2, image_3, image_4] or [image_1, image_2, image_3] (in general there are multiple such subsets but I don't mind which to keep as long as they are of maximum length). I am looking for an efficient way to do this since I have thousands of images.

Edit: My original solution was the following:

np.array(images)[np.tril(sm).sum(0) == 1]

However, it's not guaranteed to return a maximum-length subset. Consider the following example:

sm = np.array([[1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1]])

This solution will return ['image_1', 'image_4'], whereas the desired result is ['image_0', 'image_2', 'image_4'] or ['image_1', 'image_2', 'image_4'].
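To see this concretely, here is a minimal check (using the counterexample matrix above) showing that the lower-triangular one-liner keeps only two images here, even though an independent subset of size three exists:

```python
import numpy as np

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
sm = np.array([[1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1]])

# The original one-liner: keep image_i if the i-th column of the
# lower triangle sums to 1 (i.e. no earlier image is similar to it)
kept = np.array(images)[np.tril(sm).sum(0) == 1]
print(kept)  # ['image_1' 'image_4'] -- only 2 images instead of 3
```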

Update: Please see my answer, which explains the problem in more detail using graph theory. I am still open to suggestions, since I haven't found a reasonably fast way to achieve the result for a list of thousands of images.

After researching it a little more, I found that this is the so-called maximum independent set problem in graph theory, which is unfortunately NP-hard.

An independent set S of a graph G is a subset of the vertices of G such that no two vertices in S are adjacent. In our case, we are looking for a maximum independent set (MIS), i.e. an independent set with the largest possible number of vertices.

There are a couple of libraries for working with graphs and networks, such as igraph or NetworkX, which have functions for finding maximum independent sets. I ended up using igraph.

For my problem, we can think of the images as vertices of a graph G and the "similarity matrix" as the adjacency matrix:

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

import igraph

# Adjacency matrix (zero out the diagonal -- no self-loops)
adj = sm.copy()
np.fill_diagonal(adj, 0)

# Create the graph
g = igraph.Graph.Adjacency(adj.tolist(), mode='UNDIRECTED')

# Find the maximum independent sets
g.largest_independent_vertex_sets()
[(1, 2, 3), (2, 3, 4)]
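To map these vertex indices back to image names (taking the index tuples reported by igraph above as given):

```python
images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
# Index sets as returned by largest_independent_vertex_sets() above
mis_sets = [(1, 2, 3), (2, 3, 4)]
named = [[images[i] for i in s] for s in mis_sets]
print(named)  # [['image_1', 'image_2', 'image_3'], ['image_2', 'image_3', 'image_4']]
```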

Unfortunately this is too slow for thousands of images (vertices). So I am still open to suggestions for a faster way to do it (perhaps, instead of finding all maximum independent sets, just finding one).
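For completeness, a brute-force check (feasible only for small graphs; an illustrative sketch, not a practical method) confirms that the maximum independent set of the 5-image example has size 3:

```python
import numpy as np
from itertools import combinations

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])
adj = sm.copy()
np.fill_diagonal(adj, 0)

n = len(adj)

def is_independent(subset):
    # No pair of vertices in the subset may be adjacent
    return all(adj[i, j] == 0 for i, j in combinations(subset, 2))

# Try subsets from largest to smallest; the first independent one is a MIS
best = next(s for k in range(n, 0, -1)
            for s in combinations(range(n), k) if is_independent(s))
print(best)  # a maximum independent set of size 3
```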

Note: the proposed solutions by @Sergey (UPDATE#1) and @marke don't always return a MIS -- they are greedy approximate algorithms that delete a vertex of maximum degree until no edges remain. To demonstrate this, consider the following example:

sm = np.array([[1, 1, 0, 0, 0, 1],
               [1, 1, 0, 1, 0, 0],
               [0, 0, 1, 1, 1, 0],
               [0, 1, 1, 1, 0, 0],
               [0, 0, 1, 0, 1, 1],
               [1, 0, 0, 0, 1, 1]])

Both solutions return [3, 5], but for this example there are two maximum independent sets, [(0, 3, 4), (1, 2, 5)], as correctly found by igraph. To see why these solutions fail to find a MIS, below is a gif that shows how the vertices and edges are removed at each iteration (the "side effect" being that np.argmax returns the first occurrence when the maximum value occurs multiple times):

[gif omitted: vertices and edges removed at each iteration of the greedy algorithm]
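The failure can also be reproduced with a small sketch of the max-degree greedy deletion (a paraphrase of the UPDATE#1 loop, with np.argmax breaking ties by first occurrence):

```python
import numpy as np

sm = np.array([[1, 1, 0, 0, 0, 1],
               [1, 1, 0, 1, 0, 0],
               [0, 0, 1, 1, 1, 0],
               [0, 1, 1, 1, 0, 0],
               [0, 0, 1, 0, 1, 1],
               [1, 0, 0, 0, 1, 1]])

ix = list(range(len(sm)))
# Greedily delete the (first) vertex of maximum degree until no edges remain
while sm[ix].T[ix].sum() != len(ix):
    va = sm[ix].T[ix].sum(0)
    del ix[np.argmax(va)]

print(ix)  # [3, 5] -- an independent set, but not a maximum one (size 3 exists)
```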

Sergey's solution (UPDATE#2) seems to work; however, it is much slower than igraph's largest_independent_vertex_sets(). For a speed comparison you can use the following randomly generated similarity matrix of size 100:

a = np.random.randint(2, size=(100, 100))

# create a symmetric similarity matrix
sm = np.tril(a) + np.tril(a, -1).T  
np.fill_diagonal(sm, 1)  

# create adjacency matrix for igraph
adj = sm.copy()
np.fill_diagonal(adj, 0)

Update: it turns out that although I have thousands of images (vertices), the number of edges is relatively small (i.e. I have a sparse graph), so using igraph to find a MIS is acceptable in terms of speed. Alternatively, as a compromise, one could use a greedy approximate algorithm for finding a large independent set (or a MIS, if lucky enough). Below is an algorithm which seems pretty fast:

import random
import numpy as np

def independent_set(adj):
    '''
    Given an adjacency matrix, returns an independent set
    of size >= np.sum(1 / (1 + adj.sum(0)))
    '''
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    np.fill_diagonal(adj, 1)  # for the purposes of the algorithm

    indep_set = set(range(len(adj)))
    # Loop until no edges remain
    while adj.sum(0).max() > 1:
        degrees = adj.sum(0)
        # Randomly pick a vertex v of max degree
        v = random.choice(np.where(degrees == degrees.max())[0])
        # "Remove" the vertex v and the edges to its neighbours
        adj[v, :], adj[:, v] = 0, 0
        # Update the independent set
        indep_set.difference_update({v})
    return indep_set

Or even better, we can get a maximal independent set:

def maximal_independent_set(adj):
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    degrees = adj.sum(0)
    V = set(range(len(adj)))  # vertices of the graph
    mis = set()  # maximal independent set
    while V:
        # Randomly pick a vertex of min degree
        v = random.choice(np.where(degrees == degrees.min())[0])
        # Add it to the mis and remove it and its neighbours from V
        mis.add(v)
        Nv_c = set(np.nonzero(adj[v])[0]).union({v})  # closed neighbourhood of v
        V.difference_update(Nv_c)
        # Exclude the removed vertices from future degree minima
        degrees[list(Nv_c)] = len(adj) + 1
    return mis
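As a quick property check (a self-contained sketch; the function body is repeated so the snippet runs on its own), the min-degree greedy always returns a set that is independent and maximal, although its size may vary between runs because ties are broken at random:

```python
import random
import numpy as np

def maximal_independent_set(adj):
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    degrees = adj.sum(0)
    V = set(range(len(adj)))
    mis = set()
    while V:
        v = random.choice(np.where(degrees == degrees.min())[0])
        mis.add(v)
        Nv_c = set(np.nonzero(adj[v])[0]).union({v})
        V.difference_update(Nv_c)
        degrees[list(Nv_c)] = len(adj) + 1
    return mis

# Adjacency matrix of the 6-vertex counterexample (a 6-cycle)
adj = np.array([[0, 1, 0, 0, 0, 1],
                [1, 0, 0, 1, 0, 0],
                [0, 0, 0, 1, 1, 0],
                [0, 1, 1, 0, 0, 0],
                [0, 0, 1, 0, 0, 1],
                [1, 0, 0, 0, 1, 0]])

mis = maximal_independent_set(adj)
# Independent: no edge inside mis
assert all(adj[i, j] == 0 for i in mis for j in mis)
# Maximal: every vertex outside mis has a neighbour in mis
assert all(any(adj[v, u] for u in mis) for v in set(range(6)) - mis)
print(mis)
```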

As I understand it, unique images are those that are not similar to any others. If this is the case, then we can sum the rows (or columns) and select those elements of the result that are equal to 1. Then we need to take the corresponding elements from the list of images.

At the moment I don't know how to avoid the loop in the second step.

[images[i] for i in np.where(sm.sum(0) == 1)[0]]
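On the question's first example this keeps only image_3, the one image that is similar to nothing else (a minimal check, reusing the sm and images from the question):

```python
import numpy as np

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

# A column sum of 1 means the image is similar only to itself
unique = [images[i] for i in np.where(sm.sum(0) == 1)[0]]
print(unique)  # ['image_3']
```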

UPDATE#1

The discussion above leads to a new understanding of the problem.

A new idea is to delete images one at a time, choosing those that have the maximum number of similar ones.

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

ix = list(range(len(images)))

while sm[ix].T[ix].sum() != len(ix): # exit if we got the identity matrix
  va = sm[ix].T[ix].sum(0)           # count similar images
  jx = np.argmax(va)                 # get the index of the worst image
  del ix[jx]                         # delete index of the worst image

print([images[i] for i in ix])

Output:

['image_2', 'image_3', 'image_4']

UPDATE#2

The same, but checking every branch that has the worst similarity value:

res = []

def get_wres(sm, ix):
  if sm[ix].T[ix].sum() == len(ix):
    res.append(list(ix))
    return
  va = sm[ix].T[ix].sum(0) # count similar images
  vx = np.max(va)          # get the value of the worst
  for i in range(len(ix)): # check every image
    if va[i] == vx:        # for the worst value
      ixn = list(ix)       # isolate one worst
      del ixn[i]           # image and
      get_wres(sm, ixn)    # try without it

ix = list(range(len(images)))
get_wres(sm, ix)
print(res)

Output:

[[2, 3, 4], [1, 2, 3]]

final edit: This solution is wrong, see the poster's answer. I am leaving this post because it was mentioned a couple of times.

Here it is with a for loop, not sure how to get it done without one:

results = [images[i] for i in range(len(images)) if sum(sm[i][i:]) == 1]

edit:

Here is a corrected solution; it does essentially the same thing as @Sergey's solution, but in a different way:

def put_zeros_to_image_with_most_similarities(arr: np.ndarray):
    # Zero out the row/column of the image with the most similarities
    index = np.sum(arr, axis=1).argmax()
    if np.sum(arr[index], axis=0) == 1:
        return
    arr[index] = 0
    arr[:, index] = 0

for _ in sm:
    put_zeros_to_image_with_most_similarities(sm)

results = [images[i] for i in range(len(images)) if sum(sm[i][i:]) == 1]
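Run on the question's first example (a self-contained rerun of the code above; note the final edit at the top of this post -- the greedy deletion is not guaranteed to find a maximum set), this yields a valid size-3 answer:

```python
import numpy as np

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

def put_zeros_to_image_with_most_similarities(arr):
    # Zero out the row/column of the image with the most similarities
    index = np.sum(arr, axis=1).argmax()
    if np.sum(arr[index], axis=0) == 1:
        return
    arr[index] = 0
    arr[:, index] = 0

for _ in sm:
    put_zeros_to_image_with_most_similarities(sm)

results = [images[i] for i in range(len(images)) if sum(sm[i][i:]) == 1]
print(results)  # ['image_2', 'image_3', 'image_4']
```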
