How to find images by similarity, inside a SQL query?

Question

I want to process images to establish, say, 9 areas inside each one, find the average color of each area, and then save the result to a char field like this:

255,255,255,255,255,255,107,195,305

Then to find all the images similar to a given image, I have to calculate the distance between each pair of colors (comparing against the same areas), eg:

The difference between these images is 1:

255,255,255,255,255,255,107,195,305
255,255,255,255,255,255,107,195,304

The difference between these images is 3:

255,255,255,255,255,255,105,195,305
255,255,255,255,255,255,107,195,304

My problem is how to perform such a query and order the results by similarity. The field is just a string, with values separated by commas.

Is it possible that a query like this could be fast? Or should I look for a different approach? We are talking about thousands of images.

Edit: As @therealsix suggested, one option could be to put each average color value into a separate column.

Answer 1

A more "SQLey" way to do this might be to use a more normalized database design, with two tables:

Image(ImageID int, ... other columns as required ...)
ImageZone(ImageID int, ZoneIndex int, ColourValue int, ...)

So for your example, you might have:

ImageID   ZoneIndex   ColourValue
-------   ---------   -----------
  1          1          255
  1          2          255
  ...
  1          9          305
  2          1          255
  ...
  2          9          304
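
As a minimal sketch, the two tables could be declared like this in MySQL syntax. The ImageFile column, the key choices, and the file names are assumptions for illustration only, not part of the original design:

```sql
-- Hypothetical DDL for the two-table design; ImageFile and the keys are assumptions.
CREATE TABLE Image (
  ImageID   INT PRIMARY KEY,
  ImageFile VARCHAR(255)          -- assumed column, used by the query that follows
);

CREATE TABLE ImageZone (
  ImageID     INT NOT NULL,
  ZoneIndex   INT NOT NULL,       -- 1..9 when each image is split into 9 areas
  ColourValue INT NOT NULL,       -- average colour of that zone
  PRIMARY KEY (ImageID, ZoneIndex),
  FOREIGN KEY (ImageID) REFERENCES Image (ImageID)
);

-- The first two zones of image 1 from the example table (file name is made up).
INSERT INTO Image (ImageID, ImageFile) VALUES (1, 'a.jpg');
INSERT INTO ImageZone (ImageID, ZoneIndex, ColourValue) VALUES (1, 1, 255), (1, 2, 255);
```

The composite primary key keeps each image to one row per zone, which the zone-by-zone join relies on.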

Then, to get the distance, something like (I'm a SQL Server guy, but this should be readily translatable to MySQL):

 SELECT
    Candidate.ImageID,
    Candidate.ImageFile, /* or whatever... */
    Scores.Difference
 FROM
 (
   /* one row per (original, candidate) pair, with the summed zone differences */
   SELECT
      Original.ImageID AS OriginalID,
      Candidate.ImageID AS CandidateID,
      SUM(ABS(Original.ColourValue - Candidate.ColourValue)) AS Difference
   FROM ImageZone AS Original
   INNER JOIN ImageZone AS Candidate
     ON Original.ImageID <> Candidate.ImageID
    AND Original.ZoneIndex = Candidate.ZoneIndex
   GROUP BY Original.ImageID, Candidate.ImageID
 ) AS Scores
 INNER JOIN Image AS Candidate ON (Scores.CandidateID = Candidate.ImageID)
 WHERE Scores.OriginalID = 1 /* or the image ID you want to look up */
 ORDER BY Scores.Difference

So the inner query creates a row for every candidate zone, for example (O = original, C = candidate):

 O.ImageID  O.ZoneIndex  O.ColourValue  C.ImageID  C.ZoneIndex  C.ColourValue
 ---------  -----------  -------------  ---------  -----------  -------------
    1           1           255            2            1           255
    ... then ...
    1           9           305            2            9           304
    1           1           255            3            1            99
    ... then ...
    99          9           100           98            9            99

which are then aggregated into total differences:

 OriginalID  CandidateID  Difference
 ----------  -----------  ----------
    1            2            1
    1            3           10
    ...
    99          98          500

You then select from this virtual table only the rows where OriginalID is 1, and join them back onto the Image table to get whatever details you need for the candidate with the lowest 'difference' score (in this case, image 2).

This is IMHO a much cleaner DB design (and perfectly suitable if you later use more zones, etc).
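
If only one image is ever looked up at a time, a variant that filters on the original image before aggregating avoids scoring every pair. This is a sketch under the same schema (the ImageFile column and the limit of 10 are illustrative assumptions), not a benchmarked query:

```sql
-- Score every other image against image 1 only, closest first.
SELECT
    Candidate.ImageID,
    Img.ImageFile,                                             -- assumed column
    SUM(ABS(Original.ColourValue - Candidate.ColourValue)) AS Difference
FROM ImageZone AS Original
INNER JOIN ImageZone AS Candidate
    ON  Candidate.ZoneIndex = Original.ZoneIndex
    AND Candidate.ImageID  <> Original.ImageID
INNER JOIN Image AS Img
    ON Img.ImageID = Candidate.ImageID
WHERE Original.ImageID = 1                                     -- the image to look up
GROUP BY Candidate.ImageID, Img.ImageFile
ORDER BY Difference
LIMIT 10;                                                      -- MySQL syntax; SQL Server would use TOP
```

An index on ImageZone (ZoneIndex, ImageID) may help the self-join, but whether it does is something to measure on real data.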

Answer 2

Actually, from the sound of it, you are trying to do a form of sequence alignment. For that, there is a family of algorithms used to compare gene sequences:

Sequence Alignment

Answer 3

I think you are looking for something like this...

http://kodisha.net/color-names/?color=FD464A

Note the color difference on the right...

Color Difference: 1.84568861095

http://en.wikipedia.org/wiki/Color_difference

Running this query on 1000+ rows will certainly kill your server if there is a large number of simultaneous users.

Answer 4

I would suggest using MySQL functions to compare against the image you are given. First, let's create a simple example table:

DROP TABLE IF EXISTS images;

CREATE TABLE images (
  id         INTEGER AUTO_INCREMENT PRIMARY KEY,
  rgb_values VARCHAR(255)
);

Now let's define the functions we will use in our query. The first enables us to split the string on any delimiter and get the desired element back by index:

DROP FUNCTION IF EXISTS SPLIT_STR;

-- Returns element number pos (1-based) of x split on delim.
-- DETERMINISTIC is declared so the function can be created when binary logging is on.
CREATE FUNCTION SPLIT_STR(
  x VARCHAR(255),
  delim VARCHAR(12),
  pos INT
)
RETURNS VARCHAR(255)
DETERMINISTIC
RETURN
  REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
  LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
  delim, '')
;
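
A quick check of what SPLIT_STR returns, assuming the function above has been created (positions are 1-based; the input string is just an example):

```sql
SELECT SPLIT_STR('237,128,73', ',', 1) AS first_value,   -- '237'
       SPLIT_STR('237,128,73', ',', 2) AS second_value,  -- '128'
       SPLIT_STR('237,128,73', ',', 3) AS third_value;   -- '73'
```

The results are strings; MySQL converts them to numbers implicitly when they are used in arithmetic.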

Next we define a function to calculate the image difference using your algorithm (or any other algorithm you want to use):

DROP FUNCTION IF EXISTS IMAGE_DIFF;

-- Sum of absolute differences over the 9 comma-separated values.
CREATE FUNCTION IMAGE_DIFF(
  from_val VARCHAR(255),
  to_val VARCHAR(255)
)
RETURNS INTEGER
DETERMINISTIC
RETURN
  ABS((SPLIT_STR(to_val, ',', 1) - SPLIT_STR(from_val, ',',1))) +
  ABS((SPLIT_STR(to_val, ',', 2) - SPLIT_STR(from_val, ',',2))) +
  ABS((SPLIT_STR(to_val, ',', 3) - SPLIT_STR(from_val, ',',3))) +
  ABS((SPLIT_STR(to_val, ',', 4) - SPLIT_STR(from_val, ',',4))) +
  ABS((SPLIT_STR(to_val, ',', 5) - SPLIT_STR(from_val, ',',5))) +
  ABS((SPLIT_STR(to_val, ',', 6) - SPLIT_STR(from_val, ',',6))) +
  ABS((SPLIT_STR(to_val, ',', 7) - SPLIT_STR(from_val, ',',7))) +
  ABS((SPLIT_STR(to_val, ',', 8) - SPLIT_STR(from_val, ',',8))) +
  ABS((SPLIT_STR(to_val, ',', 9) - SPLIT_STR(from_val, ',',9)))
;
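
As a sanity check, the function can be run on the two pairs from the question, assuming both functions above have been created:

```sql
SELECT IMAGE_DIFF('255,255,255,255,255,255,107,195,305',
                  '255,255,255,255,255,255,107,195,304') AS pair_one,  -- 1
       IMAGE_DIFF('255,255,255,255,255,255,105,195,305',
                  '255,255,255,255,255,255,107,195,304') AS pair_two;  -- 2 + 1 = 3
```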

Let's create some sample data:

INSERT INTO images(rgb_values) VALUES ("237,128,73,69,35,249,199,183,178");
INSERT INTO images(rgb_values) VALUES ("39,212,164,170,202,49,93,77,145");
INSERT INTO images(rgb_values) VALUES ("28,242,83,167,92,161,115,38,108");
INSERT INTO images(rgb_values) VALUES ("72,81,73,2,77,109,177,204,120");
INSERT INTO images(rgb_values) VALUES ("165,149,106,248,39,26,167,237,139");
INSERT INTO images(rgb_values) VALUES ("183,40,156,131,120,19,71,88,69");
INSERT INTO images(rgb_values) VALUES ("138,136,112,36,69,245,130,196,24");
INSERT INTO images(rgb_values) VALUES ("1,194,153,107,16,102,164,154,74");
INSERT INTO images(rgb_values) VALUES ("172,161,17,179,140,244,23,219,115");
INSERT INTO images(rgb_values) VALUES ("166,151,48,62,154,227,44,21,201");
INSERT INTO images(rgb_values) VALUES ("118,73,212,180,150,64,254,177,68");
INSERT INTO images(rgb_values) VALUES ("119,220,226,254,14,175,123,11,134");
INSERT INTO images(rgb_values) VALUES ("118,93,238,31,77,36,105,151,216");
INSERT INTO images(rgb_values) VALUES ("123,108,177,136,9,24,119,175,88");
INSERT INTO images(rgb_values) VALUES ("11,207,12,215,215,80,101,213,143");
INSERT INTO images(rgb_values) VALUES ("132,158,46,188,7,245,241,126,214");
INSERT INTO images(rgb_values) VALUES ("167,238,186,86,109,164,219,199,238");
INSERT INTO images(rgb_values) VALUES ("216,93,139,246,153,39,226,152,143");
INSERT INTO images(rgb_values) VALUES ("98,229,7,203,230,224,57,154,252");
INSERT INTO images(rgb_values) VALUES ("7,95,145,120,35,6,116,240,64");
INSERT INTO images(rgb_values) VALUES ("45,194,172,223,96,168,18,4,215");
INSERT INTO images(rgb_values) VALUES ("243,161,214,235,134,190,207,63,127");
INSERT INTO images(rgb_values) VALUES ("74,189,249,85,148,169,65,3,81");
INSERT INTO images(rgb_values) VALUES ("46,113,191,20,108,139,60,249,6");
INSERT INTO images(rgb_values) VALUES ("153,246,189,175,5,125,9,197,160");
INSERT INTO images(rgb_values) VALUES ("202,248,23,59,81,175,197,180,114");
INSERT INTO images(rgb_values) VALUES ("73,136,252,137,222,197,118,64,69");
INSERT INTO images(rgb_values) VALUES ("172,224,251,32,154,175,201,33,14");
INSERT INTO images(rgb_values) VALUES ("141,126,112,12,45,214,243,127,49");
INSERT INTO images(rgb_values) VALUES ("116,155,23,205,62,235,111,136,205");

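Then run a query using our newly defined function against the image you want to compare with. Note that ORDER BY 2 DESC lists the most different images first, so the closest matches are at the bottom; use ASC to reverse that: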

mysql> SELECT id
    ->      , image_diff(rgb_values, '255,191,234,123,85,23,255,255,255') rgb_diff
    ->   FROM images
    ->  ORDER BY 2 DESC;
+----+----------+
| id | rgb_diff |
+----+----------+
| 19 |     1150 |
| 10 |     1148 |
|  3 |     1122 |
| 27 |     1094 |
|  9 |     1070 |
| 15 |     1069 |
| 23 |     1061 |
| 21 |     1059 |
|  7 |     1034 |
| 12 |     1024 |
| 24 |     1022 |
| 30 |     1016 |
| 29 |      989 |
| 28 |      962 |
|  2 |      947 |
|  4 |      933 |
| 16 |      893 |
|  6 |      885 |
|  8 |      875 |
| 20 |      848 |
| 25 |      835 |
| 26 |      815 |
|  1 |      777 |
| 22 |      758 |
| 14 |      745 |
| 11 |      706 |
| 18 |      683 |
|  5 |      656 |
| 13 |      645 |
| 17 |      494 |
+----+----------+
30 rows in set (0.01 sec)
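
To get the closest matches first, the same query can be sorted ascending and limited. This is a small variation on the query above, and the limit of 5 is an arbitrary choice:

```sql
-- Five closest matches first (smallest difference = most similar).
SELECT id,
       IMAGE_DIFF(rgb_values, '255,191,234,123,85,23,255,255,255') AS rgb_diff
  FROM images
 ORDER BY rgb_diff ASC
 LIMIT 5;
```

With the sample data, this should return ids 17, 13, 5, 18, and 11, which are the last five rows of the descending output.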

Answer 5

Okay, so your table images has an id and 9 separate color fields, color1 through color9:

SELECT a.id, b.id, ( ABS( a.color1 - b.color1 ) + ABS( a.color2 - b.color2 ) + ABS( a.color3 - b.color3 ) + ... ) AS difference
FROM images AS a
JOIN images AS b ON a.id > b.id
ORDER BY difference

This could be reasonably efficient; you would have to try it. As written, it scores every pair of images once.
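
Written out in full for a single reference image, the idea might look like the sketch below. The table definition is an assumption based on the description above (the column types are guesses), and the reference id 1 and the limit of 10 are only examples:

```sql
-- Assumed layout: one integer column per zone (a different table from the
-- string-based images table used in another answer).
-- Signed INT is used so that a.colorN - b.colorN is allowed to go negative.
CREATE TABLE images (
  id     INTEGER AUTO_INCREMENT PRIMARY KEY,
  color1 INT, color2 INT, color3 INT,
  color4 INT, color5 INT, color6 INT,
  color7 INT, color8 INT, color9 INT
);

-- Compare image 1 (a) against every other image (b), most similar first.
SELECT b.id,
       ABS(a.color1 - b.color1) + ABS(a.color2 - b.color2) + ABS(a.color3 - b.color3) +
       ABS(a.color4 - b.color4) + ABS(a.color5 - b.color5) + ABS(a.color6 - b.color6) +
       ABS(a.color7 - b.color7) + ABS(a.color8 - b.color8) + ABS(a.color9 - b.color9)
       AS difference
  FROM images AS a
  JOIN images AS b ON b.id <> a.id
 WHERE a.id = 1
 ORDER BY difference
 LIMIT 10;
```

Filtering on a.id means only one row of a is read, so the work grows with the number of images rather than the number of pairs.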

Answer 6

The problem seems to me not to be a sequence comparison problem but a geography one.

I think you want to find nearby points in a 9-dimensional point set.

Check this article on how spatial databases use R-trees for efficient searching of clusters (e.g. nearby points, which is exactly what you want): Incremental Distance Join Algorithms for Spatial Databases (click on the "Cached" link).

The real problem is that I know of no spatial database that supports 9 dimensions. The only hack I can think of would be to store each image as a triple (A,B,C) of three-coordinate geography points.

To make my point clearer, let's have a look at your data:

The difference between these images is 3:

255,255,255,255,255,255,105,195,305
255,255,255,255,255,255,107,195,304

We can look at the above 2 rows as 2 points (let's call them a and b) in a 9-dimensional world.

The 9 numbers are their coordinates ( a1,a2,...,a9 and b1,b2,...,b9 ).

And the "difference" is their distance: Sum(|ai-bi|). There are many ways to define distance, and this is one of the common ones. It's not Euclidean distance, but it is similar, and it's slightly faster to calculate.
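
Written out for the two example rows, where only the seventh and ninth coordinates differ, the calculation looks like this. The sum of absolute differences is commonly called the Manhattan or L1 distance; the Euclidean distance is shown only for comparison:

```latex
d_1(a,b) = \sum_{i=1}^{9} |a_i - b_i| = |105 - 107| + |305 - 304| = 2 + 1 = 3
\qquad
d_2(a,b) = \sqrt{\sum_{i=1}^{9} (a_i - b_i)^2} = \sqrt{2^2 + 1^2} = \sqrt{5} \approx 2.24
```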

Now, if you are really going to have thousands or millions of images, calculating all those millions (or trillions) of pairwise distances is going to be very slow. If you just need to compare one image at a time against a few thousand, I think you already have two answers that would be OK.

But if you really want to find similar images and have a lot (like hundreds of thousands) of them stored, R-trees or some other index used by spatial databases will be better. An R-tree is nothing magic; it's just an index specialized for this kind of multidimensional data.

If you can't find a spatial database that supports that many dimensions, I'm not sure how you would build a solution of your own.

One thought would be to split the 9 numbers into 3 triplets. Every triplet would be a 3-dimensional point, so each of your images would be stored as three 3D geography points. Then the difference between 2 images would be the sum of three (geographic) distances.

6 answers

solution1
3 2011-04-08 05:21:38

solution2
0 2011-04-08 03:33:20

solution3
0 2011-04-08 03:38:55

solution4
0 ACCEPTED 2011-04-08 04:11:28

solution5
0 2011-04-08 04:24:36

solution6
0 2011-04-08 09:14:49