Quickly determining using Python whether an image is (fuzzily) in a collection

Question

Imagine that some new image X arrives, and I want to know whether X is new or has already been encountered before. I have code, below, that shrinks the image and then converts it to a hash code. I can then see via a single hash look-up whether I've already encountered an image with the same hash code, so it's very fast.

My question is: is there an efficient way for me to see whether a similar image, but one with a different hash code, has already been seen? I was going to title this question something like "Data structure for determining efficiently whether a similar, non-identical item is already contained" but decided that would be an instance of the XY problem.

When I say that this new image is "similar," I'm thinking of one that's perhaps gone through lossy compression and so looks like the original to the human eye but is not identical. Normally shrinking the image eliminates the difference, but not always, and if I shrink the image too much I start getting false positives.

Here's my current code:

from PIL import Image  # a bare "import PIL" does not load the Image submodule

seen_images = {}  # This would really be a shelf or something

# From http://www.guguncube.com/1656/python-image-similarity-comparison-using-several-techniques
def image_pixel_hash_code(image):
    # One bit per pixel: '1' if the pixel is darker than the average, else '0'
    pixels = list(image.getdata())
    avg = sum(pixels) / len(pixels)
    bits = "".join('1' if pixel < avg else '0' for pixel in pixels)  # '00010100...'
    return format(int(bits, 2), '016x').upper()

def process_image(filepath):
    # Shrink to a fixed size and convert to grayscale before hashing
    thumb = Image.open(filepath).resize((128, 128)).convert("L")
    code = image_pixel_hash_code(thumb)
    previous_image = seen_images.get(code)
    if previous_image is not None:
        print("'{}' already seen as '{}'".format(filepath, previous_image))
    else:
        seen_images[code] = filepath

You can put a path to a bunch of image files into a variable called IMAGE_ROOT and then try my code out with:

import os
for root, dirs, files in os.walk(IMAGE_ROOT):
    for filename in files:
        filepath = os.path.join(root, filename)
        try:
            process_image(filepath)
        except IOError:
            pass

Answer 1

There are a lot of methods for comparing images, but for your example I suspect that simplicity and speed are the key factors (which is why you're using a hash as a first pass). Here are some suggestions; in all cases I'd suggest shrinking and cropping the image to a regular size and shape.

1. Smooth the image (Gaussian blur) before shrinking to minimise the influence of artefacts. Then apply the hash or other comparison.
2. Subtract the images from one another (RGB) and check the remainder. Identical images will return zero, while compression artefacts will result in small variations. You can threshold, sum, or average the values and compare the result to a cut-off.
3. Use standard distance algorithms (see scipy.spatial.distance) to calculate the 'distance' between the two images. For example, Euclidean distance gives effectively the same result as summing the subtraction, while cosine distance ignores intensity but matches the profile of changes over the image, i.e. a darker version of the same image will be considered equivalent. For these you will need to flatten your image to a 1D array.

The last two entail comparing every image to every other image when uploading, and that is going to get very computationally expensive for large numbers of images.
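
To make the subtraction and distance suggestions concrete, here is a minimal pure-Python sketch of what they measure. It assumes each image has already been shrunk, converted to grayscale, and flattened to a list of pixel values (for example with list(thumb.getdata())). The function names, the sample pixel lists, and the cutoff of 4.0 are all made up for illustration; in practice scipy.spatial.distance.euclidean and scipy.spatial.distance.cosine would do the distance part for you.

```python
import math

def mean_abs_difference(a, b):
    # Subtract pixel by pixel and average the absolute remainder.
    # Identical images give 0; compression artefacts give a small value.
    return sum(abs(x - y) for x, y in zip(a, b)) / float(len(a))

def euclidean_distance(a, b):
    # Square root of the summed squared per-pixel differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); uniform scaling of intensity is ignored.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # All-black image: treat two black images as identical.
        return 0.0 if norm_a == norm_b else 1.0
    return 1.0 - dot / (norm_a * norm_b)

def find_similar(pixels, seen, cutoff=4.0):
    # Linear scan over every stored image; cutoff is a hypothetical
    # value that would need tuning on real data.
    for name, other in seen.items():
        if mean_abs_difference(pixels, other) < cutoff:
            return name
    return None

# Hypothetical flattened grayscale thumbnails
original = [10, 200, 30, 180, 90, 60]
recompressed = [12, 198, 31, 181, 88, 61]   # slight artefacts
darker = [p * 0.5 for p in original]        # same image at half intensity

seen = {"a.jpg": original}
print(find_similar(recompressed, seen))     # matches the stored original
print(cosine_distance(original, darker))    # close to zero
print(euclidean_distance(original, darker)) # large
```

Note that find_similar has to visit every stored image, which is exactly the cost described above, so it only makes sense as a second pass after the exact hash look-up has missed.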

solution1
0 2015-02-21 22:21:48