
Remove Image Duplicates using Hashing in Python

I am cleaning an image dataset of human faces that contains duplicate images. The duplicates may not be exactly identical, but they are almost the same.

To handle this, I use average hashing: I first compute the hash value of every image and then compute the difference between the hash values of every pair of images in the directory. Images whose difference is less than 15 are considered duplicates, and only one image from each group of duplicates should remain in the cleaned dataset.
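For readers unfamiliar with the technique, here is a minimal pure-Python sketch of the idea. As far as I understand it, `imagehash.average_hash` shrinks the image to a small grayscale grid (8x8 by default) and sets one bit per pixel depending on whether that pixel is brighter than the mean, and subtracting two hashes counts the bits that differ. The helper names `average_hash_bits` and `hamming_distance` and the tiny 2x4 "images" are made up for illustration and are not part of the library.

```python
def average_hash_bits(pixels):
    """Return the average-hash bits of a grayscale image given as rows of ints."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if the pixel is brighter than the mean, else 0
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(bits_a, bits_b):
    """Count the positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Two tiny hypothetical 2x4 "images"; the second differs from the first in one pixel
image_a = [[10, 200, 30, 220], [15, 210, 25, 230]]
image_b = [[10, 200, 30, 220], [15, 210, 205, 230]]

bits_a = average_hash_bits(image_a)
bits_b = average_hash_bits(image_b)
distance = hamming_distance(bits_a, bits_b)
```

A small distance means the two images agree on almost every bit, which is why a threshold such as 15 can be used to flag near-duplicates.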

Here is the implementation.
First, we calculate the hash values for all the images and return the image_ids and their respective hash_values:

import os

import imagehash
from PIL import Image


def calculate_hash(directory):
    """Generate hash values for all images in a directory.

    Args:
        directory (str): Directory to search for images

    Returns:
        image_ids (list): List of image ids (file names) for all images in the directory
        hash_values (list): List of hash values for all images in the directory
    """

    hash_values = []
    image_ids = []

    for file in os.listdir(directory):

        path = os.path.join(directory, file)

        # Open the image only long enough to hash it, then release the file handle
        with Image.open(path) as img:
            image_hash = imagehash.average_hash(img)

        hash_values.append(image_hash)
        image_ids.append(file)

    return image_ids, hash_values

# Obtain image_ids and respective hash values
image_ids, hash_values = calculate_hash("D:/test_dir/images/test_duplicates")

Then we prepare a dataframe with the image_ids, the hash_values, and one additional column per image_id to hold the differences, each initialised to 0.

import pandas as pd


def prepare_dataframe(image_ids, hash_values):

    # Create DataFrame with hash values and image ids
    df = pd.DataFrame(
        {
            "image_ids": image_ids,
            "hash_values": hash_values,
        }
    )

    # Add one diff_<image_id> column per image, initialised to 0
    for image_id in image_ids:
        df[f"diff_{image_id}"] = 0

    return df

# Obtain dataframe
df = prepare_dataframe(image_ids, hash_values)

[image]

This is what the prepared dataframe looks like. Images 1 and 2 are completely distinct, while images 3.1, 3.2, and 3.3 are duplicates of one another (by visual inspection). The final cleaned data should contain only images 1, 2, and 3.1.

Now I calculate the hash value difference of every image_id with respect to every other image_id:

def calculate_differences(df):

    # Obtain difference for every image_id one by one
    for i in range(len(df.hash_values)):
        differences = []

        for j in range(len(df.hash_values)):
            differences.append(df.hash_values[i] - df.hash_values[j])

        # Store the difference values for every image_id
        df.iloc[i, 2:] = differences

    return df

df = calculate_differences(df)
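Because the difference between two hashes is symmetric and is 0 for an image compared with itself, each pair only needs to be computed once. The sketch below illustrates this on plain integers. `pairwise_distances` and `bit_difference` are hypothetical helpers, with `bit_difference` standing in for `hash_a - hash_b` on real hash objects.

```python
from itertools import combinations


def pairwise_distances(hashes, distance):
    """Build a symmetric matrix of distances, computing each pair only once."""
    n = len(hashes)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = distance(hashes[i], hashes[j])
        # Mirror the value across the diagonal instead of recomputing it
        matrix[i][j] = d
        matrix[j][i] = d
    return matrix


def bit_difference(a, b):
    """Hypothetical stand-in for `hash_a - hash_b`: differing bits of two ints."""
    return bin(a ^ b).count("1")


matrix = pairwise_distances([0b0000, 0b0111, 0b0110], bit_difference)
```

This roughly halves the number of hash comparisons compared with the nested loop over all (i, j) pairs.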

This gives us the following dataframe:

[image]

It is clear from the hash difference values that 3.1, 3.2, and 3.3 are duplicates. However, I cannot work out how to extract the desired output, i.e., the list unique_image_ids = [1, 2, 3.1].

I have written the following code, but it removes every image that has duplicates, i.e., 3.1 also gets removed from the final dataframe.

# For every image_id, find the column values having value < 15 more than once and delete respective rows

def remove_duplicates(df):

    for i in range(len(df.image_ids)):
        clean_df = df.drop(df[df[f"diff_{df.image_ids[i]}"] < 15].index)

    return clean_df

clean_df = remove_duplicates(df)

[image]

The desired output should also have image 3.1, but it does not appear in the dataframe.

Is there an optimized way to achieve this?

With the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "image_ids": ["1.jpg", "2.jpg", "3.1.jpg", "3.2.jpg", "3.3.jpg", "3.4.jpg"],
        "hash_values": [
            "ff547aqu1f5",
            "ff197aqu1f5",
            "ff224aqu1f5",
            "ff349aqu1f5",
            "ff447aqu1f5",
            "ff999aqu1f5",
        ],
        "diff_1.jpg": [0, 33, 28, 28, 26, 28],
        "diff_2.jpg": [33, 0, 33, 31, 31, 31],
        "diff_3.1.jpg": [28, 33, 0, 8, 6, 8],
        "diff_3.2.jpg": [28, 31, 8, 0, 4, 2],
        "diff_3.3.jpg": [26, 31, 6, 4, 0, 2],
        "diff_3.4.jpg": [28, 31, 8, 2, 2, 0],
    }
)

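You can filter out every image that has at least one near-duplicate (a difference greater than 0 and less than 15) like this. Note that this removes all members of a duplicate group, including 3.1.jpg: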

def remove_duplicates(df):
    mask = (df[df.columns[2:]] > 0) & (df[df.columns[2:]] < 15)
    return df[~(pd.DataFrame(mask).any(axis=1))].reset_index(drop=True)

print(remove_duplicates(df))
# Output
  image_ids  hash_values  diff_1.jpg  diff_2.jpg  diff_3.1.jpg  diff_3.2.jpg  diff_3.3.jpg  diff_3.4.jpg
0     1.jpg  ff547aqu1f5           0          33            28            28            26            28
1     2.jpg  ff197aqu1f5          33           0            33            31            31            31
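
Alternatively, to keep the first image of each group of near-duplicates and drop only the others:

def remove_duplicates(df, threshold=15):
    to_drop = set()
    for idx, image_id in df.image_ids.items():
        if idx in to_drop:
            # Already marked as a duplicate of an earlier image
            continue
        # Rows within the threshold of this image, excluding the image itself
        close = df.index[df[f"diff_{image_id}"] < threshold]
        to_drop.update(close.difference([idx]))

    return df.drop(index=list(to_drop))

clean_df = remove_duplicates(df)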
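Since the question asks for an optimized way, one option worth considering is to skip the difference dataframe altogether and select the images to keep directly from the hashes. The following is only a sketch under assumptions: hashes are represented as plain integers so that it runs without any image files, and `hamming` and `keep_unique` are hypothetical names. With real `imagehash` objects, a function such as `lambda a, b: a - b` could presumably be passed as `distance` instead.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def keep_unique(image_ids, hashes, threshold=15, distance=hamming):
    """Keep an image only if it is not within `threshold` of an already kept image."""
    kept_ids, kept_hashes = [], []
    for image_id, image_hash in zip(image_ids, hashes):
        if all(distance(image_hash, kept) >= threshold for kept in kept_hashes):
            kept_ids.append(image_id)
            kept_hashes.append(image_hash)
    return kept_ids


# Hypothetical 64-bit hashes: the three "3.x" values differ by only a few bits
base = ((1 << 20) - 1) << 44
image_ids = ["1.jpg", "2.jpg", "3.1.jpg", "3.2.jpg", "3.3.jpg"]
hashes = [0, (1 << 40) - 1, base, base ^ 0b111, base ^ 0b11111]

unique_image_ids = keep_unique(image_ids, hashes)
```

Each image is compared only against the images already kept, so the first member of every duplicate group survives and no per-image difference columns are needed. The result depends on the order of the input, which is a design choice of this greedy approach rather than something the original code specifies.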
