
Why is my Python script that opens many image files with Pillow being killed?

I have a Python script that uses Pillow to open every png/jpg/jpeg file in a folder, copy some of its metadata (file name, file size, width, height, pixels, etc.) into a new object, and append that object to a list called imageMetaData. I then traverse that list, comparing every image to every other image, to find and delete duplicates (I have amassed a TON of duplicates: of my ~6000 images, at least 1500 may be duplicates).

With a small set of images (~1500 is the most I have processed successfully) it works fine, but when I run it on my folder of 6100 files it fails to build the imageMetaData list and prints:

zsh: killed     python3 remove-duplicates.py

I have looked into this and it seems to be running out of RAM. But my RAM should be enough to hold a list of ~6000 objects with about 8 fields each.
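That estimate holds only if every field is small. The function below stores the result of `i.load()` in each object, and that pixel-access object appears to keep the fully decoded image alive even after `i.close()`. The sketch below is a rough back-of-the-envelope calculation under assumed numbers: 4000 x 3000 photos are hypothetical (the post does not give dimensions), and 3 bytes per pixel is a lower bound for 8-bit RGB.

```python
def decoded_bytes(width, height, bytes_per_pixel=3):
    """Approximate size of one decoded image (at least 3 bytes per RGB pixel)."""
    return width * height * bytes_per_pixel

# Hypothetical numbers: 6100 photos of 4000 x 3000 pixels each.
per_image = decoded_bytes(4000, 3000)
total = per_image * 6100

print(f"one image:  {per_image / 1e6:.0f} MB")
print(f"all images: {total / 1e9:.1f} GB")
```

Even with much smaller photos, keeping thousands of decoded bitmaps in a list can plausibly exceed the RAM of a typical machine, while the compressed files on disk stay small.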

My function is below:

from PIL import Image
import os  # needed for os.stat() below
from os import listdir

mypath = 'my-path-to-folder/remove-dupes/'
initialLocation = 'my-folder-of-photos'
directoryList = listdir(mypath + initialLocation)

def loadObjects():
    myObjects = []
    if len(directoryList) > 1:
        for x in range(len(directoryList)):
            if ('jp' in directoryList[x].lower() or 'png' in directoryList[x].lower()):
                i = Image.open(mypath + initialLocation + '/' + directoryList[x])
                width, height = i.size
                pixels = i.load()
                i.close()
                myObjects.append({
                    'name': directoryList[x],
                    'width': width,
                    'height': height,
                    'pixels': pixels,
                    'size': os.stat(mypath + initialLocation + '/' + directoryList[x]).st_size,
                    'biggest': directoryList[x],
                    'index': x
                })
    return myObjects

As can be seen, the image is opened, loaded, and closed (correctly?), so I don't think I am leaving anything hanging. Any ideas why this is being killed, or how to get more details about why it was killed?

While it wasn't too pretty, I used Noah's suggestion and stopped pushing the entire pixel array of each image into my new object; instead I select 20 pixels based on the height and width of the image. Unfortunately, hash(pixels) seemed to return a unique value even when the pixel arrays should have been the same. I would need to look into why before cutting out this logic and using hash values.
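One way to get more detail is to log the process's peak memory while the list is being built. A bare `killed` message from the shell usually means the process received SIGKILL, which is commonly (though not always) the operating system reacting to memory exhaustion. The sketch below uses the standard-library `resource` module, which is available on Unix-like systems only; the helper name `peak_memory_mb` and the loop standing in for the file loop are hypothetical.

```python
import resource
import sys

def peak_memory_mb():
    """Peak resident set size of this process so far, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

# Hypothetical stand-in for the loading loop: report every 100 files
# to see how quickly memory grows before the process is killed.
for count in range(1, 301):
    if count % 100 == 0:
        print(f"{count} files loaded, peak memory {peak_memory_mb():.0f} MB")
```

If the reported number climbs steadily with each batch of files, the stored objects are holding on to more than a few small fields.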

def getRandomPixels(pixels, width, height):
    # Samples 20 fixed positions (not actually random) scaled to the image size.
    # Integer division (//) keeps every coordinate an int in Python 3.
    randomPixels = []
    randomPixels.append(pixels[0,0])
    randomPixels.append(pixels[width - 1, height - 1])
    randomPixels.append(pixels[0, height - 1])
    randomPixels.append(pixels[width - 1, 0])
    randomPixels.append(pixels[width//2, height//2])
    randomPixels.append(pixels[0, height//2])
    randomPixels.append(pixels[width//2, 0])
    randomPixels.append(pixels[width - 1, height//2])
    randomPixels.append(pixels[width//2, height - 1])
    randomPixels.append(pixels[width//4, height//4])
    randomPixels.append(pixels[0, height//4])
    randomPixels.append(pixels[width//4, 0])
    randomPixels.append(pixels[width - 1, height//4])
    randomPixels.append(pixels[width//4, height - 1])
    randomPixels.append(pixels[width//8, height//8])
    randomPixels.append(pixels[0, height//8])
    randomPixels.append(pixels[width//8, 0])
    randomPixels.append(pixels[width - 1, height//8])
    randomPixels.append(pixels[width//8, height - 1])
    randomPixels.append(pixels[width//2, height//8])
    return randomPixels

I chose the pixels this way because the images in the folder can have different dimensions; this ensures that a 1024 x 2048 picture is always compared with another picture of the same dimensions at the same pixel positions. It also samples parts of the image that are far apart and therefore unlikely to match unless the picture is a duplicate.
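The same idea can be written more compactly by generating the coordinates from a list of fractions. This is a variation rather than the exact 20 positions above: it samples a 5 x 5 grid (25 points), and the function name `sample_points` is hypothetical.

```python
def sample_points(width, height):
    """Return fixed sample coordinates, scaled to the image size."""
    # Fractions of the width and height; 1 is clamped to the last row or column.
    fractions = [0, 1 / 8, 1 / 4, 1 / 2, 1]
    xs = [min(int(width * f), width - 1) for f in fractions]
    ys = [min(int(height * f), height - 1) for f in fractions]
    return [(x, y) for x in xs for y in ys]

points = sample_points(1024, 2048)
print(len(points), points[0], points[-1])
```

With a Pillow pixel-access object, the signature of an image would then be `[pixels[x, y] for x, y in sample_points(width, height)]`. Two images with the same dimensions always yield the same coordinates, so their signatures are directly comparable.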

With this new method I was able to load 4975 images in 618 seconds, and checking for duplicates then took only 61 seconds, because each comparison now runs over 20 sampled pixels per image instead of the full height x width pixel array. Thanks Noah!
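On the `hash(pixels)` observation: the most likely reason is that the pixel-access object does not define a content-based hash, so Python falls back to an identity-based one that differs for every object. A content hash over the raw pixel bytes avoids this, and grouping by that hash replaces the every-image-against-every-other comparison with a single pass. The sketch below is one possible approach, not the method used in the post; the function names and sample records are hypothetical, and with Pillow the bytes could come from `img.tobytes()`.

```python
import hashlib
from collections import defaultdict

def content_digest(width, height, raw_bytes):
    """Hash the image dimensions together with its raw pixel bytes."""
    h = hashlib.sha256()
    h.update(f"{width}x{height}".encode("ascii"))
    h.update(raw_bytes)
    return h.hexdigest()

def group_duplicates(records):
    """Group file names by digest; groups with more than one name are duplicates."""
    groups = defaultdict(list)
    for record in records:
        groups[record["digest"]].append(record["name"])
    return [names for names in groups.values() if len(names) > 1]

# Hypothetical records standing in for real images.
records = [
    {"name": "a.png", "digest": content_digest(2, 1, b"\x00\x01\x02\x03\x04\x05")},
    {"name": "b.png", "digest": content_digest(2, 1, b"\x00\x01\x02\x03\x04\x05")},
    {"name": "c.png", "digest": content_digest(2, 1, b"\xff\x01\x02\x03\x04\x05")},
]
print(group_duplicates(records))  # [['a.png', 'b.png']]
```

Because only a short digest string is kept per file, memory use stays small no matter how large the images are. Note that this finds only exact pixel-for-pixel matches, which is stricter than comparing 20 sampled pixels.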
