Extract images from a PDF with PyMuPDF in the right order

Question

I am currently working on a Python 3.x image extractor for PDF files and can't find a solution to a problem I have been facing throughout. My intention is to extract all the images from PDF files (vehicle reports), excluding the logos of the company that provides these documents. So far I have working code using fitz that finds the images and stores them (I found this code on the internet). Unfortunately, they are returned in the wrong order. To annotate the pictures with their headings, they have to be saved in the order in which they appear in the PDF.

I already tried to get this right by using the object names defined in the xref string (the string defining an object in the PDF) in ascending order. Before that version I annotated the pictures with a counter through a dict (which I know is unsorted, but I fixed that by sorting the keys), but about 2-4 of approximately 30 images still ended up out of order. Additionally, this code doesn't seem like a good solution to me because I 'fake' the image number by annotating a counter.

My current version (xref Name):

import fitz
import sys
import re

checkXO = r"/Type(?= */XObject)"       # finds "/Type/XObject"   
checkIM = r"/Subtype(?= */Image)"      # finds "/Subtype/Image"
doc = fitz.open(fr"{pdfpath}")

lenXREF = doc._getXrefLength()         # number of objects 
pixmaps = {}
imgcount=0
count=0
imglist=[]
for i in range(1, lenXREF):            # scan through all objects
    text = doc._getXrefString(i)     # string defining the object

    isXObject = re.search(checkXO, text)    # tests for XObject
    isImage   = re.search(checkIM, text)    # tests for Image
    if not isXObject or not isImage:   # not an image object if not both True
        continue
    count+=1
    pix = fitz.Pixmap(doc, i)          # make pixmap from image
    name = re.search(r'Name \W(Im\d+)', text)   # resource name such as "Im12"
    key = name.group(1) if name else count      # fall back to the running counter
    imglist.append(key)
    pixmaps[key] = pix
imglist1=[]
for i in range(1,doc.pageCount):
    if len(doc.getPageImageList(i))>1:
        for entry in doc.getPageImageList(i):
            imglist1.append(entry[7])
        break
for entry in imglist1:    
    pixmaps[entry].writeImage(fr"{dirpath}\%s.jpg" % (imgcount),'jpg')        
    imgcount+=1

Feel free to also suggest a completely new way to work on this task. Thanks in advance for your help.

Answer 1

Answer from repo maintainer:

In newer PyMuPDF versions (best use v1.17.0) you can get an image's position on the page. This seems to be your intention when you talk of the "right order": rect = page.getImageBbox(name), where name is your entry[7] above.
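
To illustrate the idea, here is a minimal sketch that collects each image's bounding box and sorts top-to-bottom, then left-to-right. It assumes the camelCase method names of the v1.17-era API used in this thread (getImageList, getImageBbox, writePNG); later PyMuPDF releases renamed these (for example get_images and get_image_bbox), so adjust to your installed version. The helper names reading_order and save_images_in_page_order are made up for this example, and the plain (y0, x0) sort key is an assumption about what "right order" means for your reports.

```python
import os


def reading_order(placements):
    """Sort (label, (x0, y0, x1, y1)) pairs top-to-bottom, then left-to-right.

    PyMuPDF rectangles use a top-left origin, so a smaller y0 is higher up.
    """
    return sorted(placements, key=lambda p: (p[1][1], p[1][0]))


def save_images_in_page_order(pdf_path, out_dir):
    """Write each page's images as PNG files numbered in visual order.

    Hypothetical helper; assumes PyMuPDF v1.17-style method names.
    """
    import fitz  # PyMuPDF, imported here so reading_order() has no dependency

    doc = fitz.open(pdf_path)
    counter = 0
    for page in doc:
        placements = []
        for item in page.getImageList(full=True):
            rect = page.getImageBbox(item[7])  # item[7] is the image name
            placements.append((item[0], (rect.x0, rect.y0, rect.x1, rect.y1)))
        for xref, _bbox in reading_order(placements):  # item[0] is the xref
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB before PNG output
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.writePNG(os.path.join(out_dir, "%03d.png" % counter))
            counter += 1
    return counter
```

Images that sit side by side with slightly different top edges may still swap places under this key; grouping boxes into rows with a small tolerance before sorting by x0 would be one way to handle that. An image that is placed on several pages is written once per page here.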

Answer 2

Use sorted() on the image list. If you can use a different version, refer to https://stackoverflow.com/a/68267356/7240889
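
The answer does not say which sort key to use. One possible reading, sketched below, is to sort the image names from the question (Im1, Im2, ...) by their numeric part, since a plain string sort puts Im10 before Im2. The natural_key helper is a made-up name for this example. Note that the numbering of resource names is not guaranteed to follow the layout on the page, so this may not fix every case.

```python
import re


def natural_key(name):
    """Split 'Im10' into ['Im', 10, ''] so digit runs compare as integers."""
    parts = re.split(r"(\d+)", str(name))
    return [int(part) if part.isdigit() else part for part in parts]


names = ["Im10", "Im2", "Im1"]
print(sorted(names))                   # ['Im1', 'Im10', 'Im2']
print(sorted(names, key=natural_key))  # ['Im1', 'Im2', 'Im10']
```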


solution1
3 2020-06-11 19:50:07

solution2
0 2021-07-06 08:59:45
