
Extract images of pdf with pymupdf in right order

I am currently working on a Python 3.x image extractor for PDF files and can't seem to find a solution for a problem I have been facing throughout my work. My intention is to extract all the images from PDF files (vehicle reports) without the logos of the company that provides these papers. So far I have working code using fitz that finds the images and stores them (I found this code on the internet). Unfortunately they are returned in the wrong order. To annotate the pictures with their headings, they have to be saved in the order in which they appear in the PDF.

I already tried to get this right by using the object names defined in the xref string (the string defining an object in the PDF) in ascending order. Before that version I annotated the pictures with a counter through a dict (which I know is unsorted, but I fixed that by sorting the keys), but about 2-4 of approximately 30 images were still out of order. Additionally, this code doesn't seem to be a good solution to me because I 'fake' the image number by annotating a counter.

My current version (xref Name):

import fitz
import sys
import re

checkXO = r"/Type(?= */XObject)"       # finds "/Type/XObject"   
checkIM = r"/Subtype(?= */Image)"      # finds "/Subtype/Image"
doc = fitz.open(fr"{pdfpath}")

lenXREF = doc._getXrefLength()         # number of objects 
pixmaps = {}
imgcount=0
count=0
imglist=[]
for i in range(1, lenXREF):            # scan through all objects
    text = doc._getXrefString(i)     # string defining the object

    isXObject = re.search(checkXO, text)    # tests for XObject
    isImage   = re.search(checkIM, text)    # tests for Image
    if not isXObject or not isImage:   # not an image object if not both True
        continue
    count+=1
    pix = fitz.Pixmap(doc, i)          # make pixmap from image
    if re.search(r'Name \WIm(\d+)',text) != None:
        imglist.append(re.search(r'Name \W(Im\d+)',text).group(1))
        pixmaps[re.search(r'Name \W(Im\d+)',text).group(1)]=pix
    if re.search(r'Name \W(Im\d+)',text) == None:
        imglist.append(count)
        pixmaps[count]=pix
imglist1=[]
for i in range(1,doc.pageCount):
    if len(doc.getPageImageList(i))>1:
        for entry in doc.getPageImageList(i):
            imglist1.append(entry[7])
        break
for entry in imglist1:    
    pixmaps[entry].writeImage(fr"{dirpath}\%s.jpg" % (imgcount),'jpg')        
    imgcount+=1  

Feel free to also suggest a completely new way to work on this task. Thanks in advance for your help.

Answer from repo maintainer:

In the newer PyMuPDF versions (best use v1.17.0) you can get an image's position on the page. This seems to be your intention when you talk of "right order": rect = page.getImageBbox(name), where name is your entry[7] above.

Use sorted() on the image list. If you are on a different version, refer to https://stackoverflow.com/a/68267356/7240889
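To illustrate the sorting step, here is a minimal sketch in plain Python. It assumes you have already collected, for every image, its page number, its name (entry[7]) and the rectangle returned by page.getImageBbox(name), reduced to an (x0, y0, x1, y1) tuple. The names PlacedImage, reading_order and row_tolerance are made up for this example and are not part of PyMuPDF. It also assumes the PyMuPDF coordinate convention in which y grows downward from the top of the page.

```python
from typing import List, NamedTuple, Tuple


class PlacedImage(NamedTuple):
    """Hypothetical record: one image plus where it sits in the document."""
    page: int                                  # 0-based page number
    name: str                                  # image name, e.g. entry[7]
    bbox: Tuple[float, float, float, float]    # (x0, y0, x1, y1) from the bbox call


def reading_order(images: List[PlacedImage],
                  row_tolerance: float = 5.0) -> List[PlacedImage]:
    """Sort images by page, then top to bottom, then left to right.

    Images whose top edges fall into the same row_tolerance-high band
    are treated as one row, so side-by-side pictures are ordered by x0.
    """
    def key(img: PlacedImage):
        x0, y0, _, _ = img.bbox
        return (img.page, int(y0 // row_tolerance), x0)

    return sorted(images, key=key)


# Example with made-up coordinates: two pictures side by side at the
# top of page 0, one further down, and one on page 1.
images = [
    PlacedImage(0, "Im3", (300, 100, 500, 250)),
    PlacedImage(0, "Im1", (50, 400, 250, 550)),
    PlacedImage(0, "Im2", (50, 102, 250, 250)),
    PlacedImage(1, "Im0", (50, 50, 100, 100)),
]
ordered_names = [img.name for img in reading_order(images)]
print(ordered_names)
```

The band-based row key is only a heuristic: two images whose top edges straddle a band boundary end up in different rows, so row_tolerance may need tuning for a given report layout. Once the list is sorted, the counter used for the output file names follows the visual order.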
