
Extracting images from pdf using Python

How can we extract images (only images) from a PDF?

I have tried many online tools, but none of them is universal. For most PDFs they take a screenshot of the whole page instead of extracting the actual image. PDF link -> sg.inflibnet.ac.in:8080/jspui/bitstream/10603/121661/9/09_chapter 4.pdf

Here's a solution with PyMuPDF:

#!python3.6
import fitz  # PyMuPDF


def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.pageCount):
        for image in doc.getPageImageList(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.writePNG(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')
write_pixmaps_to_pngs(pixmaps)
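
Note that recent PyMuPDF releases deprecate the camelCase names used above. As a rough sketch against the newer API (method names as documented for PyMuPDF 1.18+, not verified on every version), the same extraction would look like:

import fitz  # PyMuPDF


def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.page_count):           # formerly doc.pageCount
        for image in doc.get_page_images(page_index):   # formerly doc.getPageImageList()
            xrefs.add(image[0])                         # first tuple entry is the image XREF
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


for i, pixmap in enumerate(get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')):
    pixmap.save(f'{i}.png')                             # formerly pixmap.writePNG()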

Here is some code that reads a PDF file using pyPdf, extracts the images, and yields each one as a PIL.Image. You will need to adapt it to your needs; it is just here to demonstrate how to walk the object tree.

import io
import pyPdf
import PIL.Image


def extract_images(infile_name):
    with open(infile_name, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)

            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap; we need the width/height metadata
                    # from the object to make sense of the pixel data
                    buf = vobj.getData()
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A JPEG stream; hand the raw bytes to PIL directly
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img
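
Wrapped as a generator like this, it can be consumed as any other iterator, for example (file name and output names here are just placeholders):

for i, img in enumerate(extract_images('my.pdf')):
    img.save(f'{i}.png')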

Other solutions didn't work for me, so here's my solution:

Install PyMuPDF with:

pip install pymupdf

Create and run the following script. It assumes that the PDF is stored in the pdfs directory and that the extracted images should be written to the images directory inside the current directory.

#!/usr/bin/env python3

import fitz

doc = fitz.open('pdfs/some.pdf')

# Use a dict as an insertion-ordered set to collect unique image XREFs
image_xrefs = {}

for page in doc:
    for image in page.get_images():
        image_xrefs.setdefault(image[0])  # image[0] is the image's XREF number

for index, xref in enumerate(image_xrefs):
    img = doc.extract_image(xref)  # returns a dict with the raw bytes and the file extension
    if img:
        with open(f'images/{index}.{img["ext"]}', 'wb') as image:
            image.write(img['image'])
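
One caveat: the script writes into images/ but does not create that directory. If it might be missing, an optional addition (not part of the original answer) is to create it up front:

import os
os.makedirs('images', exist_ok=True)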

Not all PDFs are simply just text and images, so in this question's case there is a hybrid, as can be seen when the area around the figure is selected. The hint is that the file says Adobe Paper Capture, so it was OCRed and not all text was captured. The OP expected the figure to be extractable from within the whole page.

"it tools the screenshot of the whole image instead of the image." “它使用整个图像而不是图像的屏幕截图。”

[Screenshot: selecting the area around the figure highlights fragments of OCR text overlaid on the scanned page, e.g. "Hsps on the cell wall surface. Dead cells were gated by staining with propidium iodide.", the panel labels "(a) Control" and "(b) Experimental", scattered gate/marker values, and the caption "Fig. 2a. Flow cytometric analysis of expression of GroEL on the surface of vegetative cells of B."]

Using any PDF image query tool, we can see that the page has more spurious entries than valid ones:

pdfimages  -list -f 12 -l 12 -verbose "09_chapter 4.pdf" -
[processing page 12]
--0000.pbm: page=12 width=2412 height=3436 hdpi=300.00 vdpi=300.00 colorspace=DeviceGray bpc=1
--0001.pbm: page=12 width=1 height=1 hdpi=0.44 vdpi=2.03 mask bpc=1
--0002.pbm: page=12 width=1 height=1 hdpi=0.53 vdpi=2.59 mask bpc=1
--0003.pbm: page=12 width=1 height=1 hdpi=0.49 vdpi=2.27 mask bpc=1

Extracting the images will therefore simply yield the scanned page plus three files that are just a 1x1-pixel dot. Thus the output looks as if only 25% was recovered, and not, as the OP expected, a source diagram/figure.
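
To spot such pages programmatically before trying to pull figures out of them, here is a minimal sketch with PyMuPDF (the page_is_full_page_scan helper and the 0.9 coverage threshold are assumptions for illustration, not part of any answer above): it flags a page when a single embedded image covers essentially the whole page area.

import fitz  # PyMuPDF


def page_is_full_page_scan(page, coverage=0.9):
    # Heuristic: treat the page as a scan if one embedded image covers most of its area
    page_area = abs(page.rect)
    for img in page.get_images():
        xref = img[0]
        for rect in page.get_image_rects(xref):
            if abs(rect) / page_area >= coverage:
                return True
    return False


doc = fitz.open('09_chapter 4.pdf')
for page in doc:
    print(page.number, page_is_full_page_scan(page))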

