
Extracting images from pdf using Python

How can we extract images (and only the images) from a PDF?

I tried many online tools, but none of them works universally. For most PDFs, they take a screenshot of the whole page instead of extracting the image. PDF link -> sg.inflibnet.ac.in:8080/jspui/bitstream/10603/121661/9/09_chapter 4.pdf

Here's a solution with PyMuPDF:

#!python3.6
import fitz  # PyMuPDF
# Note: recent PyMuPDF releases use snake_case names; older releases
# spelled these pageCount, getPageImageList and writePNG.


def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.page_count):
        for image in doc.get_page_images(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.save(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')
write_pixmaps_to_pngs(pixmaps)
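
One thing the snippet above does not cover: PNG cannot store CMYK data, so a pixmap with more than three colour components generally has to be converted to RGB before it is written. A minimal sketch of that check follows. The helper names `needs_rgb_conversion` and `save_pixmap_as_png` are made up for illustration, and `Pixmap.save` is the method name in recent PyMuPDF releases (older ones used `writePNG`).

```python
def needs_rgb_conversion(n, alpha):
    """Return True when a pixmap has more than 3 colour components (e.g. CMYK).

    `n` is the number of components per pixel including alpha, and `alpha`
    says whether an alpha channel is present (as in Pixmap.n / Pixmap.alpha).
    """
    return n - int(alpha) > 3


def save_pixmap_as_png(pixmap, path):
    """Write a PyMuPDF pixmap to `path`, converting CMYK to RGB first."""
    # Imported here so the pure check above can be used without PyMuPDF.
    import fitz  # PyMuPDF

    if needs_rgb_conversion(pixmap.n, pixmap.alpha):
        pixmap = fitz.Pixmap(fitz.csRGB, pixmap)  # colourspace conversion
    pixmap.save(path)
```

With this in place, the loop in `write_pixmaps_to_pngs` could call `save_pixmap_as_png(pixmap, f'{i}.png')` instead of writing the pixmap directly.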

Here is some code that reads a PDF file using pyPdf, extracts the images, and yields them as PIL.Image objects. You will need to adapt it to your needs; it is only here to demonstrate how to walk the object tree.

import io
import pyPdf
import PIL.Image

infile_name = 'my.pdf'


def iter_images(infile_name):
    """Yield every supported image in the PDF as a PIL.Image."""
    with open(infile_name, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)

            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap
                    buf = vobj.getData()
                    # Notice that we need metadata from the object
                    # so we can make sense of the image data
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    # 'RGB' assumes 8-bit RGB samples; other color spaces
                    # need a different mode
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A compressed (JPEG) image
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img


# Do something with each `img`...
for img in iter_images(infile_name):
    print(img.size)
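
The `/FlateDecode` branch above hard-codes the `'RGB'` mode, which only fits 8-bit RGB samples. A scanned page like the one in the question is 1-bit gray, so the mode has to come from the image's `/ColorSpace` and `/BitsPerComponent` entries. The following is a rough sketch of that mapping, not a complete implementation. The helper names `pil_mode` and `expected_length` are invented for this example, and only the plain device color spaces are handled (indexed or ICC-based color spaces simply return `None`).

```python
def pil_mode(color_space, bits_per_component):
    """Pick a PIL mode for raw PDF image samples, or None if not handled here."""
    if not isinstance(color_space, str):
        # e.g. ['/ICCBased', ...] or ['/Indexed', ...]: not covered by this sketch
        return None
    if bits_per_component == 1 and color_space == '/DeviceGray':
        return '1'
    if bits_per_component == 8:
        return {
            '/DeviceGray': 'L',
            '/DeviceRGB': 'RGB',
            '/DeviceCMYK': 'CMYK',
        }.get(color_space)
    return None


def expected_length(mode, width, height):
    """Number of bytes the decoded sample data should have for this mode."""
    if mode == '1':
        # 1-bit rows are padded to a whole number of bytes
        return ((width + 7) // 8) * height
    channels = {'L': 1, 'RGB': 3, 'CMYK': 4}[mode]
    return width * height * channels
```

Comparing `len(buf)` with `expected_length(mode, width, height)` before calling `PIL.Image.frombytes(mode, size, buf)` is a cheap way to detect images that this simple mapping does not cover.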

Other solutions didn't work for me, so here's my solution:

Install PyMuPDF with:

pip install pymupdf

Create and run the following script. It assumes that the PDF is stored in the pdfs directory and that the extracted images should be written to the images directory, both inside the current directory.

#!/usr/bin/env python3

import fitz

doc = fitz.open('pdfs/some.pdf')

image_xrefs = {}

for page in doc:
    for image in page.get_images():
        image_xrefs.setdefault(image[0])

for index, xref in enumerate(image_xrefs):
    img = doc.extract_image(xref)
    if img:
        with open(f'images/{index}.{img["ext"]}', 'wb') as image:
            image.write(img['image'])
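
Note that `open(..., 'wb')` fails if the images directory does not exist yet. A small helper along the following lines could create it on demand and also give the files more descriptive names. This is only a sketch: the function name `image_output_path` and the `<pdf stem>_xref<xref>.<ext>` naming scheme are assumptions for illustration, not part of PyMuPDF.

```python
from pathlib import Path


def image_output_path(out_dir, pdf_path, xref, ext):
    """Return out_dir/<pdf stem>_xref<xref>.<ext>, creating out_dir if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir / f"{Path(pdf_path).stem}_xref{xref}.{ext}"
```

In the script above, the output file could then be opened with `open(image_output_path('images', 'pdfs/some.pdf', xref, img['ext']), 'wb')`.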

Not all PDFs are simply text and images. The PDF in this question is a hybrid, as can be seen when the area around the figure is selected. The hint is that the file says Adobe Paper Capture, so it was OCRed, and not all of the text was captured. The OP expected the figure to be extractable from within the whole page.

"it tools the screenshot of the whole image instead of the image."


Hsps on the cellw ar surface Dead cells were gated by staining with propidium iodide.
~
(a) Control
~
cv
Ml
76.55
49.94
§
1-
M2
0.21
12.11
93.53
9.65
~
.. .,
"'
(b) Experimental
<I
Ml
3.49
100
10'
104
M2
93.31
232.80
99.24
283.87
Fig. 2a. Flow cytometric analysis of expression of GroEL on the surface of vegetative cells of B.

Using any pdfimages-style query tool, we see that the page has more junk entries than valid ones:

pdfimages  -list -f 12 -l 12 -verbose "09_chapter 4.pdf" -
[processing page 12]
--0000.pbm: page=12 width=2412 height=3436 hdpi=300.00 vdpi=300.00 colorspace=DeviceGray bpc=1
--0001.pbm: page=12 width=1 height=1 hdpi=0.44 vdpi=2.03 mask bpc=1
--0002.pbm: page=12 width=1 height=1 hdpi=0.53 vdpi=2.59 mask bpc=1
--0003.pbm: page=12 width=1 height=1 hdpi=0.49 vdpi=2.27 mask bpc=1

An image extractor will therefore simply extract the scanned page plus three files that are each just a 1x1 pixel dot. The output thus looks as if only 25% was recovered (one useful image out of four), and it is not the source diagram/figure the OP expected.
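
To skip such junk entries automatically, one option is to filter by pixel dimensions before extracting. The sketch below parses the listing lines shown above and keeps only entries above a minimum size. The regular expression is written against exactly that output format, and the `min_side=16` threshold is an arbitrary assumption, so both may need adjusting for other tools or documents.

```python
import re

# Lines copied from the pdfimages listing above.
LISTING = """\
--0000.pbm: page=12 width=2412 height=3436 hdpi=300.00 vdpi=300.00 colorspace=DeviceGray bpc=1
--0001.pbm: page=12 width=1 height=1 hdpi=0.44 vdpi=2.03 mask bpc=1
--0002.pbm: page=12 width=1 height=1 hdpi=0.53 vdpi=2.59 mask bpc=1
--0003.pbm: page=12 width=1 height=1 hdpi=0.49 vdpi=2.27 mask bpc=1
"""

ENTRY = re.compile(
    r"^--(?P<name>\S+): page=(?P<page>\d+) "
    r"width=(?P<width>\d+) height=(?P<height>\d+)"
)


def parse_listing(text):
    """Turn listing lines into dicts with name, page, width and height."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            entries.append({
                "name": m["name"],
                "page": int(m["page"]),
                "width": int(m["width"]),
                "height": int(m["height"]),
            })
    return entries


def drop_tiny(entries, min_side=16):
    """Keep only entries whose width and height are both at least min_side."""
    return [e for e in entries
            if e["width"] >= min_side and e["height"] >= min_side]


entries = parse_listing(LISTING)
kept = drop_tiny(entries)
print([e["name"] for e in kept])  # only the full-page scan remains
```

This removes the 1x1 masks, but it cannot recover the figure itself: in this PDF the figure only exists as part of the scanned page image, so it would still have to be cropped out of that page.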

