Extracting images from PDF using Python
How can we extract images (only images) from a PDF?
I used many online tools, but none of them are universal. In most PDFs, the tool takes a screenshot of the whole page instead of extracting the image itself.
PDF link -> sg.inflibnet.ac.in:8080/jspui/bitstream/10603/121661/9/09_chapter 4.pdf
Here's a solution with PyMuPDF:
#!python3.6
import fitz  # PyMuPDF

def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.pageCount):
        for image in doc.getPageImageList(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps

def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.writePNG(f'{i}.png')  # Might want to come up with a better name

pixmaps = get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')
write_pixmaps_to_pngs(pixmaps)
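Note that newer PyMuPDF releases deprecated the camelCase API used above (`doc.pageCount` became `doc.page_count`, `doc.getPageImageList` became `page.get_images`, and `pixmap.writePNG` became `pixmap.save`). A sketch of the same approach against the current names; the XREF-dedup helper is pure Python so it works without a PDF at hand:

```python
def unique_image_xrefs(image_lists):
    """Collect each image XREF exactly once across all pages (pure helper)."""
    seen = set()
    ordered = []
    for images in image_lists:
        for image in images:
            xref = image[0]  # first tuple element is the XREF
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered

def extract_images(pdf_filename):
    import fitz  # imported lazily so the helper above runs without PyMuPDF
    doc = fitz.open(pdf_filename)
    # page_count / get_images / save are the current names for the
    # deprecated pageCount / getPageImageList / writePNG
    image_lists = [doc[i].get_images() for i in range(doc.page_count)]
    for i, xref in enumerate(unique_image_xrefs(image_lists)):
        fitz.Pixmap(doc, xref).save(f'{i}.png')
    doc.close()
```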
Here is some code that reads a PDF file using pyPdf, extracts images, and yields them as a PIL.Image. You will need to modify it to your needs; it's just here to demonstrate how to walk the object tree.
import io
import pyPdf
import PIL.Image

def extract_images(infile_name):
    with open(infile_name, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)
            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap
                    buf = vobj.getData()
                    # Notice that we need metadata from the object
                    # so we can make sense of the image data
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A compressed image
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img
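pyPdf itself is long unmaintained; its maintained successor pypdf (version 3.0 and later) exposes decoded embedded images directly via `page.images`, so the manual filter handling above is no longer necessary. A hedged sketch; the output-path helper is an invented naming convention, not part of the pypdf API:

```python
from pathlib import Path

def image_out_path(out_dir, page_no, image_name):
    """Build an output path like out/p3_Im0.png (hypothetical naming scheme)."""
    return Path(out_dir) / f'p{page_no}_{image_name}'

def extract_with_pypdf(pdf_path, out_dir):
    from pypdf import PdfReader  # imported lazily; install with `pip install pypdf`
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    reader = PdfReader(pdf_path)
    for page_no, page in enumerate(reader.pages):
        for image in page.images:  # pypdf decodes FlateDecode/DCTDecode itself
            image_out_path(out_dir, page_no, image.name).write_bytes(image.data)
```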
Other solutions didn't work for me, so here's my solution:
Install PyMuPDF with:
pip install pymupdf
Create and run the following script. It assumes that the PDF is stored in the pdfs directory and that the extracted images should be stored in the images directory inside the current directory.
#!/usr/bin/env python3
import fitz

doc = fitz.open('pdfs/some.pdf')

image_xrefs = {}
for page in doc:
    for image in page.get_images():
        image_xrefs.setdefault(image[0])

for index, xref in enumerate(image_xrefs):
    img = doc.extract_image(xref)
    if img:
        with open(f'images/{index}.{img["ext"]}', 'wb') as image:
            image.write(img['image'])
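The script raises FileNotFoundError if the images directory does not already exist. A minimal sketch that creates both assumed directories up front:

```python
from pathlib import Path

def ensure_dirs(*names):
    """Create the directories the script assumes exist (pdfs/ and images/)."""
    paths = [Path(name) for name in names]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)  # no-op when already present
    return paths
```

Call `ensure_dirs('pdfs', 'images')` before opening the document.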
Not all PDFs are simply just text and images, so in this question's case there is a hybrid, as seen when the area around the figure zone is selected. The hint is that the file says Adobe Paper Capture, so it was OCRed and not all text was captured. The OP expected the figure to be extractable from within the whole page:
"it takes a screenshot of the whole page instead of the image."
Selecting the figure zone yields only OCR text fragments, not an image — for example:

    Hsps on the cellw ar surface Dead cells were gated by staining with propidium iodide.
    (a) Control ... M1 76.55 / 49.94 ... M2 0.21 / 12.11 / 93.53 / 9.65
    (b) Experimental ... M1 3.49 ... M2 93.31 / 232.80 / 99.24 / 283.87
    Fig. 2a. Flow cytometric analysis of expression of GroEL on the surface of vegetative cells of B.
Using any pdfimages query tool, we see that the page has more junk entries than valid ones:
pdfimages -list -f 12 -l 12 -verbose "09_chapter 4.pdf" -
[processing page 12]
--0000.pbm: page=12 width=2412 height=3436 hdpi=300.00 vdpi=300.00 colorspace=DeviceGray bpc=1
--0001.pbm: page=12 width=1 height=1 hdpi=0.44 vdpi=2.03 mask bpc=1
--0002.pbm: page=12 width=1 height=1 hdpi=0.53 vdpi=2.59 mask bpc=1
--0003.pbm: page=12 width=1 height=1 hdpi=0.49 vdpi=2.27 mask bpc=1
Extracting the images will simply produce the scanned page plus three files that are just a 1x1-pixel dot. Thus the output looks as if only 25% was recovered, and not the source diagram/figure the OP expected.
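When extracting programmatically, such degenerate 1x1 mask entries can be filtered out by size before anything is written to disk. A sketch over (xref, width, height) tuples — a simplified view of what `page.get_images(full=True)` or the pdfimages listing reports; the xref values below are made up for illustration:

```python
def keep_real_images(image_infos, min_side=8):
    """Drop degenerate entries such as the 1x1 mask images listed above.

    image_infos: iterable of (xref, width, height) tuples.
    min_side: smallest width/height considered a real image (assumption).
    """
    return [(xref, w, h) for (xref, w, h) in image_infos
            if w >= min_side and h >= min_side]

# Hypothetical page-12 listing: the scanned page plus three 1x1 masks.
page_12 = [(10, 2412, 3436), (11, 1, 1), (12, 1, 1), (13, 1, 1)]
```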