Is there any way to extract images as streams from a PDF document using the PyPDF2 library? Also, is it possible to replace some images with others (generated with PIL, for example, or loaded from a file)?
I'm able to get an EncodedStreamObject from the PDF object tree and read its decoded stream by calling the getData() method, but the result looks like raw pixel content without any image headers or other meta information.
>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')
I've looked through PyPDF2, ReportLab and PDFMiner quite a bit, but haven't found anything that does what I'm looking for.
Any code samples and links will be very helpful.
import fitz  # PyMuPDF

doc = fitz.open(filePath)
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:  # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
Image metadata is not stored within the encoded images of a PDF. If metadata is stored at all, it is stored in the PDF itself and stripped from the underlying image. The metadata you see in your example is likely all that you'll be able to get. PDF encoders may store image metadata elsewhere in the PDF, but I haven't seen this. (Note that this metadata question was also asked for Java.)
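To illustrate how the dictionary entries describe the raw data, here is a minimal sketch that checks a decoded stream's length against /Width, /Height and /BitsPerComponent. The helper names are hypothetical (they are not part of PyPDF2), and the sketch assumes the decoded data has no padding other than byte alignment at the end of each row.

```python
def expected_stream_length(width, height, components, bits_per_component=8):
    """Number of bytes in a decoded image stream whose rows are byte-aligned."""
    row_bytes = (width * components * bits_per_component + 7) // 8
    return row_bytes * height


def guess_components(data_length, width, height, bits_per_component=8):
    """Return 1 (gray), 3 (RGB) or 4 (CMYK) if one of them matches, else None."""
    for components in (1, 3, 4):
        if expected_stream_length(width, height, components,
                                  bits_per_component) == data_length:
            return components
    return None


# Values taken from the question: 100 x 30 pixels, 8 bits, 9000 bytes of data.
components = guess_components(9000, width=100, height=30)
print(components)
```

For the object in the question this suggests three components per pixel, which is consistent with an RGB image behind the /ICCBased colour space.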
It's definitely possible to extract the stream, however: as you mentioned, you call the getData() method.
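Because the decoded stream is only pixel data, you have to add the image container yourself before a viewer can open it. The following is a rough sketch, using only the standard library, that wraps raw 8-bit RGB bytes in a PNG file. The function name raw_rgb_to_png is made up for this example, and the sketch assumes three components per pixel at 8 bits each; it ignores the ICC profile and does not handle gray, CMYK or indexed colour spaces.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(tag, payload):
    """Build one PNG chunk: length, tag, payload, CRC over tag + payload."""
    body = tag + payload
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + body + struct.pack(">I", crc)


def raw_rgb_to_png(raw, width, height):
    """Wrap raw 8-bit RGB pixel data (as returned by getData()) in a PNG."""
    stride = width * 3
    if len(raw) != stride * height:
        raise ValueError("data length does not match width * height * 3")
    # PNG expects every scanline to start with a filter-type byte (0 = none).
    scanlines = b"".join(
        b"\x00" + raw[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (truecolour),
    # compression 0, filter 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (PNG_SIGNATURE
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(scanlines))
            + _chunk(b"IEND", b""))


# Stand-in for the 9000 bytes from the question (100 x 30 pixels, all 0xCC).
png_bytes = raw_rgb_to_png(b"\xcc" * 9000, width=100, height=30)
```

Writing png_bytes to a file opened in "wb" mode should give an image that ordinary viewers can open. With PIL installed, Image.frombytes("RGB", (width, height), raw) is a shorter route to the same result.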
As for replacing it, you'll need to create a new image object within the PDF, add it to the end, and update the indirect object references accordingly. This will be difficult to do with PyPDF2.
pip install PyMuPDF
import fitz  # PyMuPDF
import io
from PIL import Image

# path of the PDF you want to extract images from
file = r"File_path"
# open the file
pdf_file = fitz.open(file)
# iterate over the PDF pages
for page_index in range(pdf_file.page_count):
    # get the page itself
    page = pdf_file[page_index]
    image_li = page.get_images()
    # report the number of images found on this page
    # page_index is zero-based, hence the +1 when printing
    if image_li:
        print(f"[+] Found a total of {len(image_li)} images in page {page_index+1}")
    else:
        print(f"[!] No images found on page {page_index+1}")
    for image_index, img in enumerate(image_li, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it into PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk; passing a path lets PIL open and close the file
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")