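Using Python, I would like to:
1. extract the text from a PDF into a txt file
2. color numbers and certain strings in the txt file extracted from the PDF
3. extract the images from the PDF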
To perform 1, I used the following code, which works:
pip install PyPDF2
from PyPDF2 import PdfFileReader, PdfFileWriter

file_path = 'AR_Finland_2021.pdf'
pdf = PdfFileReader(file_path)

# the with block closes the file automatically, so no f.close() is needed
with open('AR_Finland_2021.txt', 'w') as f:
    for page_num in range(pdf.numPages):
        # print('Page: {0}'.format(page_num))
        pageObj = pdf.getPage(page_num)
        try:
            txt = pageObj.extractText()
            print(''.center(100, '-'))
        except Exception:
            # skip pages whose text cannot be extracted
            pass
        else:
            f.write('Page {0}\n'.format(page_num+1))
            f.write(''.center(100, '-'))
            f.write(txt)
To perform 3 (extract images), I tried the following code, but I always get an error:
pip install PyMuPDF Pillow
pip install PyMuPDF
pip install python-gettext
import fitz
import io
from PIL import Image

# file path you want to extract images from
file = "AR_Finland_2021.pdf"
# open the file
pdf_file = fitz.open(file)
# iterate over PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(page.getImageList(), start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))
Error:
----> 5 image_list = page.getImageList()
AttributeError: 'Page' object has no attribute 'getImageList'
Would someone know how to perform 3 (extract images) and 2 (color numbers and certain strings from the txt file extracted from the PDF)?
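Newer PyMuPDF releases use snake_case method names (get_images instead of getImageList, extract_image instead of extractImage), so you can do: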
import fitz

doc = fitz.open("AR_Finland_2021.pdf")
for page in doc:
    for img_tuple in page.get_images():
        img_dict = doc.extract_image(img_tuple[0])
        img_bytes = img_dict['image']
        # Do whatever you want with it
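If "whatever you want" means saving each image to disk, as in the question, the loop above could be extended roughly as follows. This is only a sketch: the helper names image_filename and save_pdf_images and the default "images" output folder are made up for illustration, and it assumes a PyMuPDF version that provides get_images and extract_image. The raw bytes are written directly, so Pillow is not needed.

```python
from pathlib import Path


def image_filename(page_number, image_index, ext):
    """Build a name like 'image3_1.png' (hypothetical naming scheme, 1-based)."""
    return f"image{page_number}_{image_index}.{ext}"


def save_pdf_images(pdf_path, out_dir="images"):
    """Write every embedded image of the PDF into out_dir and return the paths."""
    import fitz  # PyMuPDF (pip install PyMuPDF), imported only when needed

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for image_index, img_tuple in enumerate(page.get_images(), start=1):
            xref = img_tuple[0]  # first tuple item is the image's xref
            img_dict = doc.extract_image(xref)
            name = image_filename(page_number, image_index, img_dict["ext"])
            path = target / name
            # extract_image already returns encoded bytes (png, jpeg, ...)
            path.write_bytes(img_dict["image"])
            saved.append(path)
    doc.close()
    return saved
```

Calling save_pdf_images("AR_Finland_2021.pdf") should then leave files such as images/image1_1.png on disk, with the extension taken from whatever format each image has inside the PDF.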