Using Python, how to extract text and images from PDF + color strings and numbers from the output txt file

Question

Using Python, I would like to

1. extract text from a PDF into a txt file (done)
2. color all numbers and specific strings in the txt file, as in this example: https://tex.stackexchange.com/questions/521383/how-to-highlight-numbers-only-outside-a-string-in-lstlisting (not done)
3. extract images from the PDF file into PNGs, or into a new PDF file containing all of the images (not done)

To perform 1, I used the following code, which works:

# Install first with: pip install PyPDF2
# Legacy PyPDF2 API (removed in PyPDF2 3.0, which uses PdfReader, reader.pages and page.extract_text())
from PyPDF2 import PdfFileReader

file_path = 'AR_Finland_2021.pdf'
pdf = PdfFileReader(file_path)

with open('AR_Finland_2021.txt', 'w') as f:
    for page_num in range(pdf.numPages):
        pageObj = pdf.getPage(page_num)

        try:
            txt = pageObj.extractText()
            print('-' * 100)  # console progress marker
        except Exception:
            # skip pages whose text cannot be extracted
            continue

        f.write('Page {0}\n'.format(page_num + 1))
        f.write('-' * 100 + '\n')
        f.write(txt)

To perform 3 (extract images) I tried the following code but always get an error.

pip install PyMuPDF Pillow python-gettext
import fitz 
import io
from PIL import Image
# file path you want to extract images from
file = "AR_Finland_2021.pdf"
# open the file
pdf_file = fitz.open(file)
# iterate over PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(page.getImageList(), start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))

Error:

----> 5     image_list = page.getImageList()
AttributeError: 'Page' object has no attribute 'getImageList'

Would someone know how to perform 3 (extract images) and 2 (color numbers and certain strings from the txt file extracted from the PDF)?

Answer 1

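The error occurs because newer PyMuPDF releases renamed the camelCase methods to snake_case: Page.getImageList() is now Page.get_images(), and Document.extractImage() is now Document.extract_image(). With the new names you can do: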

import fitz

doc = fitz.open("AR_Finland_2021.pdf")

for page in doc:
    for img_tuple in page.get_images():
        img_dict = doc.extract_image(img_tuple[0])
        img_bytes = img_dict['image']
        # Do whatever you want with it
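
One way to fill in the placeholder comment is to write each image's bytes straight to disk, using the "ext" key of the dictionary returned by extract_image() as the file extension. This is a minimal sketch rather than the answerer's own code: the helper names image_filename and save_pdf_images and the output folder "images" are assumptions made for illustration, and it expects a PyMuPDF version that provides the snake_case method names.

```python
from pathlib import Path


def image_filename(page_number, image_index, ext):
    """Build a name such as 'image3_1.png' (hypothetical naming scheme)."""
    return f"image{page_number}_{image_index}.{ext}"


def save_pdf_images(pdf_path, out_dir="images"):
    """Write every embedded image of the PDF to out_dir; return the saved paths."""
    import fitz  # PyMuPDF, imported lazily so image_filename has no dependency

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for image_index, img in enumerate(page.get_images(), start=1):
            xref = img[0]  # first tuple item is the image's cross-reference number
            info = doc.extract_image(xref)
            target = out / image_filename(page_number, image_index, info["ext"])
            target.write_bytes(info["image"])  # bytes are already encoded as info["ext"]
            saved.append(target)
    doc.close()
    return saved
```

Calling save_pdf_images("AR_Finland_2021.pdf") would then create one file per image reference. Pillow is not needed because the bytes are written unchanged; an image that is reused on several pages is saved once per page in this sketch.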

See Page.get_images() and Document.extract_image()
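
For step 2 (coloring numbers and specific strings), note that a plain txt file cannot store colors, so the highlighted output needs a format that can, such as ANSI escape codes for a terminal or HTML. The following is a minimal sketch assuming ANSI output; the function name highlight, the color choices, and the number pattern are all assumptions for illustration and would need adjusting to the actual report text.

```python
import re

# ANSI escape codes understood by most terminals
RED = "\033[31m"
BLUE = "\033[34m"
RESET = "\033[0m"

# Digits with optional thousands/decimal separators, e.g. 2021 or 1,234.5
NUMBER = r"\d+(?:[.,]\d+)*"


def highlight(text, keywords=()):
    """Wrap numbers in red and the given keywords in blue using ANSI codes."""
    parts = [f"(?P<num>{NUMBER})"]
    if keywords:
        # Longest keywords first so that overlapping keywords match greedily
        words = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
        parts.insert(0, f"(?P<kw>{words})")
    pattern = re.compile("|".join(parts))

    def colorize(match):
        color = BLUE if match.lastgroup == "kw" else RED
        return f"{color}{match.group(0)}{RESET}"

    return pattern.sub(colorize, text)
```

Printing highlight(open('AR_Finland_2021.txt').read(), keywords=["Finland"]) would show the colored text in a terminal. The pattern also matches digits inside words (for example the 2 in "H2O"); adding word boundaries, or emitting HTML span tags instead of escape codes, are possible variations.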
