
pdf2image: how to read PDFs that require “Enable All Features” (Windows)

I have a PDF that I would like to read in Python. When I open it on my machine with Acrobat, I get the message below, and when I click "Enable All Features", the file shows its actual content.

When I read it in Python, how can I achieve the same thing, so that Python reads the actual text rather than the text below?

"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download . For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader . Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the US and other countries. "

My code is as below

from pdf2image import convert_from_path
from PIL import Image
import pytesseract

homepath = r'C:\Users\xxxx\\'
files = "bbbb.pdf"
PDFfilename = homepath + files

# Render every page at 500 dpi, save it as a JPEG, then OCR the saved image.
pages = convert_from_path(PDFfilename, 500)

for i, page in enumerate(pages, start=1):
    jpg_path = homepath + 'out' + str(i) + '.jpg'
    page.save(jpg_path, 'JPEG')
    text = pytesseract.image_to_string(Image.open(jpg_path))
    print(text)

The "Please wait..." page you see is the only actual PDF-style content of your PDF (i.e. a PDF page object with a content stream, resources, etc.).

What you see after enabling all features is the content of an XFA form contained in the PDF.

XFA (also known as XFA forms) stands for XML Forms Architecture, a family of proprietary XML specifications that was suggested and developed by JetForm to enhance the processing of web forms. It can be also used in PDF files starting with the PDF 1.5 specification. The XFA specification is referenced as an external specification necessary for full application of the ISO 32000-1 specification (PDF 1.7). The XML Forms Architecture was not standardized as an ISO standard, and has been deprecated in PDF 2.0.

(Wikipedia on XFA)

Most PDF processors do not handle XFA content. In particular, most free or open-source PDF libraries don't.

What you can do, though, as long as your PDF library allows direct access to low-level PDF objects, is retrieve the XFA XML and analyze it as an XML stream.

It is located in the Catalog -> AcroForm -> XFA object:

The XFA entry shall be either a stream containing the entire XFA resource or an array specifying individual packets that together make up the entire XFA resource. [...]

A packet is a pair of string and stream. The string contains the name of the XML element and the stream contains the complete text of the XML element.

(ISO 32000-1 section 12.7.8 XFA Forms)
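
The lookup described above could be sketched roughly as follows. This is a minimal illustration, not a tested recipe: it assumes pikepdf as the low-level library (the answer does not name one), the helper names read_xfa_packets and leaf_values are hypothetical, and it assumes the filled-in values live in a packet named "datasets", which is the usual XFA convention but may not hold for every file.

```python
import xml.etree.ElementTree as ET


def read_xfa_packets(pdf_path):
    """Return the XFA resource as a list of (packet_name, xml_bytes) pairs."""
    # Assumption: pikepdf is installed. It is imported lazily so that
    # leaf_values() below stays usable without it.
    import pikepdf

    with pikepdf.open(pdf_path) as pdf:
        acroform = pdf.Root.get("/AcroForm")
        if acroform is None or "/XFA" not in acroform:
            return []  # no XFA form in this document
        xfa = acroform["/XFA"]
        if isinstance(xfa, pikepdf.Array):
            # Array form: alternating packet names and packet streams.
            items = list(xfa)
            return [(str(items[i]), items[i + 1].read_bytes())
                    for i in range(0, len(items) - 1, 2)]
        # Stream form: a single stream holding the entire XFA resource.
        return [("xdp", xfa.read_bytes())]


def leaf_values(xml_bytes):
    """Map each non-empty leaf element's tag (namespace stripped) to its text."""
    values = {}
    for elem in ET.fromstring(xml_bytes).iter():
        if len(elem) == 0 and elem.text and elem.text.strip():
            values[elem.tag.split("}")[-1]] = elem.text.strip()
    return values
```

To use it, one might build dict(read_xfa_packets(PDFfilename)) and, if a "datasets" key is present, pass its bytes to leaf_values. Repeated field names overwrite each other in this sketch, so a real form with repeating sections would need a richer structure than a flat dict.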

I am not very familiar with pdf2image, but I'm relatively familiar with pikepdf. All you have to do is open the file with it and save it as another file. Here is a snippet:

import pikepdf

pdf = pikepdf.open('mypdf.pdf')
pdf.save('my_good_pdf.pdf')

That should fix it: when you open my_good_pdf.pdf, it will be totally fine.

Try pdfminer.six (https://github.com/pdfminer/pdfminer.six).

With Python 3, install like this:

pip install pdfminer.six
pip install chardet

Then:

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def process_file(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)
        text = fake_file_handle.getvalue()
    # close open handles
    converter.close()
    fake_file_handle.close()
    if text:
        return text

