
Open and Parse Dynamic XFA (XML Form Architecture) PDF with Python

I would like to extract text or any other data from this PDF with Python, but nothing I have tried has worked.

I have tried a variety of approaches, starting with PyPDF2:

# importing required modules
from PyPDF2 import PdfReader

# open the file in binary mode; the with-block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # creating a pdf reader object (PdfFileReader was removed in PyPDF2 3.0)
    pdf_reader = PdfReader(pdf_file)

    # printing number of pages in pdf file
    print(len(pdf_reader.pages))

    # creating a page object
    page = pdf_reader.pages[0]

    # extracting text from page
    print(page.extract_text())

I receive this:

If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download . For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader .
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the US and other countries.

I have also tried pdfrw:

from pdfrw import PdfReader
pdf = PdfReader("example.pdf")

I receive this:

[ERROR] uncompress.py:80 Error -3 while decompressing data: incorrect header check (111, 0)
[ERROR] uncompress.py:80 Error -3 while decompressing data: incorrect header check (110, 0)
[ERROR] uncompress.py:80 Error -3 while decompressing data: incorrect header check (109, 0)
[ERROR] uncompress.py:80 Error -3 while decompressing data: incorrect header check (108, 0)
[ERROR] uncompress.py:80 Error -3 while decompressing data: incorrect header check (112, 0)
[ERROR] uncompress.py:80 Error -3 while decompressing data: incorrect header check (113, 0)

Selenium WebDriver is an option if the browser can render the PDF. Open the PDF in the browser and inspect the rendered page as HTML to work out the XPath of the elements you are interested in.
This answer uses a publicly available XFA PDF.

from selenium import webdriver
import time
from lxml import html

browser = webdriver.Firefox()
#html_file = "https://raw.githubusercontent.com/itext/i7js-examples/develop/src/main/resources/pdfs/xfa_invoice_example.pdf"
html_file = "file:///home/lmc/tmp/xfa_invoice_example.pdf"
browser.get(html_file)

try:
    time.sleep(10)
    pageSource = browser.page_source
    doc = html.fromstring(pageSource)

    results = doc.xpath('//*[@data-element-id="subform1184"]//div[@class="xfaRich"]/span/text()')
    for text in results:
        print(text)
finally:
    browser.quit()

Result

Through arcane incantations and blakc magics, your HTML and CSS will be transformed into mesmerizing pdfs
iText7 pdfHTML
Additional Order
Remove Last order
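
The fixed time.sleep(10) in the script above always waits the full ten seconds, and still fails if rendering takes longer. A minimal sketch of polling instead is shown here. Note that wait_until is a hypothetical helper written for illustration, not part of Selenium; Selenium ships its own WebDriverWait class for the same purpose, which may be the better choice in practice.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Hypothetical helper, not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Possible use with the script above (assumes `browser` and `html` exist):
# xpath = '//*[@data-element-id="subform1184"]//div[@class="xfaRich"]/span/text()'
# results = wait_until(
#     lambda: html.fromstring(browser.page_source).xpath(xpath),
#     timeout=30,
# )
```

An empty XPath result list is falsy, so the loop keeps polling until the viewer has rendered at least one matching element.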

If you try pdfminer.six ( https://pdfminersix.readthedocs.io/en/latest/index.html ), you will find that text extraction is not permitted for the PDF you shared ("PERMIT MADE OUTSIDE OF CANADA"), and that it also contains JavaScript:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("example.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())
 

Output:

The PDF <_io.BufferedReader name='example.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
Please wait...
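
The "Please wait..." text is the placeholder page that dynamic XFA PDFs show to viewers that cannot render the form; the actual form lives in XML streams referenced from the XFA entry of the AcroForm dictionary. A rough way to check whether a file is of this kind is to look for an /XFA key in the raw bytes. This is only a heuristic sketch, and looks_like_xfa is a hypothetical name: it misses files where the AcroForm dictionary is stored inside a compressed object stream.

```python
import re


def looks_like_xfa(pdf_bytes: bytes) -> bool:
    """Heuristic: True if an uncompressed /XFA key appears in the raw bytes.

    The lookahead requires whitespace or a PDF delimiter after the name,
    so a longer name such as /XFAExtra does not match.
    """
    return re.search(rb"/XFA(?=[\s\[<(/])", pdf_bytes) is not None


# Possible use (assumes example.pdf is in the working directory):
# with open("example.pdf", "rb") as f:
#     print(looks_like_xfa(f.read()))
```

A False result does not prove the file has no XFA form, only that no uncompressed key was found.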

However, you can dump the PDF's objects as XML with the dumppdf.py command-line tool that ships with pdfminer.six, if that helps: dumppdf.py -a example.pdf > PDF_TEXT.xml

Output:

<?xml version="1.0"?>
<pdf>
<object id="63">
  <dict size="12">
    <key>AcroForm</key>
    <value>
      <ref id="71" />
    </value>
    <key>DSS</key>
    <value>
      <ref id="129" />
    </value>
    <key>Extensions</key>
    <value>
      <dict size="1">
        <key>ADBE</key> ...
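
In an XFA form, the values a user filled in are normally stored in the "datasets" packet, which is ordinary XML. Once that packet has been saved as text, the standard library can walk it. The sketch below uses a made-up sample: the element names under xfa:data (form1, applicant, and so on) are assumptions for illustration and will differ in a real form, and leaf_values is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical datasets packet; real element names depend on the form.
SAMPLE = """<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <applicant>
        <name>Jane Doe</name>
        <country>Canada</country>
      </applicant>
      <permitNumber>12345</permitNumber>
    </form1>
  </xfa:data>
</xfa:datasets>"""


def leaf_values(xml_text):
    """Return {path: text} for every leaf element with non-empty text."""
    root = ET.fromstring(xml_text)
    values = {}

    def walk(node, path):
        # Drop the '{namespace}' prefix ElementTree adds to qualified tags.
        tag = node.tag.rsplit("}", 1)[-1]
        here = f"{path}/{tag}" if path else tag
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                values[here] = text
        for child in children:
            walk(child, here)

    walk(root, "")
    return values


for path, value in leaf_values(SAMPLE).items():
    print(path, "=", value)
```

Repeated sibling elements share the same path here, so later values overwrite earlier ones; a form with repeating rows would need the path to include an index.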
