
Reading a PDF form file and returning fillable field coordinates and field names

I have a PDF file that is essentially a form. I need to return the fillable places: which fields to fill, their page numbers, and the coordinates where I can place a bounding box.

I have tried various approaches to this problem, but as it turns out, working with PDFs is very difficult.

Details about the PDF file:

from pdfrw import PdfReader
pdf = PdfReader('RED-46808(Short).pdf')
print(pdf.keys())
print(pdf.Info)
print(pdf.Root.keys())
print('PDF has {} pages'.format(len(pdf.pages)))

Which returns:

['/Root', '/Info', '/ID', '/Size']
{'/CreationDate': "(D:20171003184937+08'00')", '/Creator': '(Microsoft® Word 2013)', '/ModDate': '(D:20200214163844Z)', '/Producer': '(Microsoft® Word 2013)'}
['/AcroForm', '/Lang', '/MarkInfo', '/Metadata', '/Names', '/OutputIntents', '/Pages', '/StructTreeRoot', '/Type']
PDF has 5 pages

So far, I can read the pages and fill in the form, which is hit or miss most of the time. But I don't want to fill in the form; I just need to get the coordinates of where the form should be filled in and place a bounding box at the appropriate places.

import os
import pdfrw


INVOICE_TEMPLATE_PATH = 'RED-46808(Short).pdf'
INVOICE_OUTPUT_PATH = 'output.pdf'


ANNOT_KEY = '/Annots'
ANNOT_FIELD_KEY = '/T'
ANNOT_VAL_KEY = '/V'
ANNOT_RECT_KEY = '/Rect'
SUBTYPE_KEY = '/Subtype'
WIDGET_SUBTYPE_KEY = '/Widget'


def write_fillable_pdf(input_pdf_path, output_pdf_path, data_dict):
    template_pdf = pdfrw.PdfReader(input_pdf_path)
    # Only the first page is processed; a page without annotations yields None.
    annotations = template_pdf.pages[0][ANNOT_KEY] or []
    for annotation in annotations:
        # Form fields appear as widget annotations.
        if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
            if annotation[ANNOT_FIELD_KEY]:
                # Strip the surrounding parentheses from the PDF string.
                key = annotation[ANNOT_FIELD_KEY][1:-1]
                if key in data_dict:
                    annotation.update(
                        pdfrw.PdfDict(V='{}'.format(data_dict[key]))
                    )
#     pdfrw.PdfDict(AP=data_dict[key], V=data_dict[key])
    pdfrw.PdfWriter().write(output_pdf_path, template_pdf)



data_dict = {
   'business_name_1': 'Bostata',
   'customer_name': 'company.io',
   'customer_email': 'joe@company.io',
   'invoice_number': '102394',
   'send_date': '2018-02-13',
   'due_date': '2018-03-13',
   'note_contents': 'Thank you for your business, Joe',
   'item_1': 'Data consulting services',
   'item_1_quantity': '10 hours',
   'item_1_price': '$200/hr',
   'item_1_amount': '$2000',
   'subtotal': '$2000',
   'tax': '0',
   'discounts': '0',
   'total': '$2000',
   'business_name_2': 'Bostata LLC',
   'business_email_address': 'hi@bostata.com',
   'business_phone_number': '(617) 930-4294'
}

if __name__ == '__main__':
    write_fillable_pdf(INVOICE_TEMPLATE_PATH, INVOICE_OUTPUT_PATH, data_dict)

The code above does not always return a PDF with the marked fields filled in, so it is not particularly helpful. I don't know where to go from here. I would appreciate any help, because I've exhausted almost all the resources at my disposal. I'm new to working with PDFs.
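One direction that may be worth trying with the same pdfrw keys: every widget annotation carries a /Rect entry of the form [llx, lly, urx, ury], given in PDF user-space units (points by default, with the origin at the bottom-left of the page). The sketch below only reads those values instead of writing /V. The function name collect_widget_rects and the PARENT_KEY constant are made up for illustration, and the name handling reuses the [1:-1] slicing from the code above, so it assumes field names stored as plain parenthesised strings.

```python
ANNOT_KEY = '/Annots'
ANNOT_FIELD_KEY = '/T'
ANNOT_RECT_KEY = '/Rect'
SUBTYPE_KEY = '/Subtype'
WIDGET_SUBTYPE_KEY = '/Widget'
PARENT_KEY = '/Parent'  # assumption: a widget without /T takes its name from its parent


def collect_widget_rects(pages):
    """Return (page_number, field_name, (llx, lly, urx, ury)) for each widget.

    `pages` is any sequence of dict-like page objects, e.g. PdfReader(...).pages.
    """
    results = []
    for page_number, page in enumerate(pages, start=1):
        # A page without annotations has no /Annots entry at all.
        for annotation in page.get(ANNOT_KEY) or []:
            if annotation.get(SUBTYPE_KEY) != WIDGET_SUBTYPE_KEY:
                continue
            rect = annotation.get(ANNOT_RECT_KEY)
            if not rect:
                continue
            name = annotation.get(ANNOT_FIELD_KEY)
            if name is None:
                parent = annotation.get(PARENT_KEY)
                name = parent.get(ANNOT_FIELD_KEY) if parent is not None else None
            # Strip the surrounding parentheses, as in the fill-in code.
            label = str(name)[1:-1] if name else None
            results.append((page_number, label, tuple(float(v) for v in rect)))
    return results


# Usage with pdfrw (assumes pdfrw is installed and the file exists):
#   from pdfrw import PdfReader
#   for entry in collect_widget_rects(PdfReader('RED-46808(Short).pdf').pages):
#       print(entry)
```

Unlike the fill-in function, this walks every page rather than only pages[0]. Nested field hierarchies deeper than one parent, or hex-encoded names, would need extra handling.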

Try working with pdfminer if you haven't! It has great support and a lot of useful features.

You can also try PyMuPDF, which can help you locate text, and PyPDF2 for highlighting. This won't create a bounding box, but you could insert some text next to each unfilled field, such as "empty field", and highlight it, which would work as an alternative to what you need.

I am not sure whether any Python PDF package can create bounding boxes.

To create an actual bounding box, you might have to convert the PDF into an image, identify the unfilled fields in the image, and then draw the bounding boxes using a package like OpenCV. That takes a lot of effort, and I am not sure this method will always work or be feasible in the long run. You would then also need to convert the image back into a PDF, so that's a pretty long pipeline.
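One candidate that may be worth checking is PyMuPDF itself: its pages expose form widgets through page.widgets(), and page.draw_rect() draws a rectangle onto the page. The sketch below is untested against the file in the question, and the function name outline_form_fields is made up for illustration. It accepts any document-like object whose pages offer those two methods, which is how a document opened with PyMuPDF behaves.

```python
def outline_form_fields(doc, color=(1, 0, 0)):
    """Draw a rectangle around every form widget and report what was found.

    Returns a list of (page_number, field_name, (x0, y0, x1, y1)).
    """
    found = []
    for page_number, page in enumerate(doc, start=1):
        # Collect the widgets first, then draw, so the page is not
        # modified while its widgets are being iterated.
        widgets = list(page.widgets())
        for widget in widgets:
            rect = widget.rect
            page.draw_rect(rect, color=color, width=1)
            found.append(
                (page_number, widget.field_name, (rect.x0, rect.y0, rect.x1, rect.y1))
            )
    return found


# Usage with PyMuPDF (assumes it is installed; 'boxed.pdf' is a hypothetical output name):
#   import fitz
#   doc = fitz.open('RED-46808(Short).pdf')
#   for entry in outline_form_fields(doc):
#       print(entry)
#   doc.save('boxed.pdf')
```

Note that PyMuPDF reports rectangles with the origin at the top-left of the page, so these numbers will not match raw /Rect values read with another library.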
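If you do go through images, the field coordinates have to be mapped from PDF space to pixels before anything can be drawn. A minimal sketch of that mapping follows, assuming the page's MediaBox starts at (0, 0), the page is not rotated, and the image was rendered at a known DPI. The function name pdf_rect_to_pixel_box is made up for illustration. PDF user space defaults to 72 units per inch with the origin at the bottom-left, while images put the origin at the top-left, so the y axis has to be flipped.

```python
def pdf_rect_to_pixel_box(rect, page_height_pt, dpi=72):
    """Convert a PDF [llx, lly, urx, ury] rectangle to pixel coordinates.

    Returns (left, top, right, bottom) with the origin at the top-left,
    which is the form that image libraries usually expect.
    """
    scale = dpi / 72.0  # default PDF user space: 72 units per inch
    x_a, y_a, x_b, y_b = (float(v) for v in rect)
    # /Rect is not guaranteed to list the lower-left corner first.
    llx, urx = min(x_a, x_b), max(x_a, x_b)
    lly, ury = min(y_a, y_b), max(y_a, y_b)
    left = round(llx * scale)
    right = round(urx * scale)
    top = round((page_height_pt - ury) * scale)      # flip the y axis
    bottom = round((page_height_pt - lly) * scale)
    return left, top, right, bottom


# Example: a field on a US Letter page (792 pt tall) rendered at 144 DPI.
box = pdf_rect_to_pixel_box([72, 700, 172, 720], page_height_pt=792, dpi=144)
print(box)  # (144, 144, 344, 184)
```

The resulting tuple can be passed as the two corner points to a drawing call such as OpenCV's cv2.rectangle.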
