
How do I format text extracted from a PDF as JSON in Python?

I have extracted the text of some invoice PDFs using PyPDF2. I want to convert this text into a JSON file that contains only the important keywords and tokens.

The output should be something like:

#PurchaseOrder

{

"doctype":"PO",

"orderingcompany":"Demo Company",

"suppliercompany":"Demo Company",

"shipto":"Test Customer",

"ponum":"PO1234",

"podate":"01-01-2019",

"totalamount":"$1234.50",

"currency":"SGD"

}

A sample text that I have obtained from a pdf is:

PACE MEMBERSHIP WARE HOUSE

4115 Whispering Pines Circle

Grand Prairie, TX 75051

972

336

7141

56929268

PURCHASE ORDER

TO:

Elmer A. Hua

A+ Investments

1223 Cerullo Road

Lexington, KY 40507

[Phone Number]

SHIP TO:

Laurel Yan

Pace Membership Warehouse

4115 Whispering Pines Circle

Grand Prairie, TX 75051

972

336

7141

PO NUMBER:

PO/18

19081

[The PO number must appear on all related correspondence, shipping papers, and invoices]

PO DATE

REQUISITIONER

SHIPPED VIA

FOB POINT

TERMS

7/15/2006

QTY

UNIT

DESCRIPTION

UNIT PRICE

TOTAL (SGD)

100.00

1

Interlock Drifit Round Neck, ILRN

13.50

1,350.00

SUBTOTAL

1,350.00

SALES TAX

200.00

1.

Please send two copies of your invoice.

2.

Enter this order in accordance with the prices, terms, delivery method, and specifications listed above.

3.

Please notify us immediately if you are unable to ship as specified.

4.

Send all correspondence to:

Laurel Yan

4115 Whispering Pines Circle

Gra nd Prairie, TX 75051

972

336

7141

56929268

SHIPPING AND HANDLIN G

OTHER

TOTAL

1,550.00

Authorized by Laurel Yan

7/15/2006

Now that you have posted the text, it might be a good idea to edit your post and remove the addresses.

To answer your question: you are going to have to loop through this text line by line, record the sections you need, and save them to JSON.
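As a rough illustration of that line-by-line approach, here is a minimal sketch. It assumes the layout of the sample in the question: the ordering company is the first line, the supplier company is the second line after "TO:", the grand total follows the last bare "TOTAL" label, and the PO number fragments were originally joined by a hyphen that the extraction dropped. The helper names (`line_after`, `parse_purchase_order`) are made up for this example, and other invoices will almost certainly need different rules.

```python
import json
import re

# Abbreviated copy of the extracted text from the question (blank lines omitted).
SAMPLE = """\
PACE MEMBERSHIP WARE HOUSE
PURCHASE ORDER
TO:
Elmer A. Hua
A+ Investments
SHIP TO:
Laurel Yan
PO NUMBER:
PO/18
19081
[The PO number must appear on all related correspondence, shipping papers, and invoices]
PO DATE
7/15/2006
TOTAL (SGD)
1,350.00
TOTAL
1,550.00
"""


def line_after(lines, label, offset=1):
    """Return the line `offset` places after the first exact match of `label`, else None."""
    if label not in lines:
        return None
    i = lines.index(label) + offset
    return lines[i] if i < len(lines) else None


def parse_purchase_order(text):
    # Drop the blank lines the extraction leaves between tokens.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # PO number: collect the short code-like fragments that follow the label.
    po_parts = []
    if "PO NUMBER:" in lines:
        for ln in lines[lines.index("PO NUMBER:") + 1:]:
            if not re.fullmatch(r"[A-Z0-9/]+", ln):
                break
            po_parts.append(ln)

    # PO date: first line that looks like m/d/yyyy.
    date = next((ln for ln in lines if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", ln)), None)

    # Grand total: the line right after the last bare "TOTAL" label.
    total_positions = [i for i, ln in enumerate(lines) if ln == "TOTAL"]
    total = None
    if total_positions and total_positions[-1] + 1 < len(lines):
        total = lines[total_positions[-1] + 1]

    # Currency: the three-letter code in the "TOTAL (XXX)" column header.
    currency = re.search(r"TOTAL \(([A-Z]{3})\)", text)

    return {
        "doctype": "PO" if "PURCHASE ORDER" in lines else None,
        "orderingcompany": lines[0] if lines else None,  # assumption: first line
        "suppliercompany": line_after(lines, "TO:", offset=2),  # assumption: name, then company
        "shipto": line_after(lines, "SHIP TO:"),
        "ponum": "-".join(po_parts) or None,  # assumption: fragments were hyphen-separated
        "podate": date,
        "totalamount": total,
        "currency": currency.group(1) if currency else None,
    }


result = parse_purchase_order(SAMPLE)
print(json.dumps(result, indent=2))
```

To save the result instead of printing it, pass an open file to `json.dump`. Fixed offsets like these are fragile, so check the output against a few real invoices before relying on them.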

If you only want a subset of the page by location, that has been asked before: "How to extract text from a Specific Area in a PDF using Python?"
