I have extracted the text of some invoice PDFs using PyPDF2. I want to convert this text into a JSON file that contains only the important keywords and tokens.
The output should look something like this:
#PurchaseOrder
{
"doctype":"PO",
"orderingcompany":"Demo Company",
"suppliercompany":"Demo Company",
"shipto":"Test Customer",
"ponum":"PO1234",
"podate":"01-01-2019",
"totalamount":"$1234.50",
"currency":"SGD"
}
A sample of the text I obtained from a PDF is:
PACE MEMBERSHIP WARE HOUSE
4115 Whispering Pines Circle
Grand Prairie, TX 75051
7141
56929268
PURCHASE ORDER
TO:
Elmer A. Hua
A+ Investments
1223 Cerullo Road
Lexington, KY 40507
[Phone Number]
SHIP TO:
Laurel Yan
Pace Membership Warehouse
4115 Whispering Pines Circle
Grand Prairie, TX 75051
7141
PO NUMBER:
19081
[The PO number must appear on all related correspondence, shipping papers, and invoices]
PO DATE
REQUISITIONER
SHIPPED VIA
FOB POINT
TERMS
7/15/2006
QTY
UNIT
DESCRIPTION
UNIT PRICE
TOTAL (SGD)
100.00
1
Interlock Drifit Round Neck, ILRN
13.50
1,350.00
SUBTOTAL
1,350.00
SALES TAX
200.00
1.
Please send two copies of your invoice.
2.
Enter this order in accordance with the prices, terms, delivery method, and specifications listed above.
3.
Please notify us immediately if you are unable to ship as specified.
4.
Send all correspondence to:
Laurel Yan
4115 Whispering Pines Circle
Gra nd Prairie, TX 75051
7141
56929268
SHIPPING AND HANDLIN G
OTHER
TOTAL
1,550.00
Authorized by Laurel Yan
7/15/2006
Since you have posted the full text, it might be a good idea to edit your post to remove the addresses.
To answer your question: you are going to have to loop through this text line by line, record the sections you need, and save them to JSON.
If you just want a subset of the page by location, this has been asked before: How to extract text from a Specific Area in a PDF using Python?
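The line-by-line approach could look roughly like the sketch below. It is only an illustration built around the sample text in the question, not a general invoice parser. The helper names (`value_after`, `parse_purchase_order`) are made up for this example, and the rules it relies on are assumptions about this particular layout: the first line is the ordering company, the supplier is the second line after `TO:`, each label is followed directly by its value, and the currency appears in a `TOTAL (XXX)` column header. Other invoices will need different rules.

```python
import json
import re

# Assumed formats, taken from the sample text only.
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")    # e.g. 7/15/2006
CURRENCY_RE = re.compile(r"^TOTAL \(([A-Z]{3})\)$")  # e.g. TOTAL (SGD)


def value_after(lines, label, offset=1):
    """Return the line `offset` positions after the first exact match of `label`."""
    for i, line in enumerate(lines):
        if line == label and i + offset < len(lines):
            return lines[i + offset]
    return None


def parse_purchase_order(text):
    """Walk the extracted text and collect the fields of interest into a dict."""
    # Drop blank lines and surrounding whitespace left over from extraction.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    record = {
        "doctype": "PO" if "PURCHASE ORDER" in lines else None,
        "orderingcompany": lines[0] if lines else None,       # assumption: first line
        "suppliercompany": value_after(lines, "TO:", offset=2),  # assumption: name, then company
        "shipto": value_after(lines, "SHIP TO:"),
        "ponum": value_after(lines, "PO NUMBER:"),
        "podate": next((ln for ln in lines if DATE_RE.match(ln)), None),
        "totalamount": value_after(lines, "TOTAL"),
        "currency": None,
    }
    for ln in lines:
        match = CURRENCY_RE.match(ln)
        if match:
            record["currency"] = match.group(1)
            break
    return record


# Shortened version of the sample text, with the addresses left out.
sample = """PACE MEMBERSHIP WARE HOUSE
PURCHASE ORDER
TO:
Elmer A. Hua
A+ Investments
SHIP TO:
Laurel Yan
PO NUMBER:
19081
7/15/2006
TOTAL (SGD)
TOTAL
1,550.00"""

print(json.dumps(parse_purchase_order(sample), indent=2))
```

Matching labels exactly (rather than with `in`) keeps `TO:` from matching `SHIP TO:` and `TOTAL` from matching `SUBTOTAL`. To write the result to a file instead of printing it, `json.dump(record, open_file, indent=2)` can be used in the same way.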