Extract products and prices from invoice

I want to extract information from PDFs.

The following is an extract from a policy, after converting the PDF to a text file with https://github.com/yob/pdf-reader/ :

Vehicle Description          2007, PORSCHE, CAYMAN 3.2

Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


MY COVER DETAILS

Cover                                                                                 USD37.45

I would like to extract, for example, the vehicle description, registration number and cost of cover:

vehicle.description => "2007, PORSCHE, CAYMAN 3.2"
vehicle.registration => "USD-2394"
vehicle.cost_of_cover => "37.45"

Can anyone please advise on an appropriate method? The problem is that the layout of the policy might change, but the data will mostly be the same, just with different values.

If regex is the way to go, could someone provide example code?

Finding the description

/Vehicle Description((?!Registration$).*)Registration/m

Finding the Registration Number

/Registration Number((?!Vin$).*)Vin/m

Finding the cost of cover

/Cover(.*)/m

These are all fairly loose patterns, since you did not provide many different samples, but they should get you started.

Example Usage:

match = /Vehicle Description((?!Registration$).*)Registration/m.match(PDFTEXT)

http://www.ruby-doc.org/core-2.0/Regexp.html
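
Regexp#match returns a MatchData object, or nil when the pattern is not found, and the captured text sits at index 1 with its surrounding padding still attached. A minimal sketch of reading the capture, assuming the sample text from the question is held in a local variable (the name sample_text is only illustrative):

```ruby
# Sample text from the question, standing in for the pdf-reader output.
sample_text = <<~TEXT
  Vehicle Description          2007, PORSCHE, CAYMAN 3.2

  Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


  MY COVER DETAILS

  Cover                                                                                 USD37.45
TEXT

match = /Vehicle Description((?!Registration$).*)Registration/m.match(sample_text)

# match is nil when nothing was found, so guard before indexing.
# Group 1 still contains the column padding and newlines; strip removes them.
description = match && match[1].strip
puts description
```

The same guard applies to the registration and cover patterns.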

You can do this pretty easily with regular expressions (regexp). Assume that your PDF text is stored in the variable text:

description = text.scan(/Vehicle Description(.*)Registration/m).flatten[0].strip
registration = text.scan(/Registration Number(.*)Vin/m).flatten[0].strip
cover = text.scan(/Cover(.*)/m).flatten[0].strip
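
Two things to watch with the one-liners above: with the /m flag, `.` also matches newlines, so /Cover(.*)/m captures everything from the first "Cover" to the end of the text, and `.flatten[0].strip` raises NoMethodError when a pattern does not match. A somewhat more defensive sketch follows. The helper names extract_field and parse_policy are hypothetical, and the patterns assume that each value sits on the same line as its label and that the price carries a three-letter currency prefix, as in the sample:

```ruby
# Hypothetical helper: first capture group, stripped, or nil if no match.
def extract_field(text, pattern)
  value = text[pattern, 1]
  value && value.strip
end

# Hypothetical helper: collects the three fields from the question.
# Assumes label and value share a line and the price looks like "USD37.45".
def parse_policy(text)
  {
    # Without /m, "." stops at the newline, so this takes the rest of the line.
    description:   extract_field(text, /Vehicle Description\s+(.+?)\s*$/),
    # The registration is the first run of non-space characters after the label.
    registration:  extract_field(text, /Registration Number\s+(\S+)/),
    # "Cover" at the start of a line, then a currency code, then the amount.
    cost_of_cover: extract_field(text, /^\s*Cover\s+[A-Z]{3}(\d+(?:\.\d+)?)/)
  }
end

sample_text = <<~TEXT
  Vehicle Description          2007, PORSCHE, CAYMAN 3.2

  Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


  MY COVER DETAILS

  Cover                                                                                 USD37.45
TEXT

vehicle = parse_policy(sample_text)
p vehicle
```

A field whose label is missing comes back as nil instead of raising, which makes it easier to spot when a layout change has broken one of the patterns.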
