Extract products and prices from invoice

Question

I want to extract information from PDFs.

The following is an excerpt from a policy document. The PDF was converted to plain text using https://github.com/yob/pdf-reader/ .

Vehicle Description          2007, PORSCHE, CAYMAN 3.2

Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


MY COVER DETAILS

Cover                                                                                 USD37.45

I would like to extract, e.g., the vehicle description and the cost of cover:

vehicle.description => "2007, PORSCHE, CAYMAN 3.2"
vehicle.registration => "USD-2394"
vehicle.cost_of_cover => "37.45"

Can anyone please advise on an appropriate method? The problem is that the layout of the policy might change, but the data will mostly be the same, just with different values.

If regex is the way to go, could someone provide example code?

Answer 1

Finding the description

/Vehicle Description((?!Registration$).*)Registration/m

Finding the Registration Number

/Registration Number((?!Vin$).*)Vin/m

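Finding the cost of cover (no /m flag here, so the capture stops at the end of the line instead of running to the end of the text)

/Cover(.*)/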

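These are all fairly loose regex matches, since you did not provide many different samples, but they should get you started. In each pattern the value you want is in the first capture group, surrounded by whitespace that still needs stripping.

Example usage (the captured text is then available as match[1]):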

match = /Vehicle Description((?!Registration$).*)Registration/m.match(PDFTEXT)

http://www.ruby-doc.org/core-2.0/Regexp.html
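
A minimal sketch of how the capture group could be read out safely, assuming the text looks like the sample in the question. Regexp#match returns nil when nothing matches, so calling [1] on the result directly would raise an error if the layout changes. The helper name first_capture and the "Policy Holder" pattern are hypothetical and only there for illustration.

```ruby
# Sample text copied from the question; real input would come from pdf-reader.
pdf_text = <<~TEXT
  Vehicle Description          2007, PORSCHE, CAYMAN 3.2

  Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS
TEXT

# first_capture is a hypothetical helper, not part of Ruby or pdf-reader.
# It returns the stripped first capture group, or nil when the pattern
# does not match at all.
def first_capture(text, pattern)
  match = pattern.match(text)
  match && match[1].strip
end

description  = first_capture(pdf_text, /Vehicle Description((?!Registration$).*)Registration/m)
registration = first_capture(pdf_text, /Registration Number((?!Vin$).*)Vin/m)

# A label that is absent from the text (made up for this example) yields nil
# rather than an exception.
missing = first_capture(pdf_text, /Policy Holder(.*)Address/m)
```

Checking the results for nil before using them gives you a simple way to detect that a policy arrived in a layout the patterns do not cover.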

Answer 2

You can do this fairly easily with regular expressions. Assuming your PDF text is stored in the variable text:

description = text.scan(/Vehicle Description(.*)Registration/m).flatten[0].strip
registration = text.scan(/Registration Number(.*)Vin/m).flatten[0].strip
cover = text.scan(/Cover(.*)/).flatten[0].strip
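
Note that flatten[0] is nil when a label is not found, so the .strip calls above would raise NoMethodError on a policy with a different layout. One possible way to make this more tolerant is to keep the patterns in a table and look each one up with String#[], which returns nil on no match. This is only a sketch: the patterns are assumptions based on the single sample in the question (the description sits alone on its line, the registration contains no spaces, and the cost may carry a three-letter currency prefix), and the names field_patterns and extract_fields are hypothetical.

```ruby
# Sample text copied from the question; real input would come from pdf-reader.
text = <<~TEXT
  Vehicle Description          2007, PORSCHE, CAYMAN 3.2

  Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


  MY COVER DETAILS

  Cover                                                                                 USD37.45
TEXT

# Hypothetical field table; each pattern captures the value in group 1.
field_patterns = {
  # Everything after the label up to the end of that line.
  description: /Vehicle Description\s+(.+?)\s*$/,
  # The first run of non-whitespace characters after the label.
  registration: /Registration Number\s+(\S+)/,
  # A line starting with "Cover", an optional currency code, then the amount.
  cost_of_cover: /^[ \t]*Cover[ \t]+(?:[A-Z]{3})?[ \t]*(\d+(?:\.\d+)?)/
}

# Returns a hash of field name => captured string, or nil for fields
# whose pattern did not match.
def extract_fields(text, patterns)
  patterns.transform_values { |pattern| text[pattern, 1] }
end

fields = extract_fields(text, field_patterns)
empty  = extract_fields("some unrelated text", field_patterns)
```

With the sample text, fields[:cost_of_cover] holds the amount without the currency prefix, and every value in empty is nil, which you can use to flag policies that need a new pattern.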

2 answers

Answer 1: 1, ACCEPTED, 2013-06-19 23:15:03
Answer 2: 0, 2013-06-19 23:31:05