I want to extract information from PDFs.
The following is an extract from a policy, where the PDF was converted to a text document using https://github.com/yob/pdf-reader/.
Vehicle Description 2007, PORSCHE, CAYMAN 3.2
Registration Number USD-2394 Vin Number FSDFKJL23123KFAS
MY COVER DETAILS
Cover USD37.45
I would like to extract, e.g., the vehicle description, registration number and cost of cover:
vehicle.description => "2007, PORSCHE, CAYMAN 3.2"
vehicle.registration => "USD-2394"
vehicle.cost_of_cover => "37.45"
Can anyone please advise on the appropriate method? The problem is that the layout of the policy might change, but the data will mostly be the same, just with different values.
If regex is the way to go, could anyone provide example code?
Finding the description
/Vehicle Description((?!Registration$).*)Registration/m
Finding the Registration Number
/Registration Number((?!Vin$).*)Vin/m
Finding the cost of cover
/Cover(.*)/m
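These are all fairly loose regex matches: with the m flag, . also matches newlines, so each .* runs as far as it can, and the Cover pattern captures everything from the first "Cover" to the end of the text. You did not provide many different samples, though, so these should get you started.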
Example Usage:
match = /Vehicle Description((?!Registration$).*)Registration/m.match(PDFTEXT)
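To read the value out of the match, something like the following should work. It is a minimal sketch: the variable pdf_text stands in for PDFTEXT and is filled with the sample from the question rather than real pdf-reader output.

```ruby
# Sample text copied from the question; in practice this would be the
# text produced by pdf-reader (assumption: one string for the whole page).
pdf_text = <<~TEXT
  Vehicle Description 2007, PORSCHE, CAYMAN 3.2
  Registration Number USD-2394 Vin Number FSDFKJL23123KFAS
  MY COVER DETAILS
  Cover USD37.45
TEXT

match = /Vehicle Description((?!Registration$).*)Registration/m.match(pdf_text)

# Regexp#match returns nil when nothing matches, so guard before indexing.
# Capture group 1 is the text between the two labels, including the
# surrounding whitespace and newline, hence the strip.
description = match ? match[1].strip : nil

puts description  # 2007, PORSCHE, CAYMAN 3.2
```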
You can do this pretty easily with regular expressions (regexp). Assume that your PDF text is stored in the variable text:
description = text.scan(/Vehicle Description(.*)Registration/m).flatten[0].strip
registration = text.scan(/Registration Number(.*)Vin/m).flatten[0].strip
cover = text.scan(/Cover(.*)/m).flatten[0].strip
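Note that flatten[0].strip raises NoMethodError when a label is missing, and the Cover value still carries the "USD" prefix. A slightly stricter variant is sketched below, assuming the labels stay the same while the layout shifts. The method name extract_policy_fields and the patterns are illustrative, not part of pdf-reader, and the three-letter currency prefix is an assumption based on the single sample.

```ruby
# Hypothetical helper: returns a hash of the three fields, with nil for
# any field whose label is not found in the text.
def extract_policy_fields(text)
  patterns = {
    # Non-greedy match up to the next label, even across a line break.
    description:   /Vehicle Description\s+(.+?)\s+Registration Number/m,
    # The registration is assumed to be a single whitespace-free token.
    registration:  /Registration Number\s+(\S+)/,
    # Anchored to a line start so "MY COVER DETAILS" is not picked up;
    # the currency code (assumed three capitals) is skipped.
    cost_of_cover: /^\s*Cover\s+[A-Z]{3}\s*(\d+(?:\.\d+)?)/
  }

  patterns.each_with_object({}) do |(name, pattern), result|
    # String#[] with a regexp and a group index returns nil on no match.
    result[name] = text[pattern, 1]&.strip
  end
end

sample = <<~TEXT
  Vehicle Description 2007, PORSCHE, CAYMAN 3.2
  Registration Number USD-2394 Vin Number FSDFKJL23123KFAS
  MY COVER DETAILS
  Cover USD37.45
TEXT

fields = extract_policy_fields(sample)
puts fields[:description]    # 2007, PORSCHE, CAYMAN 3.2
puts fields[:registration]   # USD-2394
puts fields[:cost_of_cover]  # 37.45
```

Keeping the patterns in one hash means a layout change only requires touching the affected pattern, and a nil value signals which field needs attention.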