So I have a bunch of .txt files containing text extracted from PDFs, like so:
---
Name:
ID Number:
--
CONFIDENTIAL
.
Date:
Description:
Foo Bar
ABC456789
THIS PAGE INTENTIONALLY LEFT BLANK.
05/04/17
Lorem ipsum dolor sit amet
Among all this noise, I would like to extract a couple of target fields and ignore the rest of the information:
Name: Foo Bar
ID Number: ABC456789
Date: 05/04/17
Description: Lorem ipsum dolor sit amet
Most of the documents I am dealing with share the same format, so until now it has been possible to note the line numbers at which the target values appear and save those. Of course, this is a crude solution, because documents in other formats will be parsed to .txt differently. It seems it should be possible to extract this information with machine learning: I have done a lot of the extraction by hand, so I have sufficient training data, and whenever a new file format comes up I can manually label that as well. For a given ML algorithm, how would you supervise it and supply it with this pattern?
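For reference, the crude line-number approach described above can be sketched as follows. The field-to-line mapping (`LAYOUT_A`) is a hypothetical example matching the extract shown at the top of the question; a real setup would keep one such mapping per known format.

```python
# A minimal sketch of the fixed-line-number approach, assuming one known
# layout where each target value appears at a fixed position in the extract.

# Hypothetical mapping for the layout shown above: field name -> 0-based
# line index in the extracted .txt.
LAYOUT_A = {
    "Name": 8,
    "ID Number": 9,
    "Date": 11,
    "Description": 12,
}

def extract_fields(text, layout):
    """Pull target values out of a raw extract using fixed line positions."""
    lines = text.splitlines()
    return {field: lines[idx].strip() for field, idx in layout.items()}

sample = """---
Name:
ID Number:
--
CONFIDENTIAL
.
Date:
Description:
Foo Bar
ABC456789
THIS PAGE INTENTIONALLY LEFT BLANK.
05/04/17
Lorem ipsum dolor sit amet"""

print(extract_fields(sample, LAYOUT_A))
```

This works only while every document of a given format lands the values on exactly the same lines, which is precisely the fragility the question is about.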
Some ideas of mine that you could challenge:
I know it's an opinionated question (and that this cannot be done overnight), but I would appreciate any cues!
If the original PDF file comes in table format, I would suggest using table extraction, because that will be the most reliable way to ensure you get the correct fields, based on the information you shared above.
A CNN or CRF seems like overkill to me for such a simple example. A simple decision tree, or any off-the-shelf supervised ML approach, would probably suffice (again, based on the example you shared above).
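To make the decision-tree suggestion concrete, one common framing is per-line classification: turn each line of an extract into a small feature vector (relative position, digit count, a date-like pattern, etc.) and train a classifier to label it as one of the target fields or as noise. Everything below is an illustrative sketch, not a fixed recipe; the features, labels, and use of scikit-learn's `DecisionTreeClassifier` are assumptions.

```python
# A sketch of an off-the-shelf supervised approach: classify each line of an
# extract as a target field ("name", "id", "date", "description") or "noise".
import re
from sklearn.tree import DecisionTreeClassifier

def featurize(line, idx, total):
    """Turn one line into a small numeric feature vector."""
    return [
        idx / total,                        # relative position in the document
        len(line),                          # raw length
        sum(c.isdigit() for c in line),     # digit count
        int(bool(re.fullmatch(r"\d{2}/\d{2}/\d{2}", line.strip()))),  # date-like
        int(line.strip().isupper()),        # all caps, typical of boilerplate
        len(line.split()),                  # token count
    ]

# Hand-labeled training data: the example document and one label per line.
doc = ["---", "Name:", "ID Number:", "--", "CONFIDENTIAL", ".",
       "Date:", "Description:", "Foo Bar", "ABC456789",
       "THIS PAGE INTENTIONALLY LEFT BLANK.", "05/04/17",
       "Lorem ipsum dolor sit amet"]
labels = ["noise", "noise", "noise", "noise", "noise", "noise",
          "noise", "noise", "name", "id", "noise", "date", "description"]

X = [featurize(line, i, len(doc)) for i, line in enumerate(doc)]
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Predict on the training document itself; a real setup would hold out
# labeled documents from other formats to measure generalization.
pred = clf.predict(X)
print(list(pred))
```

New formats then become additional labeled documents in the training set rather than new hard-coded line-number tables, which matches the workflow described in the question.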