
Information retrieval from unstructured text files by machine learning

So I have a bunch of .txt files that are extracts of PDFs as strings like so:

---
Name:
ID Number:
--
CONFIDENTIAL
.
Date:
Description:
Foo Bar
ABC456789
THIS PAGE INTENTIONALLY LEFT BLANK.
05/04/17
Lorem ipsum dolor sit amet

Among all this noise, I would like to extract a couple of target fields and ignore the rest of the information:

Name: Foo Bar
ID Number: ABC456789
Date: 05/04/17
Description: Lorem ipsum dolor sit amet

Most of the documents I am dealing with share the same format, so until now I have been able to note the line numbers at which the target values appear and read the values from those lines. This is a crude solution, though, because other document formats come out differently when converted to .txt. It seems possible to extract the information with machine learning: I have done a lot of this by hand and therefore have sufficient training data, and whenever a new file format comes up I can label examples for it manually as well. For a given ML algorithm, how would you supervise it and supply it with this pattern?
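
For reference, the line-number approach described above can be sketched roughly as follows. The template name and the zero-based indices are assumptions read off the single sample shown earlier; every other layout would need its own template, which is exactly why the approach is crude.

```python
# Hypothetical template for one known layout: field name -> zero-based line index.
# The indices are taken from the sample extract in the question.
TEMPLATE_A = {"Name": 8, "ID Number": 9, "Date": 11, "Description": 12}

def extract_by_line_number(text, template):
    """Read each target field from its fixed line position."""
    lines = text.splitlines()
    return {
        field: (lines[i].strip() if i < len(lines) else None)
        for field, i in template.items()
    }

sample_text = "\n".join([
    "---", "Name:", "ID Number:", "--", "CONFIDENTIAL", ".",
    "Date:", "Description:", "Foo Bar", "ABC456789",
    "THIS PAGE INTENTIONALLY LEFT BLANK.", "05/04/17",
    "Lorem ipsum dolor sit amet",
])

fields = extract_by_line_number(sample_text, TEMPLATE_A)
print(fields)
```

A file that is shorter than the template expects yields None for the missing fields instead of raising an error, but a file with a shifted layout silently returns the wrong lines.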

Some ideas of mine that you could challenge:

  • Regex is also a feasible option, but it doesn't work for everything because ID numbers do not follow a single format: an ID can be 1234567 as well as ABC456789. Maybe the ML model can be trained to come up with its own regex patterns based on what it is trained on. I think this might be relevant, but I'm unsure how: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
  • I could use Tabula to detect tables in the PDF and replace the unstructured table with CSV inside the text file before performing any ML.
  • A CNN or CRF is suited for data like this.
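
To illustrate the regex idea from the first bullet: the two ID shapes mentioned there can be covered by one tolerant pattern. This is only a minimal sketch, assuming an ID is an optional prefix of up to three capital letters followed by six to nine digits, and that dates look like dd/dd/dd. Both patterns are guesses from the examples in this post, not a known specification, and `find_fields` is a hypothetical helper.

```python
import re

# Assumed patterns, inferred only from the example values in the post.
ID_RE = re.compile(r"[A-Z]{0,3}\d{6,9}")
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{2}")

def find_fields(lines):
    """Return the first line that fully matches each pattern."""
    found = {}
    for line in lines:
        text = line.strip()
        if "ID Number" not in found and ID_RE.fullmatch(text):
            found["ID Number"] = text
        elif "Date" not in found and DATE_RE.fullmatch(text):
            found["Date"] = text
    return found

sample_lines = [
    "---", "Name:", "ID Number:", "--", "CONFIDENTIAL", ".",
    "Date:", "Description:", "Foo Bar", "ABC456789",
    "THIS PAGE INTENTIONALLY LEFT BLANK.", "05/04/17",
    "Lorem ipsum dolor sit amet",
]

matches = find_fields(sample_lines)
print(matches)
```

Free-text fields such as Name and Description have no distinctive shape, so a pattern like this can only handle part of the problem.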

I know it's an opinionated question (and that this cannot be done overnight) but I would appreciate any cues!

If the original PDF file comes in table format, I would suggest using table extraction because that will be the most reliable way to ensure you get the correct fields, based on the information you shared above.

A CNN or CRF seems like overkill to me for such a simple example. A simple decision tree or any off-the-shelf supervised ML approach would probably suffice (again, based on the example you shared above).
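
To make the decision-tree suggestion a bit more concrete: one way to supervise it is to treat every line of the extracted text as one training example, label it by hand with the field it holds (or OTHER), and describe it with a few simple features. The sketch below hand-rolls a tiny tree over boolean features only to stay dependency-free; in practice an off-the-shelf implementation such as scikit-learn's DecisionTreeClassifier would be the natural choice. The feature set, the label names, and the helper functions are all illustrative assumptions, and a single labeled document is far too little data to generalize from.

```python
from collections import Counter

def features(line):
    """Boolean features for one line of extracted text (illustrative choice)."""
    t = line.strip()
    return (
        any(c.isdigit() for c in t),  # 0: contains a digit
        "/" in t,                     # 1: contains a slash
        any(c.isalpha() for c in t),  # 2: contains a letter
        t.isupper(),                  # 3: all cased characters are upper case
        len(t.split()) > 2,           # 4: more than two words
        t.endswith(":"),              # 5: looks like a field label
    )

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build(rows, labels):
    """Greedily grow a tree; a node is either a label or (feature, yes, no)."""
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    for f in range(len(rows[0])):
        yes = [l for r, l in zip(rows, labels) if r[f]]
        no = [l for r, l in zip(rows, labels) if not r[f]]
        if not yes or not no:
            continue  # this feature does not split the node
        score = len(yes) * gini(yes) + len(no) * gini(no)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:  # identical rows with mixed labels: take the majority
        return Counter(labels).most_common(1)[0][0]
    f = best[1]
    yes_idx = [i for i, r in enumerate(rows) if r[f]]
    no_idx = [i for i, r in enumerate(rows) if not r[f]]
    return (
        f,
        build([rows[i] for i in yes_idx], [labels[i] for i in yes_idx]),
        build([rows[i] for i in no_idx], [labels[i] for i in no_idx]),
    )

def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, yes, no = tree
        tree = yes if row[f] else no
    return tree

# Hand-labeled lines from the sample in the question.
LABELED = [
    ("---", "OTHER"), ("Name:", "OTHER"), ("ID Number:", "OTHER"),
    ("--", "OTHER"), ("CONFIDENTIAL", "OTHER"), (".", "OTHER"),
    ("Date:", "OTHER"), ("Description:", "OTHER"),
    ("Foo Bar", "NAME"), ("ABC456789", "ID"),
    ("THIS PAGE INTENTIONALLY LEFT BLANK.", "OTHER"),
    ("05/04/17", "DATE"), ("Lorem ipsum dolor sit amet", "DESC"),
]

rows = [features(text) for text, _ in LABELED]
labels = [label for _, label in LABELED]
tree = build(rows, labels)

print(predict(tree, features("1234567")))
```

Trained on this single example, the tree labels an unseen all-digit line such as 1234567 as ID, because the rule it learned is "has a digit but no slash". Real data would need many more labeled files, and probably context features as well (for example, what the preceding lines look like).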


 