Information retrieval from unstructured text files by machine learning

I have a bunch of .txt files containing text extracted from PDFs, like so:

---
Name:
ID Number:
--
CONFIDENTIAL
.
Date:
Description:
Foo Bar
ABC456789
THIS PAGE INTENTIONALLY LEFT BLANK.
05/04/17
Lorem ipsum dolor sit amet

Among all this noise, I would like to extract a few target fields and ignore the rest of the information:

Name: Foo Bar
ID Number: ABC456789
Date: 05/04/17
Description: Lorem ipsum dolor sit amet

Most of the documents I am dealing with share the same format, so until now I have simply noted the line numbers at which the target values appear and saved those. Of course, this is a crude solution, because documents in other formats will be parsed to .txt differently. It seems possible to extract the information through machine learning, since I have done a lot of this by hand and therefore have sufficient training data. For any new file format that comes up, I can label examples manually as well. For a given ML algorithm, how would you supervise it and supply it with this pattern?
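
One way to read "supervising" here is to turn the existing line-number notes into per-line labels. The following is a minimal sketch under that assumption: `TEMPLATE` and `label_lines` are hypothetical names, the line indices are 0-based positions in the sample above, and `"O"` is an assumed label for lines to ignore.

```python
# The sample extract shown above, one PDF-derived string per line.
SAMPLE = """---
Name:
ID Number:
--
CONFIDENTIAL
.
Date:
Description:
Foo Bar
ABC456789
THIS PAGE INTENTIONALLY LEFT BLANK.
05/04/17
Lorem ipsum dolor sit amet"""

# Hypothetical hand-made note for this one format:
# field name -> 0-based index of the line holding its value.
TEMPLATE = {"Name": 8, "ID Number": 9, "Date": 11, "Description": 12}


def label_lines(text, template):
    """Pair every line with a field label, or "O" for lines to ignore."""
    index_to_field = {index: field for field, index in template.items()}
    return [
        (line, index_to_field.get(i, "O"))
        for i, line in enumerate(text.splitlines())
    ]


for line, label in label_lines(SAMPLE, TEMPLATE):
    print(f"{label:12} {line}")
```

Each `(line, label)` pair is one training example. Files that were already processed by hand would yield many such pairs, and a new format only needs a new template for a few files.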

Some ideas of mine that you could challenge:

  • Regex is a feasible option, but it doesn't work for everything because ID numbers do not follow a single format; an ID can be 1234567 as well as ABC456789. Maybe the ML model can be trained to come up with its own regex patterns based on its training data. I think this might be relevant, but I'm unsure how: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
  • I could use Tabula to detect tables in the PDF and replace the unstructured table with CSV inside the text file before performing any ML.
  • A CNN or CRF is suited to data like this.
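
For the regex idea, a single pattern can cover both ID shapes mentioned above. This is only a sketch: the pattern (up to three capital letters followed by six to nine digits) is an assumption generalized from the two examples, and `find_id_lines` is a hypothetical helper name.

```python
import re

# Assumed ID shape: zero to three capital letters, then six to nine digits.
# Only "1234567" and "ABC456789" are known examples; real IDs may differ.
ID_PATTERN = re.compile(r"[A-Z]{0,3}\d{6,9}")


def find_id_lines(lines):
    """Return the lines that consist entirely of an ID-like token."""
    return [
        line.strip()
        for line in lines
        if ID_PATTERN.fullmatch(line.strip())
    ]


lines = ["Foo Bar", "ABC456789", "1234567", "05/04/17", "CONFIDENTIAL"]
print(find_id_lines(lines))  # ['ABC456789', '1234567']
```

Using `fullmatch` on the whole line keeps dates and free text from matching by accident. A pattern like this could also serve as one input feature to a learned model rather than as the final extractor.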

I know it's an opinion-based question (and that this cannot be done overnight), but I would appreciate any cues!

If the original PDF files come in table format, I would suggest using table extraction because, based on the information you shared above, that will be the most reliable way to ensure you get the correct fields.

A CNN or CRF seems like overkill to me for such a simple example. A simple decision tree or any off-the-shelf supervised ML approach would probably suffice (again, based on the example you shared above).
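
To make the decision-tree suggestion concrete, here is a rough sketch of per-line features such a model could split on. The feature names, both regexes, and the `classify` rules are all assumptions. In particular, `classify` is written by hand to show the kind of splits a tree might learn from labeled lines; it is not a trained model.

```python
import re

# Assumed shapes, generalized from the single sample in the question.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{2}")
ID_RE = re.compile(r"[A-Z]{0,3}\d{6,9}")


def line_features(line):
    """Describe one line with simple boolean features."""
    s = line.strip()
    return {
        "has_alnum": any(ch.isalnum() for ch in s),
        "ends_with_colon": s.endswith(":"),
        "is_date": bool(DATE_RE.fullmatch(s)),
        "is_id_like": bool(ID_RE.fullmatch(s)),
        "is_upper": s.isupper(),
        "is_title": s.istitle(),
    }


def classify(line):
    """Hand-written stand-in for a learned tree; "O" means ignore."""
    f = line_features(line)
    if not f["has_alnum"] or f["ends_with_colon"]:
        return "O"  # separators and field captions
    if f["is_date"]:
        return "Date"
    if f["is_id_like"]:
        return "ID Number"
    if f["is_upper"]:
        return "O"  # all-caps banners such as CONFIDENTIAL
    if f["is_title"]:
        return "Name"
    return "Description"


for line in ["Foo Bar", "ABC456789", "CONFIDENTIAL", "05/04/17",
             "Lorem ipsum dolor sit amet"]:
    print(classify(line), "<-", line)
```

In practice the feature dictionaries, together with hand-made labels, would be passed to an off-the-shelf learner so that the splits come from the data instead of being hard-coded.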
