Information retrieval from unstructured text files by machine learning
I have a number of .txt files, each holding the text extracted from a PDF as a string, like so:
---
Name:
ID Number:
--
CONFIDENTIAL
.
Date:
Description:
Foo Bar
ABC456789
THIS PAGE INTENTIONALLY LEFT BLANK.
05/04/17
Lorem ipsum dolor sit amet
Among all this noise, I would like to extract a few target fields and ignore the rest of the information:
Name: Foo Bar
ID Number: ABC456789
Date: 05/04/17
Description: Lorem ipsum dolor sit amet
Most of the documents I am dealing with share the same format, so until now I have been able to note the line numbers at which the target values appear and read the values from there. Of course, this is a crude solution, because other formats come out differently when converted to .txt. It seems like it should be possible to extract the information through machine learning, since I have done a lot of this by hand and therefore have sufficient training data. For any new file format that comes up, I can label examples by hand as well. For a given ML algorithm, how would you supervise it and supply it this pattern?
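For reference, the line-number approach described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the names `TEMPLATE_A` and `extract_by_line_number` are hypothetical, and the indices assume the extract consists of exactly the thirteen lines shown in the sample, including the leading `---` line.

```python
# Sample extract, copied from the question (assumed to be the whole file).
SAMPLE = """\
---
Name:
ID Number:
--
CONFIDENTIAL
.
Date:
Description:
Foo Bar
ABC456789
THIS PAGE INTENTIONALLY LEFT BLANK.
05/04/17
Lorem ipsum dolor sit amet
"""

# Hypothetical template for one known layout: the 0-based line index
# at which each target value appears in the extracted text.
TEMPLATE_A = {
    "Name": 8,
    "ID Number": 9,
    "Date": 11,
    "Description": 12,
}


def extract_by_line_number(text, template):
    """Look up each target field at its recorded line index."""
    lines = text.splitlines()
    result = {}
    for field, index in template.items():
        # None marks a file that is shorter than the template expects.
        result[field] = lines[index].strip() if index < len(lines) else None
    return result


fields = extract_by_line_number(SAMPLE, TEMPLATE_A)
for name, value in fields.items():
    print(f"{name}: {value}")
```

This makes the weakness visible: one template is needed per layout, and a single extra or missing line in the extract shifts every index.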
Some ideas I have that you could challenge:
I know it's an opinionated question (and that this cannot be done overnight), but I would appreciate any pointers!
If the original PDF file is laid out as a table, I would suggest using table extraction, because that will be the most reliable way to get the correct fields, based on the information you shared above.
A CNN or CRF seems like overkill to me for such a simple example. A simple decision tree or any off-the-shelf supervised ML approach would probably suffice (again, based on the example you shared above).
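One way to supply the pattern to such a model is to treat every line of the extract as one training example: turn it into a small feature vector and label it with the field it belongs to, or `OTHER` for noise. The sketch below shows only that supervision setup. The feature choices, the two regular expressions, and the names `line_features` and `labels_by_index` are illustrative assumptions based on the single sample in the question, not a tested recipe.

```python
import re

# Assumed value shapes, guessed from the one sample in the question.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{2}$")
ID_RE = re.compile(r"^[A-Z]{3}\d{6}$")

lines = [
    "---", "Name:", "ID Number:", "--", "CONFIDENTIAL", ".",
    "Date:", "Description:", "Foo Bar", "ABC456789",
    "THIS PAGE INTENTIONALLY LEFT BLANK.", "05/04/17",
    "Lorem ipsum dolor sit amet",
]

# Hand annotations for this file: line index -> field name.
labels_by_index = {8: "Name", 9: "ID Number", 11: "Date", 12: "Description"}


def line_features(lines, i):
    """Numeric features for line i; all of them are illustrative."""
    line = lines[i].strip()
    return [
        i,                                   # position in the file
        len(line),                           # line length
        int(line.endswith(":")),             # looks like a field label
        int(line.isupper()),                 # all-caps (banners, IDs)
        int(bool(DATE_RE.match(line))),      # looks like dd/mm/yy
        int(bool(ID_RE.match(line))),        # looks like the sample ID
        sum(ch.isdigit() for ch in line),    # digit count
    ]


# One row per line, one label per row.
X = [line_features(lines, i) for i in range(len(lines))]
y = [labels_by_index.get(i, "OTHER") for i in range(len(lines))]

for row, label in zip(X, y):
    print(label, row)
```

Rows from many annotated files can be concatenated and passed to an off-the-shelf classifier, for example scikit-learn's `DecisionTreeClassifier().fit(X, y)`. At prediction time each line of a new file is featurized the same way, and the lines predicted as a field supply its value. Because the line position is only one feature among several, a tree can fall back on the content-based features when a new layout shifts the lines.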