Moving away from simple regex extraction to NER?

Original post: 2022-04-19 14:08:07. Tags: python, amazon-web-services, named-entity-recognition, amazon-textract

We have a relatively "simple" project from the business: digitize some contract scans (PDF files) with OCR and extract entities from the text.

Entities can be something as simple as a specific price located in a certain subsection of the contract, or a generic definition of a process which can be found, e.g., somewhere around section 5. For the same entity, different formulations and languages are used interchangeably in different contracts.

We have a limited number of examples (10 to 20 per entity) to develop the extraction algorithm.

Given the specific nature of every entity, for the moment we have created many functions which act on the strings extracted by amazon-textract from the PDFs and use regex rules, plus some additional tinkering with the results, to get the things we need.

This is the best solution so far for immediate results, but it's quite hard to modify when something is not working. Furthermore, to improve results, only someone with knowledge of the code can intervene and modify it, basically by introducing a new "or" in the regex rule. And this is still quite annoying because we have to go back to the code and see where things are not working. Of course this is far from ideal.
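For readers unfamiliar with this kind of setup, here is a minimal sketch of what such a regex-based extractor might look like. The entity name, the pattern, and the sample sentence are all hypothetical and not taken from the actual project; the point is only to show how each new formulation ends up as another "or" branch inside the pattern.

```python
import re

# Hypothetical rule table: one compiled pattern per entity.
# Each "|" branch inside the group covers one known formulation or language.
ENTITY_PATTERNS = {
    "price": re.compile(
        r"(?:total price|prix total|gesamtpreis)\s*[:=]?\s*"
        r"(?P<value>\d+(?:[.,]\d{2})?)",
        re.IGNORECASE,
    ),
}


def extract_entities(text):
    """Return {entity_name: first matched value, or None if nothing matched}."""
    results = {}
    for name, pattern in ENTITY_PATTERNS.items():
        match = pattern.search(text)
        results[name] = match.group("value") if match else None
    return results


if __name__ == "__main__":
    # Made-up sample line standing in for OCR output.
    sample = "Section 5.2 - Total price: 1200,50 EUR payable within 30 days."
    print(extract_entities(sample))
```

Keeping the patterns in a single table like this at least separates the rules from the extraction logic, but supporting a new wording still means editing a regex by hand.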

I thought about using a Named Entity Recognition (NER) model trained on the input of users who could highlight the entities directly in the text, but given the limited training set, is it even possible to use such a method? I'm under the impression that, to have a consistent model, we need at least 100 examples per entity.

Is there any cleverer alternative to using just regex? Or, in general, how do you think our pipeline could be improved?

1 answer

Caveat: hacky way!

Duplicate the dataset. Annotate until it reaches 100, since that's the limit for AWS. Create a CSV file, and feed it to textract. Train the model.
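The duplication and CSV steps could be sketched roughly as follows. This is only an illustration: the function names, the two-column layout ("text", "entity_type"), and the example rows are assumptions made up here, not the format the AWS service actually expects, so check the service documentation for the real schema before using anything like this.

```python
import csv
import itertools


def pad_by_duplication(examples, target=100):
    """Repeat the annotated examples in order until there are `target` rows.

    If there are already at least `target` examples, return them unchanged.
    """
    if not examples:
        raise ValueError("need at least one annotated example")
    if len(examples) >= target:
        return list(examples)
    return list(itertools.islice(itertools.cycle(examples), target))


def write_csv(rows, path):
    """Write rows to a CSV file with a hypothetical two-column header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "entity_type"])  # assumed column names
        writer.writerows(rows)


if __name__ == "__main__":
    # Made-up annotations standing in for the 10 to 20 real ones per entity.
    annotated = [
        ("Total price: 1200,50 EUR", "PRICE"),
        ("Prix total: 980,00 EUR", "PRICE"),
    ]
    write_csv(pad_by_duplication(annotated, target=100), "training_data.csv")
```

Note that duplicated rows only satisfy the size requirement; they give the model no new information, which is why the answer calls this a hack.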