
Moving away from simple regex extraction to NER?

We have a relatively "simple" project from the business: digitize some scanned contracts (PDF files) with OCR and extract entities from the text.

Entities can be as simple as a specific price located in a certain subsection of the contract, or as loose as a generic definition of a process that can be found, e.g., somewhere around section 5. For the same entity, different formulations and languages are used interchangeably across contracts.

We have a limited number of examples (10 to 20 per entity) to develop the extraction algorithm.

Given the specific nature of every entity, for the moment we have created many functions that act on the strings extracted from the PDFs by amazon-textract and use regex rules, plus some additional tinkering with the results, to get what we need.
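For readers who want a concrete picture, the current approach can be sketched roughly as below. This is a minimal illustration, not our actual code: the entity names, the patterns, the `RULES` table and the `extract_entities` helper are all hypothetical, and the input is assumed to be plain text already returned by the OCR step.

```python
import re

# Hypothetical rule table: entity name -> list of alternative patterns.
# Entity names and patterns are illustrative, not taken from the real project.
RULES = {
    "price": [
        r"(?:total\s+price|prix\s+total)\s*[:=]?\s*(?P<value>\d[\d.,]*)\s*(?:EUR|€)",
    ],
    "notice_period": [
        r"notice\s+period\s+of\s+(?P<value>\d+)\s+(?:days|months)",
        r"préavis\s+de\s+(?P<value>\d+)\s+(?:jours|mois)",
    ],
}


def extract_entities(text, rules=RULES):
    """Return the first matching value per entity, or None if nothing matches."""
    found = {}
    for entity, patterns in rules.items():
        found[entity] = None
        for pattern in patterns:
            match = re.search(pattern, text, flags=re.IGNORECASE)
            if match:
                found[entity] = match.group("value")
                break  # stop at the first pattern that matches
    return found


# Assumed OCR output for one contract (made-up sample text).
sample = (
    "5.1 Total price: 12,500.00 EUR. "
    "Either party may terminate with a notice period of 30 days."
)
print(extract_entities(sample))
```

Keeping the patterns in a table like this at least separates the rules from the matching logic, but every new formulation still means adding another pattern or alternation by hand.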

This is the best solution so far for immediate results, but it is quite hard to modify when something is not working. Furthermore, to improve results, only someone who knows the code can intervene, basically by introducing a new "or" in the regex rule. This is still quite annoying, because we have to go back to the code and see where things are not working. Of course, this is far from ideal.

I thought about using a Named Entity Recognition (NER) model trained on input from users, who could highlight the entities directly in the text. But given the limited training set, is it even possible to use such a method? I'm under the impression that, to get a consistent model, we need at least 100 examples per entity.

Is there a cleverer alternative than using just regex? Or, more generally, how do you think our pipeline could be improved?

Caveat - Hacky way!!

1. Duplicate the dataset.
2. Annotate until it reaches 100, since that's the limit for AWS.
3. Create a CSV file and feed it to textract.
4. Train the model.
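The duplication and CSV steps above could be sketched as follows. This is only an illustration under stated assumptions: the `oversample` and `to_csv` helpers and the `text`/`entity`/`label` columns are hypothetical and do not reflect the file layout AWS actually expects, so check the service documentation before uploading anything. Keep in mind that exact duplicates add no new information, so this only gets past the count requirement.

```python
import csv
import io
from itertools import cycle, islice


def oversample(examples, target=100):
    """Repeat the annotated examples in order until `target` rows exist."""
    if not examples:
        raise ValueError("need at least one annotated example")
    if len(examples) >= target:
        return list(examples)
    return list(islice(cycle(examples), target))


def to_csv(rows, fieldnames):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up annotations; the column names are hypothetical.
examples = [
    {"text": "Total price: 12,500.00 EUR", "entity": "12,500.00", "label": "PRICE"},
    {"text": "Prix total : 9 800 EUR", "entity": "9 800", "label": "PRICE"},
]

rows = oversample(examples, target=100)
csv_text = to_csv(rows, ["text", "entity", "label"])
print(len(rows))
```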

