
Moving away from simple regex extraction to NER?

Original post 2022-04-19 14:08:07 2 1 python/ amazon-web-services/ named-entity-recognition/ amazon-textract

We have a relatively "simple" project from the business: digitize some scanned contracts (PDF files) with OCR and extract entities from the text.

Entities can be something as simple as a specific price located in a certain subsection of the contract, or a generic definition of a process which can be found, e.g., somewhere around section 5. For the same entity, different formulations and languages are used interchangeably in different contracts.

We have a limited number of examples (10 to 20 per entity) to develop the extraction algorithm.

Given the specific nature of every entity, for the moment we have created many functions that act on the strings extracted from the PDFs by amazon-textract. They use regex rules plus some additional tinkering with the results to get what we need.

This is the best solution so far for immediate results, but it is quite hard to modify when something is not working. Furthermore, only someone who knows the code can improve the results, basically by introducing a new "or" in the regex rule. Even that is quite annoying, because we have to go back to the code and find where things are not working. Of course this is far from ideal.
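To make the maintenance problem concrete, here is a minimal sketch of what one such rule might look like. The entity, the pattern, and the function name are hypothetical and not taken from our code base; the point is only that every new formulation found in a contract means adding one more alternative to the pattern by hand.

```python
import re

# Hypothetical rule for a "total price" entity. Each new formulation or
# language seen in a contract becomes one more "|" alternative here.
PRICE_LABEL = r"(?:total price|prix total|gesamtpreis)"
PRICE_RULE = re.compile(
    PRICE_LABEL
    + r"\s*[:\-]?\s*(?P<amount>\d+(?:[.,]\d{2})?)\s*(?P<currency>EUR|USD)",
    re.IGNORECASE,
)


def extract_price(text):
    """Return (amount, currency) for the first match, or None if nothing matches."""
    match = PRICE_RULE.search(text)
    if match is None:
        return None
    return match.group("amount"), match.group("currency").upper()
```

A formulation that is not listed in the label group (for example "Preis insgesamt") simply returns None until someone edits the pattern, which is exactly the bottleneck described above.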

I thought about using a Named Entity Recognition (NER) model trained on input from users, who could highlight the entities directly in the text. But given the limited training set, is it even possible to use such a method? I'm under the impression that, to have a consistent model, we need at least 100 examples per entity.
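As a rough illustration of the highlighting idea, the sketch below turns user highlights into character-offset spans, a format that NER training data is commonly expressed in. The function names and the dictionary layout are assumptions made for this example, not the input format of any particular library or service.

```python
def highlight_to_annotation(text, highlighted, label, search_from=0):
    """Turn one user highlight into a (start, end, label) character span."""
    start = text.find(highlighted, search_from)
    if start == -1:
        raise ValueError(f"highlight {highlighted!r} not found in text")
    return (start, start + len(highlighted), label)


def build_training_example(text, highlights):
    """Build one training example from (highlighted_string, label) pairs.

    The {"text": ..., "entities": [...]} layout is a hypothetical
    intermediate format; convert it to whatever the chosen NER tool expects.
    """
    spans = [highlight_to_annotation(text, h, label) for h, label in highlights]
    return {"text": text, "entities": sorted(spans)}
```

With this, a business user only has to mark the text, and nobody has to touch a regex to record a new formulation.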

Is there any cleverer alternative to using just regex? Or, in general, how do you think our pipeline could be improved?

1 answer

Caveat - Hacky way!!

Duplicate the dataset.
Annotate until it reaches 100, since that's the limit for AWS.
Create a CSV file and feed it to textract.
Train the model.
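The duplication and CSV steps could be sketched roughly as follows. This is only an illustration: the column names and the example dictionary keys are hypothetical, so check the documentation of the AWS service you train with for the exact annotation format it expects.

```python
import csv
import io
from itertools import cycle, islice

TARGET = 100  # the per-entity count mentioned above


def pad_by_duplication(examples, target=TARGET):
    """Repeat the annotated examples in order until there are `target` of them."""
    if not examples:
        raise ValueError("need at least one annotated example")
    return list(islice(cycle(examples), max(target, len(examples))))


def to_csv(examples):
    """Serialize examples as CSV text. Column names are hypothetical."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text", "begin_offset", "end_offset", "type"])
    for ex in examples:
        writer.writerow([ex["text"], ex["begin"], ex["end"], ex["type"]])
    return buffer.getvalue()
```

Keep in mind that duplicated rows add no new information, so a model trained this way is likely to overfit to the few formulations it has seen. That is why this is a hack.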




solution1 0 2022-04-28 15:44:54
