
Moving away from simple regex extraction to NER?

We have a relatively "simple" project from the business: digitize some scanned contracts (PDF files) with OCR and extract entities from the text.

Entities can be something as simple as a specific price located in a certain subsection of the contract, or a generic definition of a process which can be found, e.g., somewhere around section 5. For the same entity, different formulations and languages are used interchangeably across contracts.

We have a limited number of examples (10 to 20 per entity) to develop the extraction algorithm.

Given the specific nature of every entity, for the moment we have created many functions that act on the strings extracted from the PDFs by amazon-textract. They use regex rules, plus some additional tinkering with the results, to get the things we need.
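For illustration, a minimal sketch of what one such function might look like. The entity, the pattern and the sample text are made up for this example and are not our actual rules; the input is assumed to be the plain text already returned by the OCR step.

```python
import re

# Hypothetical rule for a "total price" entity. The alternatives joined
# with "|" stand in for the different formulations and languages that
# appear in different contracts.
PRICE_PATTERN = re.compile(
    r"(?:total price|prix total|gesamtpreis)\s*[:=]?\s*"
    r"(?P<amount>\d[\d.,]*)\s*(?P<currency>EUR|USD|€|\$)",
    re.IGNORECASE,
)


def extract_price(text):
    """Return (amount, currency) for the first match in text, or None."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    # "Additional tinkering": drop punctuation that the amount group
    # may have swallowed at its end.
    amount = match.group("amount").rstrip(".,")
    return amount, match.group("currency")


# Made-up sample of OCR output.
sample = "Section 3.2 - Total price: 12,500.00 EUR payable within 30 days."
result = extract_price(sample)
```

Every entity gets its own function of this kind, which is why the number of functions grows quickly.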

This is the best solution so far for immediate results, but it is hard to modify when something is not working. Furthermore, only someone who knows the code can improve the results, basically by adding a new "or" alternative to the regex rule. Even that is annoying, because we have to go back into the code and find where things are failing. Of course, this is far from ideal.
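To make the "new or" step concrete, here is a rough sketch, assuming the alternatives for an entity are kept in a list and tried in order. The `RULES` table, the `notice_period` entity and the patterns are hypothetical; the point is only that covering a new formulation means adding one more alternative.

```python
import re

# Hypothetical rule table: entity name -> alternative regexes, tried in order.
# Each alternative must define a named group "value".
RULES = {
    "notice_period": [
        r"notice period of (?P<value>\d+) days",
        r"préavis de (?P<value>\d+) jours",
    ],
}


def extract(entity, text):
    """Return the captured value of the first alternative that matches, else None."""
    for pattern in RULES[entity]:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group("value")
    return None


# A German contract is not covered by the existing alternatives...
german = "Die Kündigungsfrist von 90 Tagen gilt ab Vertragsbeginn."
before = extract("notice_period", german)

# ...so someone has to add another alternative (the "new or").
RULES["notice_period"].append(r"kündigungsfrist von (?P<value>\d+) tagen")
after = extract("notice_period", german)
```

In our code the equivalent change means finding the right function and editing its regex by hand.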

I thought about using a Named Entity Recognition (NER) model trained on input from users, who could highlight the entities directly in the text. But given the limited training set, is it even possible to use such a method? I'm under the impression that, to get a consistent model, we need at least 100 examples per entity.

Is there a cleverer alternative to using just regex? Or, more generally, how do you think our pipeline could be improved?

Caveat - Hacky way!!

Duplicate the dataset. Annotate until it reaches 100, since that's the limit for AWS. Create a CSV file, and feed it to textract. Train the model.
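The duplication and CSV steps could look roughly like this. It is only a sketch: the column names `text` and `entity` and the helper names are placeholders, not the format AWS actually expects, so check the service documentation for the real layout before uploading anything.

```python
import csv
import io
from itertools import cycle, islice


def pad_by_duplication(examples, target=100):
    """Repeat the annotated examples in order until there are `target` of them."""
    if not examples:
        raise ValueError("need at least one annotated example")
    if len(examples) >= target:
        return list(examples)
    return list(islice(cycle(examples), target))


def to_csv(rows, fieldnames):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# 15 made-up annotated examples for a single entity (placeholder columns).
examples = [{"text": f"Total price: {i} EUR", "entity": str(i)} for i in range(15)]
padded = pad_by_duplication(examples, target=100)
csv_text = to_csv(padded, fieldnames=["text", "entity"])
```

Keep in mind why this is hacky: the duplicated rows only satisfy the count, they add no new information for the model.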
