How to recognize entities in text that is the output of optical character recognition (OCR)?
I am trying to do multi-class classification on textual data. The problem I am facing is that the text is unstructured.
I'll explain the problem with an example. Consider this image, for example:
I want to extract and classify the text information in the image. The problem is that when I extract it, the OCR engine gives output like this:
18
EURO 46
KEEP AWAY
FROM FIRE
MADE IN CHINA
2226249917581
7412501
DOROTHY
PERKINS
The target classes here are:
18 -> size
EURO 46 -> price
KEEP AWAY FROM FIRE -> usage_instructions
MADE IN CHINA -> manufacturing_location
2226249917581 -> product_id
7412501 -> style_id
DOROTHY PERKINS -> brand_name
The problem is that the input text is not cleanly separable: multiple lines can belong to the same class (KEEP AWAY and FROM FIRE above), and a single line can contain multiple classes.
So I don't know how to split or merge lines before passing them to a classification model.
Is there any way, using NLP, to split a paragraph based on the target classes? In other words, given an input paragraph, split it according to the target labels.
If you only consider the text, this is a Named Entity Recognition (NER) task.
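To make the NER framing concrete, here is a minimal sketch that joins the OCR lines from the question into one string and records each target as a character span with a label. It assumes line breaks can be replaced by single spaces and that the targets appear in reading order; the names `OCR_LINES`, `TARGETS`, and `to_spans` are hypothetical and used only for illustration.

```python
# OCR output from the question, one entry per line.
OCR_LINES = [
    "18", "EURO 46", "KEEP AWAY", "FROM FIRE", "MADE IN CHINA",
    "2226249917581", "7412501", "DOROTHY", "PERKINS",
]

# Target values and classes from the question, in reading order.
TARGETS = [
    ("18", "size"),
    ("EURO 46", "price"),
    ("KEEP AWAY FROM FIRE", "usage_instructions"),
    ("MADE IN CHINA", "manufacturing_location"),
    ("2226249917581", "product_id"),
    ("7412501", "style_id"),
    ("DOROTHY PERKINS", "brand_name"),
]


def to_spans(lines, targets):
    """Join lines with spaces and locate each target as (start, end, label)."""
    text = " ".join(lines)
    spans = []
    cursor = 0  # search forward only, so earlier matches are not reused
    for value, label in targets:
        start = text.find(value, cursor)
        if start == -1:
            raise ValueError(f"target {value!r} not found after offset {cursor}")
        end = start + len(value)
        spans.append((start, end, label))
        cursor = end
    return text, spans


text, spans = to_spans(OCR_LINES, TARGETS)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Because the labels attach to character spans rather than to lines, an entity that was split across two OCR lines becomes one span, and a line holding two classes simply contains two spans.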
What you can do is train a spaCy NER model for your particular problem.
Here is what you will need to do:
See the spaCy documentation on training custom NER models.
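As a rough illustration of what such training could look like, the sketch below uses the spaCy v3 Python API (`spacy.blank`, `add_pipe("ner")`, `Example.from_dict`, `nlp.initialize`, `nlp.update`); spaCy v2 uses a different training interface. The single training example is built from the OCR output in the question, with line breaks assumed to be replaced by spaces. The function name `train_ner` and its parameters are hypothetical, and a usable model would need many annotated labels, not one.

```python
import random

# One annotated example: (text, {"entities": [(start, end, label), ...]}).
# Offsets are character positions in the space-joined OCR output.
TRAIN_DATA = [
    (
        "18 EURO 46 KEEP AWAY FROM FIRE MADE IN CHINA "
        "2226249917581 7412501 DOROTHY PERKINS",
        {
            "entities": [
                (0, 2, "size"),
                (3, 10, "price"),
                (11, 30, "usage_instructions"),
                (31, 44, "manufacturing_location"),
                (45, 58, "product_id"),
                (59, 66, "style_id"),
                (67, 82, "brand_name"),
            ]
        },
    ),
]


def train_ner(train_data, n_iter=30, seed=0):
    """Train a blank English pipeline containing only an NER component."""
    # Imported lazily so the data above can be inspected without spaCy installed.
    import spacy
    from spacy.training import Example

    random.seed(seed)
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")

    # Register every label that occurs in the annotations.
    for _, annotations in train_data:
        for _, _, label in annotations["entities"]:
            ner.add_label(label)

    examples = [
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in train_data
    ]
    optimizer = nlp.initialize(lambda: examples)

    for _ in range(n_iter):
        random.shuffle(examples)
        losses = {}
        nlp.update(examples, sgd=optimizer, losses=losses)
    return nlp
```

After `nlp = train_ner(TRAIN_DATA)`, the predicted entities for a new label can be read with `[(ent.text, ent.label_) for ent in nlp(new_text).ents]`, where `new_text` is the space-joined OCR output of that label. For larger datasets, the config-driven `spacy train` command described in the spaCy documentation is the more usual route.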
Good luck!