How to identify entities in text that is the output of optical character recognition (OCR)?

Question

I am trying to do multi-class classification with textual data. The problem I am facing is that the data is unstructured. I'll explain the problem with an example. Consider this image, for example:

I want to extract and classify the text information given in the image. The problem is that when I extract the information, the OCR engine gives output something like this:

18
EURO 46
KEEP AWAY
FROM FIRE
MADE IN CHINA
2226249917581
7412501
DOROTHY
PERKINS

The target classes here are:

18 -> size
EURO 46 -> price
KEEP AWAY FROM FIRE -> usage_instructions
MADE IN CHINA -> manufacturing_location
2226249917581 -> product_id
7412501 -> style_id
DOROTHY PERKINS -> brand_name

The problem I am facing is that the input text is not separable: multiple lines can belong to the same class, and a single line can contain multiple classes.

So I don't know how to split or merge lines before passing them to a classification model.
Is there a way, using NLP, to split a paragraph based on the target class? In other words, given an input paragraph, split it according to the target labels.

Answer 1

If you only consider the text, this is a Named Entity Recognition (NER) task.
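To make the NER framing concrete, here is a minimal sketch (not part of the original answer) that turns the OCR lines from the question into one string plus character-offset entity spans. The helper name `build_example` and the choice to join fragments with a single space are assumptions for illustration. Because labels become spans over one string, an entity that covers several OCR lines is just a longer span, and a line containing two classes is just two spans.

```python
# Hypothetical helper: merge labeled OCR fragments into one text and
# record each entity as a (start, end, label) character span.
def build_example(groups):
    """groups: list of (list_of_text_fragments, label) in reading order."""
    text, spans = "", []
    for fragments, label in groups:
        chunk = " ".join(fragments)      # fragments of one entity, e.g. two OCR lines
        if text:
            text += " "                  # assumed separator between entities
        start = len(text)
        text += chunk
        spans.append((start, start + len(chunk), label))
    return text, spans


# The example from the question, grouped by target class.
groups = [
    (["18"], "size"),
    (["EURO 46"], "price"),
    (["KEEP AWAY", "FROM FIRE"], "usage_instructions"),
    (["MADE IN CHINA"], "manufacturing_location"),
    (["2226249917581"], "product_id"),
    (["7412501"], "style_id"),
    (["DOROTHY", "PERKINS"], "brand_name"),
]

text, spans = build_example(groups)
for start, end, label in spans:
    print(f"{text[start:end]!r} -> {label}")
```

A trained NER model would then be asked to predict such spans directly from the joined OCR text, so no separate split/merge step is needed before classification.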

What you can do is train a spaCy NER model for your particular problem.

Here is what you will need to do:

First gather a list of training text data
Label that data with the corresponding entity types
Split the data into a training set and a testing set
Train a model with spaCy NER using the training set
Score the model using the testing set
...
Profit!
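The data-handling steps above (labeling, splitting, scoring) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the answer author's code: the function names `train_test_split` and `span_prf` are hypothetical, the second labeled example is made up, and the `(text, {"entities": [...]})` layout is only one common way to hold span annotations, so check the spaCy documentation for the format your version expects. The training step itself is left to spaCy and is not shown.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Labeled data: the first entry comes from the question, the second is invented.
labeled = [
    ("18 EURO 46", {"entities": [(0, 2, "size"), (3, 10, "price")]}),
    ("MADE IN CHINA 12", {"entities": [(0, 13, "manufacturing_location"),
                                       (14, 16, "size")]}),
]

train_set, test_set = train_test_split(labeled, test_fraction=0.5)

# Scoring: compare gold spans with spans a model might have predicted.
gold = [(0, 2, "size"), (3, 10, "price")]
predicted = [(0, 2, "size"), (3, 10, "brand_name")]  # hypothetical model output
print(span_prf(gold, predicted))
```

Exact-match scoring is strict: a span with the right boundaries but the wrong label counts as both a false positive and a false negative.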

See the spaCy documentation on training custom NER models.

Good luck!

Solution 1
5 Accepted 2019-03-05 13:21:48
