
How to OCR email address

Original post: 2014-10-30 06:17:28. Tags: c#, image-processing, ocr, tesseract, emgucv

I am trying to OCR images and extract the email address from them. Each image is supposed to contain a single line of text, which is the email address.

I am using EmguCV.OCR to extract the text (the email address) from those images. The target is a 100% accurate result.

We can fix the font and size of the text, for example Arial 12pt, so that every image has the email address written in Arial 12pt, black on a white background.

The problem is that Tesseract OCR in EmguCV is not recognizing the text properly. It recognizes only 80% of the characters accurately.

I am using the Leptonica library for preprocessing.

Here are some sample images I am trying to recognize.

Is there any way to achieve the target of 100% accuracy?

2 answers

With those sample images I can suggest two ways to solve the problem. The images contain JPEG artifacts (the result of lossy compression). Because of this, the letters are becoming connected to each other; zoom in on an image in a program that shows the actual pixels to see it (Windows Photo Viewer worked fine for me). Tesseract OCR relies on the spacing between letters (it uses connected components) to do character recognition. Having any pieces connected throws off the recognition process, which means it tries to recognize the combination "co" as one letter.
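To make the connected-components point concrete, here is a minimal sketch in plain Python. It is a toy 4-connected labelling on a tiny binary grid, not Tesseract's actual segmentation code, and the names `count_components`, `separate` and `touching` are made up for the illustration. It only shows how a single bridging pixel turns two glyph blobs into one.

```python
from collections import deque

def count_components(grid):
    """Count 4-connected groups of ink pixels (1 = ink, 0 = background)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Found an unvisited ink pixel: flood-fill its whole blob.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

# Two toy "glyphs" separated by one background column.
separate = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]

# The same glyphs with a single bridging pixel in the gap.
touching = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

print(count_components(separate))  # 2
print(count_components(touching))  # 1
```

In the second grid a recognizer that works per component would be handed one merged blob instead of two letters, which is the "co" failure described above.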

Two possible solutions:

1. I'm not sure what preprocessing steps are already being done, but you'll want to do some thresholding to remove the lighter shades in the image (disconnecting the characters). However, you have to be careful with this, as it may remove more than you want.
2. If at any point in this process you have a higher-resolution image, or one in a lossless format (e.g. PNG rather than JPEG), then keep it in that format as you do the other processing steps. Try to avoid any lossy compression that might happen. It sounds like these images don't come to you as shown above. This is the preferable solution, as you won't risk losing data.
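The thresholding idea in the first option can be sketched as follows. This is plain Python on a made-up 3x5 grayscale grid, not the Leptonica or EmguCV API; the helper names `threshold` and `has_gap_column` and the cutoff values 230 and 128 are assumptions chosen for the illustration. The gray value 200 stands in for the faint halo that lossy compression leaves between two letters.

```python
def threshold(gray, cutoff):
    """Binarize: 1 (ink) where a pixel is darker than cutoff, else 0."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]

def has_gap_column(binary):
    """True if some column is pure background, i.e. the glyphs are separated."""
    width = len(binary[0])
    return any(all(row[c] == 0 for row in binary) for c in range(width))

# 0 = black ink, 255 = white paper, 200 = faint compression halo in the gap.
gray = [
    [0, 0, 255, 0, 0],
    [0, 0, 200, 0, 0],
    [0, 0, 255, 0, 0],
]

lenient = threshold(gray, 230)  # the halo pixel counts as ink
strict = threshold(gray, 128)   # the halo pixel is dropped

print(has_gap_column(lenient))  # False: the letters stay bridged
print(has_gap_column(strict))   # True: the letters are disconnected
```

The caveat from the answer applies to the same function: a thin, legitimately gray stroke (say a pixel of value 150) would also vanish at a cutoff of 128, so the cutoff has to be tuned against real samples.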

I tried to recognize your images with ABBYY Cloud OCR SDK and got 100% accuracy. You can use the Demo Tool to check the recognition accuracy yourself.

I work for ABBYY and can give you more information about our technologies if you need it.

OCR result